[Ffmpeg-devel] Re: fastmemcpy in ffmpeg

Tue Sep 26 18:07:58 CEST 2006

On Tue, Sep 26, 2006 at 12:30:10PM +0200, Gunnar von Boehn wrote:
> >No, multimedia apps profit and everyone else loses. fastmemcpy is
> >several times slower for tiny copies, which are the only thing that
> >_normal_ apps ever do. The only type of memcpy that belongs in libc is
> >the ultra-trivial implementation which (on the i386 family) happens to
> >also be the fastest implementation that works on all cpu generations.
> >Anything like fastmemcpy requires either cpu-specific libc or runtime
> >cpudetect, the former of which is probably not acceptable for most
> >users and the latter of which will be horribly slow for the common
> >cases...
> 
> I have to disagree, politely.
> 
> - A CPU-optimized version will easily be faster
>   than the normal version for sizes larger than 64/128 bytes.
> 
> - An optimized version will be about twice as fast
>   for sizes larger than 500 bytes / 1 KB.

Proof???
Anyway keep in mind, many many uses of memcpy are MUCH smaller than
this, sometimes only 4-8 bytes! For example, qsort needs memcpy.
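
To illustrate the kind of tiny copy meant here, a minimal sketch of the element swap that a generic sort routine has to perform. The name swap_elems and the 64-byte scratch buffer are hypothetical, not taken from any particular libc; the two elements are assumed not to overlap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: swap two non-overlapping elements of `size` bytes.
 * A generic sort only knows the element size at runtime, so every swap
 * goes through memcpy, typically with a size of 4 or 8 bytes. */
void swap_elems(void *a, void *b, size_t size)
{
    unsigned char tmp[64];          /* scratch buffer; size is an arbitrary choice */
    unsigned char *pa = a;
    unsigned char *pb = b;

    while (size > 0) {
        size_t chunk = size < sizeof tmp ? size : sizeof tmp;

        memcpy(tmp, pa, chunk);     /* three small copies per swap */
        memcpy(pa, pb, chunk);
        memcpy(pb, tmp, chunk);

        pa += chunk;
        pb += chunk;
        size -= chunk;
    }
}
```

Sorting an array of int with such a helper issues three 4-byte memcpy calls per swap, so any fixed per-call overhead is paid over and over.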

> - The added overhead for every memcpy call is just one "if (size > 128) {".
>   If you tune this branch so that it defaults (falls through)
>   to the small-size routine, then the "if" costs
>   1 clock or less on many CPUs. This overhead is totally
>   negligible.

There are already several conditionals for handling unaligned copies
and such. Adding yet more code size means you use more cache lines,
which is unacceptable. Increased code size is the leading cause of
unpredictable performance loss, and core library code that will be
called from all sorts of situations must especially be kept to minimum
size.
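
For reference, the size dispatch being argued about could look roughly like the sketch below. All names (dispatch_memcpy, copy_small, copy_large, BIG_COPY_THRESHOLD) are hypothetical, copy_large is only a placeholder for a CPU-tuned routine, and the branch hint assumes GCC or Clang.

```c
#include <stddef.h>
#include <string.h>

#define BIG_COPY_THRESHOLD 128      /* from the quoted "if (size > 128)" */

#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Placeholder for a CPU-tuned bulk copy (MMX, prefetch, ...).
 * It simply forwards to memcpy in this sketch. */
void *copy_large(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Plain byte loop standing in for the small-copy path. */
void *copy_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* The extra compare-and-branch is paid on every call, and the large
 * path adds code that has to live somewhere in the instruction cache. */
void *dispatch_memcpy(void *dst, const void *src, size_t n)
{
    if (UNLIKELY(n > BIG_COPY_THRESHOLD))
        return copy_large(dst, src, n);
    return copy_small(dst, src, n);     /* predicted fall-through path */
}
```

The sketch shows both sides of the argument: the check itself is a single predictable branch, but the routine behind it is no longer a handful of instructions.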

> Please mind that the ultra-trivial implementation is only
> the fastest implementation for CPUs without any 2nd-level cache.
> It's really slow for CPUs with a 2nd-level cache.

This is not true at all. The trivial implementation is "rep movsd" and
is extremely fast, just not quite as fast as mmx/prefetch tricks.
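
A minimal sketch of such a "rep movsd" copy, assuming GCC or Clang inline assembly on x86 (AT&T syntax spells the instruction "movsl"). The name trivial_memcpy is hypothetical, this is not the code of any particular libc, and other targets fall back to a byte loop.

```c
#include <stddef.h>

/* Hypothetical name. Copies n bytes from src to dst (non-overlapping). */
void *trivial_memcpy(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    void *d = dst;
    const void *s = src;
    size_t dwords = n / 4;          /* whole 32-bit words */
    size_t tail = n % 4;            /* 0..3 leftover bytes */

    /* "rep movsl" is the AT&T spelling of Intel's "rep movsd":
     * copy `dwords` 32-bit words from [SI] to [DI]. The ABI
     * guarantees the direction flag is clear on function entry. */
    __asm__ volatile ("rep movsl"
                      : "+D" (d), "+S" (s), "+c" (dwords)
                      :
                      : "memory");

    /* Copy the remaining bytes; d and s were advanced by the first asm. */
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (s), "+c" (tail)
                      :
                      : "memory");
#else
    /* Portable fallback for non-x86 targets. */
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
#endif
    return dst;
}
```

The whole routine is a few instructions with no CPU detection, which is the code-size argument made above.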

> If you want to see examples of very efficient handling of such cases
> and of how to install optimized routines at runtime, then please have a
> look at the source of Mac OS X.

Sounds like a good place to find very very bad code...

Rich