[Ffmpeg-devel] Re: fastmemcpy in ffmpeg

Gunnar von Boehn gunnar
Wed Sep 27 12:49:16 CEST 2006


Rich,

You seem to take this technical question personally.
Please don't do this.

I was stating known facts; take them or leave them.
You don't have to believe me: you can verify everything in a CPU manual,
or you can get the same numbers from companies like AMD or Intel.


Rich Felker wrote:
> On Tue, Sep 26, 2006 at 12:30:10PM +0200, Gunnar von Boehn wrote:
> 
>>>No, multimedia apps profit and everyone else loses. fastmemcpy is
>>>several times slower for tiny copies, which are the only thing that
>>>_normal_ apps ever do. The only type of memcpy that belongs in libc is
>>>the ultra-trivial implementation which (on the i386 family) happens to
>>>also be the fastest implementation that works on all cpu generations.
>>>Anything like fastmemcpy requires either cpu-specific libc or runtime
>>>cpudetect, the former of which is probably not acceptable for most
>>>users and the latter of which will be horribly slow for the common
>>>cases...
>>
>>I have to disagree, politely.
>>
>>- A CPU-optimized version will easily be faster
>>  than the normal version for sizes above 64/128 bytes.
>>
>>- An optimized version will be about twice as fast
>>  for sizes above 500 bytes / 1 KB.
> 
> 
> Proof???

Don't get silly.
For a start, I've written benchmarks for several CPUs in this regard. But 
you don't need to trust me; take a look at the recommendations of AMD 
and Intel.
Here is a link to AMD's recommended memcpy:
http://www.greyhound-data.com/gunnar/glibc/memcpy_amd.cpp


> Anyway keep in mind, many many uses of memcpy are MUCH smaller than
> this, sometimes only 4-8 bytes! For example, qsort needs memcpy.

Yes, the typical usage of memcpy ranges from 1 byte to many megabytes.
Mind that memcpy as a routine has a setup cost and a cost per copied byte. 
If you increase the setup cost by 10% but reduce the per-byte cost by 
50%, this might hurt performance for tiny copies of 1-8 bytes, but it 
will give you a big advantage on longer copies.
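The trade-off above can be sketched as a size-threshold dispatch. This is a hypothetical layout, not code from glibc or ffmpeg; the 128-byte cutoff and the names `my_memcpy`/`copy_large` are illustrative assumptions:

```c
#include <stddef.h>
#include <string.h>

/* Placeholder for a streaming/prefetching copy loop; a real
 * implementation would be CPU-specific. */
static void *copy_large(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

void *my_memcpy(void *dst, const void *src, size_t n)
{
    if (n > 128)                  /* one predictable branch */
        return copy_large(dst, src, n);

    /* Fall-through path: trivial byte copy for small sizes,
     * so tiny copies pay almost no setup cost. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```

The point is that the small-size path is the fall-through: a tiny copy pays one well-predicted branch, while a large copy amortizes any extra setup over thousands of bytes.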



>>- The added overhead for all memcpy is just one " if( size>128 ){ "
>>  If you tune this branch so that it defaults (falls through)
>>  to the small-size routine, you can get this "if" down to
>>  1 clock or less on many CPUs. The overhead for this is totally 
>>negligible.
> 
> 
> There are already several conditionals for handling unaligned copies
> and such. Adding yet more code size means you use more cache lines,
> which is unacceptable. Increased code size is the leading cause of
> unpredictable performance loss, and core library code that will be
> called from all sorts of situations must especially be kept to minimum
> size.

While code size is a factor, you are forgetting two things:

A CPU-optimized memcpy might increase the code size of the core library by 
something in the range of 16-64 bytes. But what impact does loading an 
extra 32 bytes of code have when you are copying 1 KB, 10 KB or 1 MB of data?
Answer: none.


An optimized streaming memcpy will more than make up for the extra code, 
as it does not suffer from the memory read stalls and bus bubbles that 
a naive routine always does. This makes a 100% difference in speed 
for longer copies.

A trivial copy, like the one you favor, has the side effect of trashing the 
2nd-level cache. You will lose your cached data and code, which of course 
has a very negative impact on overall performance.
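A minimal sketch of the streaming idea, assuming GCC/Clang's `__builtin_prefetch`. A real x86 implementation (such as AMD's recommended one linked above) would additionally use MMX/SSE non-temporal stores like movntq so the destination never evicts useful cache lines; that part is omitted here, so this only illustrates hiding read latency:

```c
#include <stddef.h>

/* Copy in cache-line-sized chunks while prefetching ahead of use,
 * so the loads do not stall waiting on memory. */
void *prefetch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 64) {
        __builtin_prefetch(s + 256);  /* request a line well ahead */
        for (int i = 0; i < 64; i++)
            d[i] = s[i];
        d += 64;
        s += 64;
        n -= 64;
    }
    while (n--)                       /* tail: plain byte copy */
        *d++ = *s++;
    return dst;
}
```

The prefetch distance (256 bytes here) is an assumption; the right value depends on memory latency and would be tuned per CPU.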



>>Please mind that the ultra trivial implementation is only
>>the fastest implementation for CPUs without any 2nd level cache.
>>Its real slow for CPUs with 2nd level cache.
> 
> 
> This is not true at all. The trivial implementation is "rep movsd" and
> is extremely fast, just not quite as fast as mmx/prefetch tricks.

For tiny copies smaller than ONE CPU cache line, movsd is good.
It's a fact that for bigger copies (of 1 KB or more) you will only 
achieve about 25%-60% of the performance of a streaming-optimized copy.

But you don't need to believe me; simply ask or trust AMD, Intel or 
IBM on these numbers.

What's the point in buying DDR400 or faster memory
if your memcpy cripples the bus transfer rate to PC133 speed?


>>If you want to see examples of very efficient handling of such cases 
>>and how to install optimized routines at runtime, then please have a look 
>>at the source of Mac OS X.
> 
> 
> Sounds like a good place to find very very bad code...

I appreciate your sense of humor. ;-)


Cheers
Gunnar



