[Ffmpeg-devel] fastmemcpy in ffmpeg

Wed Sep 27 14:05:00 CEST 2006

Hi Ulrich,

Ulrich von Zadow wrote:

> I couldn't find the optimized intel code in the downloads - is it
> possible to get it?

Sure, for a start here is AMD's recommended memcpy routine:
http://www.greyhound-data.com/gunnar/glibc/memcpy_amd.cpp
AMD's routine is very good, and they even explain the pros and cons a bit.

I'll try to explain some of the reasons in a bit more detail.
A trivial memcpy routine like this one is bad for a number of reasons.
Example: trivial byte copy
void byte_copy(const char *source, char *destination, int size) {
     int j;
     /* copy one byte at a time from source to destination */
     for (j = 0; j < size; j++) destination[j] = source[j];
}

The main drawbacks are:

- Because of the way a typical copy-back cache architecture works,
the destination data will first be read into the CPU, only to be
discarded and immediately overwritten. So if you are copying 1 KB from
address $A to $B, the CPU will in fact load 1 KB from $A, load 1 KB
from $B, and save 1 KB to $B.
So you are transferring 3 KB over the bus instead of 2 KB.

- The memory bus will be badly used.
A trivial memcpy will usually throttle itself with read stalls, causing
'bubbles' on the memory bus. If you stream the source data some cache
lines in advance, you will usually achieve up to 3 times better memory
read performance and up to 2 times better memory copy performance.
A few CPUs are very smart and will automatically initiate memory
streaming ahead of the reads. For example, the PPC970 (G5) can do this.
It will always stream, with or without manual initiation of streaming.
For the majority of CPUs (like the AMD Athlon family or the PPC 7400
G4), streaming by software will double the bus performance.

- Cache trashing: a trivial memcpy will overwrite your CPU's first-level
data cache and your second-level cache. If you are copying bigger chunks
of data which you don't need to work on immediately after the copy, like
copying to a network device or to a graphics card, then this cache
trashing has a very negative effect on system performance. The
(negative) effect of the cache trashing is easily overlooked when you
just benchmark the execution time of the memcpy.
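
To make the three points above a bit more concrete, here is a minimal
sketch of a copy loop that prefetches the source a few cache lines ahead
and writes the destination with non-temporal stores, so the destination
lines are neither read first nor left in the cache. This is not AMD's
routine and not code from ffmpeg. The name stream_copy and the prefetch
distance of 256 bytes are my own assumptions for illustration, and the
buffers are assumed not to overlap. On targets without SSE2 it simply
falls back to memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also pulls in xmmintrin.h */
#endif

/* Assumed value: how far ahead of the current read position to prefetch. */
#define PREFETCH_DISTANCE 256

/* Hypothetical helper for illustration; buffers must not overlap. */
void stream_copy(void *destination, const void *source, size_t size)
{
    unsigned char *d = destination;
    const unsigned char *s = source;
#if defined(__SSE2__)
    /* Non-temporal stores need a 16-byte aligned destination, so copy
       the unaligned head with a plain memcpy first. */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    if (head > size) head = size;
    memcpy(d, s, head);
    d += head; s += head; size -= head;

    while (size >= 64) {
        /* Ask for source data some cache lines ahead of the loads. */
        if (size > PREFETCH_DISTANCE)
            _mm_prefetch((const char *)s + PREFETCH_DISTANCE, _MM_HINT_NTA);

        __m128i a = _mm_loadu_si128((const __m128i *)(s +  0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

        /* Streaming stores bypass the cache and avoid reading the
           destination line before overwriting it. */
        _mm_stream_si128((__m128i *)(d +  0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        d += 64; s += 64; size -= 64;
    }
    /* Make the streaming stores visible before any later stores. */
    _mm_sfence();
#endif
    /* Remaining tail (or the whole copy when SSE2 is not available). */
    memcpy(d, s, size);
}
```

A loop like this only pays off for larger blocks that you don't touch
again right after the copy. For small copies, or for data you are going
to process immediately, a normal cached copy is usually the better
choice.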

If you benchmark different routines, please make sure to run them under
the same conditions. Some routines benefit more from a hot cache than
others. So don't forget to set up the CPU cache in a realistic way, or
to clean it before testing each routine. Sometimes people forget this
when benchmarking. :-)
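
One simple way to "clean" the cache before each run is to walk over a
scratch buffer that is bigger than the cache. The following is only a
rough sketch: time_cold_copy, evict_cache and the 32 MB scratch size are
names and values I made up for illustration, and the scratch buffer is
assumed to be larger than your last-level cache. It uses the C11
timespec_get() wall clock, which is coarse, so use reasonably large
copies and several runs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Same shape as memcpy, so memcpy itself can be passed in. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Assumed to exceed the last-level cache size of the test machine. */
#define EVICT_BYTES (32u * 1024u * 1024u)

/* Touch one byte per 64-byte line so earlier cache contents get pushed
   out. volatile keeps the compiler from dropping the writes. */
static void evict_cache(unsigned char *scratch)
{
    volatile unsigned char *p = scratch;
    for (size_t i = 0; i < EVICT_BYTES; i += 64)
        p[i]++;
}

/* Returns the average seconds per copy, or -1.0 on error. */
double time_cold_copy(copy_fn fn, void *dst, const void *src,
                      size_t n, int runs)
{
    unsigned char *scratch = malloc(EVICT_BYTES);
    if (scratch == NULL || runs <= 0) {
        free(scratch);
        return -1.0;
    }
    memset(scratch, 0, EVICT_BYTES);

    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;

        evict_cache(scratch);          /* same cold start for every run */
        timespec_get(&t0, TIME_UTC);
        fn(dst, src, n);
        timespec_get(&t1, TIME_UTC);

        total += (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }
    free(scratch);
    return total / runs;
}
```

Calling time_cold_copy(memcpy, dst, src, n, runs) and then the same with
your own routine gives both the same starting conditions. Note that this
still only measures the copy itself and not the cost that the cache
trashing causes for the code running afterwards.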

I hope that the link is helpful for you.

Cheers
Gunnar