[Ffmpeg-devel] fastmemcpy in ffmpeg
Gunnar von Boehn
gunnar
Wed Sep 27 14:05:00 CEST 2006
Hi Ulrich,
Ulrich von Zadow wrote:
> I couldn't find the optimized intel code in the downloads - is it
> possible to get it?
Sure. For a start, here is AMD's recommended memcpy routine. It is very
good, and they even explain the pros and cons a bit:
http://www.greyhound-data.com/gunnar/glibc/memcpy_amd.cpp
I'll try to explain some of the reasons in a bit more detail.
A trivial memcpy routine like this one is bad for a number of reasons.
Example Trivial Byte Copy
/* Copies size bytes from source to destination, one byte at a time. */
void byte_copy(const char *source, char *destination, int size) {
    int j;
    for (j = 0; j < size; j++) destination[j] = source[j];
}
The main drawbacks are:
- Because of the way a typical copy-back cache architecture works,
the destination data will first be read into the CPU, only to be discarded
and immediately overwritten. So if you are copying 1 KB from address $A
to $B, the CPU will in fact load 1 KB from $A, load 1 KB from $B,
and save 1 KB to $B.
So you are transferring 3 KB over the bus instead of 2 KB.
- The memory bus will be badly used.
A trivial memcpy will usually throttle itself with read stalls, causing
'bubbles' on the memory bus. If you stream the source data some cache
lines in advance, you will usually achieve up to 3 times better memory
read performance and up to 2 times better memory copy performance.
A few CPUs are very smart and will initiate read-ahead memory streaming
automatically. For example, the PPC970 (G5) can do this. It will always
stream, with or without manual initiation of streaming. For the majority
of CPUs (like the AMD Athlon family or the PPC 7400 G4), streaming by
software will double the bus performance.
- Cache trashing: a trivial memcpy will overwrite your CPU's first-level
data cache and your second-level cache. If you are copying bigger chunks of
data which you don't need to work on immediately after the copy, like
copying to a network device or to a GFX card, then this cache trashing
has a very negative effect on system performance. The (negative)
effect of the cache trashing is easily overlooked when you just
benchmark the execution time of the memcpy.
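To make the three points above concrete, here is a minimal sketch of a
copy loop that prefetches the source ahead of use and writes the
destination with non-temporal stores, so that destination lines are not
loaded first and the copied data does not displace the cache contents.
This is not the AMD routine from the link. The names streaming_copy and
PREFETCH_DISTANCE are made up for illustration, the 256-byte distance is
only an assumed starting value, and the fast path assumes an x86 compiler
with SSE2 (otherwise it simply falls back to memcpy). The buffers must not
overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also brings in _mm_prefetch and _mm_sfence */
#endif

/* Assumed prefetch distance in bytes (four 64-byte cache lines); tune per CPU. */
#define PREFETCH_DISTANCE 256

/* Hypothetical helper: copies size bytes between non-overlapping buffers. */
void streaming_copy(void *destination, const void *source, size_t size)
{
    unsigned char *d = destination;
    const unsigned char *s = source;

#if defined(__SSE2__)
    /* Copy leading bytes until d is 16-byte aligned, as _mm_stream_si128 requires. */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    if (head > size)
        head = size;
    memcpy(d, s, head);
    d += head;
    s += head;
    size -= head;

    while (size >= 64) {
        /* Request source data ahead of use so the loads below stall less. */
        if (size >= PREFETCH_DISTANCE + 64)
            _mm_prefetch((const char *)(s + PREFETCH_DISTANCE), _MM_HINT_NTA);

        __m128i a = _mm_loadu_si128((const __m128i *)(s));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

        /* Non-temporal stores: written out without first reading the
           destination lines into the cache. */
        _mm_stream_si128((__m128i *)(d), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        d += 64;
        s += 64;
        size -= 64;
    }
    _mm_sfence();   /* order the streamed stores before any later stores */
#endif

    /* Remaining tail bytes (or the whole copy when SSE2 is unavailable). */
    memcpy(d, s, size);
}
```

A routine like this only pays off for larger blocks that are not needed
in the cache right after the copy. For small copies, or for data you
process immediately afterwards, a normal cached copy is likely the better
choice.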
If you benchmark different routines, please make sure to run them under
the same conditions. Some routines benefit more from a hot cache
than others. So don't forget to set up the CPU cache in a realistic way,
or to flush it before testing each routine. Sometimes people forget
this when benchmarking. :-)
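One simple way to get comparable cold-cache conditions is sketched below:
before each timed call, write through a scratch buffer that is assumed to
be larger than the last-level cache, so that whatever the previous run
left behind is pushed out. The names evict_caches, time_cold_copy and
libc_copy, the 32 MB scratch size and the 64-byte stride are all
assumptions for illustration, and clock() is fairly coarse, so in
practice you would copy a large block or repeat the measurement.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Common signature for the routines under test (hypothetical). */
typedef void (*copy_fn)(void *destination, const void *source, size_t size);

/* Assumed to exceed the last-level cache size; adjust for the machine. */
#define SCRATCH_SIZE ((size_t)32 * 1024 * 1024)

/* Push earlier data out of the caches by writing one byte per assumed
   64-byte cache line of a large scratch buffer.
   Returns 0 on success, -1 if the buffer could not be allocated. */
static int evict_caches(void)
{
    volatile unsigned char *scratch = malloc(SCRATCH_SIZE);
    size_t i;

    if (scratch == NULL)
        return -1;
    for (i = 0; i < SCRATCH_SIZE; i += 64)
        scratch[i] = (unsigned char)i;
    free((void *)scratch);
    return 0;
}

/* Time one call of copy starting from a cold cache.
   Returns CPU seconds, or -1.0 if the cache could not be evicted. */
double time_cold_copy(copy_fn copy, void *destination,
                      const void *source, size_t size)
{
    clock_t start, end;

    if (evict_caches() != 0)
        return -1.0;
    start = clock();
    copy(destination, source, size);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Adapter so the C library memcpy fits the copy_fn signature. */
void libc_copy(void *destination, const void *source, size_t size)
{
    memcpy(destination, source, size);
}
```

Calling time_cold_copy(libc_copy, dst, src, size) and then the same with
another routine gives each of them the same starting point. To measure
the hot-cache case instead, run the routine once untimed and then time a
second call without the eviction step.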
I hope that the link is helpful to you.
Cheers
Gunnar