[FFmpeg-devel] YMM registers with AMD

James Almer jamrial at gmail.com
Sun Feb 9 03:13:52 CET 2014


I was running some tests today, and among them I checked the performance of
fft-test. I got these surprising results on an AMD FX 6300:

$ libavcodec/fft-test.exe -s
FFT 512 test
Checking...
max:0.000008 e:1.25487e-006
Speed test...
113550 decicycles in fft_calc, 1 runs, 0 skips
148390 decicycles in fft_calc, 2 runs, 0 skips
138262 decicycles in fft_calc, 4 runs, 0 skips
121755 decicycles in fft_calc, 8 runs, 0 skips
112749 decicycles in fft_calc, 16 runs, 0 skips
108126 decicycles in fft_calc, 32 runs, 0 skips
108488 decicycles in fft_calc, 64 runs, 0 skips
105902 decicycles in fft_calc, 128 runs, 0 skips
104645 decicycles in fft_calc, 256 runs, 0 skips
103981 decicycles in fft_calc, 512 runs, 0 skips
103664 decicycles in fft_calc, 1024 runs, 0 skips
103484 decicycles in fft_calc, 2047 runs, 1 skips
103562 decicycles in fft_calc, 4095 runs, 1 skips
103536 decicycles in fft_calc, 8191 runs, 1 skips
103574 decicycles in fft_calc, 16383 runs, 1 skips
103526 decicycles in fft_calc, 32767 runs, 1 skips
103487 decicycles in fft_calc, 65535 runs, 1 skips
103464 decicycles in fft_calc, 131071 runs, 1 skips
103395 decicycles in fft_calc, 262142 runs, 2 skips
103387 decicycles in fft_calc, 524286 runs, 2 skips
time: 3.2 us/transform [total time=1.65 s its=524288]

$ libavcodec/fft-test.exe -s -c -avx
FFT 512 test
Checking...
max:0.000008 e:1.25487e-006
Speed test...
64210 decicycles in fft_calc, 1 runs, 0 skips
62365 decicycles in fft_calc, 2 runs, 0 skips
59057 decicycles in fft_calc, 4 runs, 0 skips
55721 decicycles in fft_calc, 8 runs, 0 skips
61591 decicycles in fft_calc, 16 runs, 0 skips
56912 decicycles in fft_calc, 32 runs, 0 skips
54546 decicycles in fft_calc, 64 runs, 0 skips
52817 decicycles in fft_calc, 128 runs, 0 skips
52512 decicycles in fft_calc, 256 runs, 0 skips
51837 decicycles in fft_calc, 512 runs, 0 skips
51416 decicycles in fft_calc, 1024 runs, 0 skips
51276 decicycles in fft_calc, 2048 runs, 0 skips
51332 decicycles in fft_calc, 4095 runs, 1 skips
51665 decicycles in fft_calc, 8190 runs, 2 skips
51757 decicycles in fft_calc, 16381 runs, 3 skips
51902 decicycles in fft_calc, 32763 runs, 5 skips
51702 decicycles in fft_calc, 65531 runs, 5 skips
51581 decicycles in fft_calc, 131067 runs, 5 skips
51692 decicycles in fft_calc, 262133 runs, 11 skips
51789 decicycles in fft_calc, 524275 runs, 13 skips
51621 decicycles in fft_calc, 1048548 runs, 28 skips
time: 1.7 us/transform [total time=1.76 s its=1048576]

The latter is with AVX disabled, which means it's running the SSE version of 
fft_calc, yet it's about twice as fast (roughly 51600 vs. 103400 decicycles).
The same test run on a Sandy Bridge gave me about 32000 decicycles for the SSE 
version and, as expected, about 18000 for the AVX one.
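
For reference, the ratios can be worked out from the final averages quoted
above. This is only a small sketch; the helper names decicycles_to_cycles()
and speed_ratio() are made up for illustration and are not part of fft-test:

```c
/* A decicycle is a tenth of a cycle. */
static double decicycles_to_cycles(double decicycles)
{
    return decicycles / 10.0;
}

/* Ratio of two timings taken in the same unit. A result above 1.0
 * means the first argument is the slower of the two. */
static double speed_ratio(double slower, double faster)
{
    return slower / faster;
}
```

With the numbers above, speed_ratio(103387, 51621) is a bit over 2.0 on the
FX 6300 (AVX slower than SSE), while speed_ratio(32000, 18000) is close to 1.8
on the Sandy Bridge (SSE slower than AVX).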

At first I thought there was something really wrong going on with this AMD box, 
but then I remembered Jason Garrett-Glaser mentioning some years ago that 
Bulldozer lacks 256-bit execution units even though it supports AVX.
The FX 6300 I used is a Piledriver (Bulldozer's first refresh), and judging 
by the results above the situation doesn't seem to have changed.
I have no idea about Steamroller, which was recently released.

We don't seem to use YMM registers in many functions (fft, and some float stuff 
from swresample, avresample, lavfi and lavu), so this isn't much of a problem, 
but it may be a good idea to introduce an AV_CPU_FLAG_AVXSLOW flag to disable 
the AVX version of functions using YMM registers when there is also a version 
using XMM registers that is confirmed to be faster on these CPUs.
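
To make the idea concrete, here is a minimal sketch of how such a flag could
gate function selection. Everything in it is hypothetical: the flag names and
bit values, the fft_impl enum and select_fft_impl() are made up for
illustration and are not existing FFmpeg API or the actual init code.

```c
/* Hypothetical flag bits, for illustration only (not FFmpeg's real values). */
#define CPU_FLAG_SSE      0x1
#define CPU_FLAG_AVX      0x2
#define CPU_FLAG_AVXSLOW  0x4  /* AVX is supported, but YMM code is slower than XMM */

/* Hypothetical identifiers for the available fft_calc versions. */
typedef enum { FFT_IMPL_C, FFT_IMPL_SSE, FFT_IMPL_AVX } fft_impl;

static fft_impl select_fft_impl(int cpu_flags)
{
    fft_impl impl = FFT_IMPL_C;

    if (cpu_flags & CPU_FLAG_SSE)
        impl = FFT_IMPL_SSE;

    /* Pick the YMM version only when AVX is present and not marked slow,
     * so a CPU flagged as slow keeps the XMM version chosen above. */
    if ((cpu_flags & CPU_FLAG_AVX) && !(cpu_flags & CPU_FLAG_AVXSLOW))
        impl = FFT_IMPL_AVX;

    return impl;
}
```

Under these assumptions, a CPU reporting SSE and AVX plus the slow flag would
end up with the SSE version, while a Sandy Bridge (no slow flag) would still
get the AVX one.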

Any comments?

