[FFmpeg-devel] [Issue 664] [PATCH] Fix AAC PNS Scaling

Uoti Urpala uoti.urpala
Wed Oct 8 06:02:22 CEST 2008


On Wed, 2008-10-08 at 04:30 +0200, Michael Niedermayer wrote:
> On Wed, Oct 08, 2008 at 04:45:52AM +0300, Uoti Urpala wrote:
> > I tested a simple loop doing "sum += 1/sqrtf(i)" on core2. As expected,
> > "1./ff_sqrt(i)" is the slowest way to calculate that. Standard
> > "1./sqrtf(i)" is equally fast with default flags and somewhat faster
> > with -ffast-math (and has better accuracy). The code from Alex is about
> > twice as fast.
> 
> on a Pentium Dual  @ 1.73GHz
> 
> ff_sqrt() is, as expected, much faster than sqrtf(). I am rather surprised
> about your results; maybe you could post your test code?

> 443217610 dezicycles in ff_sqrt, 64 runs, 0 skips
> 664184340 dezicycles in sqrtf, 64 runs, 0 skips
> 322520999 dezicycles in sqrt alex, 64 runs, 0 skips
> 3318.429688 3318.004883 3314.035156
>
> One can also see here that Alex's code is about a factor of 10 less accurate.
> Also, one has to keep in mind that these are synthetic tests, and we really
> should be testing with the AAC code.

Note that the final sums printed here are all completely wrong: a float
runs out of precision with 100 repetitions of the loop and stops
accumulating the smaller values. The relative accuracy of the three
versions is still similar with double sums, but starting each loop from 2
rather than 1 would give a much worse result for ff_sqrt (it calculates
1/sqrt(1) exactly, but returning 1 for 1/sqrt(2) is a big error).
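
To make both effects concrete, here is a minimal sketch in C. All function
names are hypothetical, and isqrt_floor() is only a stand-in for an integer
square root that rounds down; it is not FFmpeg's ff_sqrt() implementation.
The constants assume IEEE 754 floats evaluated without excess precision
(e.g. SSE math on x86-64).

```c
/* Add `term` to `start` `count` times using a float accumulator. */
float accumulate_float(float start, float term, int count)
{
    float sum = start;
    for (int i = 0; i < count; i++)
        sum += term;    /* has no effect once term is below half an ulp of sum */
    return sum;
}

/* Same loop with a double accumulator. */
double accumulate_double(double start, double term, int count)
{
    double sum = start;
    for (int i = 0; i < count; i++)
        sum += term;
    return sum;
}

/* Hypothetical stand-in for an integer square root: floor(sqrt(a)). */
unsigned isqrt_floor(unsigned a)
{
    unsigned r = 0;
    while ((unsigned long long)(r + 1) * (r + 1) <= a)
        r++;
    return r;
}

/* Reciprocal square root built on the integer root; valid for a >= 1. */
double inv_isqrt(unsigned a)
{
    return 1.0 / isqrt_floor(a);
}
```

With a float accumulator, accumulate_float(16777216.0f, 1.0f, 100) stays at
16777216 (2^24), while accumulate_double() with the same arguments reaches
16777316. Likewise inv_isqrt(2) returns 1.0 where the true value is about
0.7071, an error of roughly 41%.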

I benchmarked your code with both float and double sum variables. Result
with floats:

392916330 dezicycles in ff_sqrt, 1 runs, 0 skips
381062230 dezicycles in sqrtf, 1 runs, 0 skips
322693660 dezicycles in sqrt alex, 1 runs, 0 skips
393098055 dezicycles in ff_sqrt, 2 runs, 0 skips
380917730 dezicycles in sqrtf, 2 runs, 0 skips
322906540 dezicycles in sqrt alex, 2 runs, 0 skips
393145080 dezicycles in ff_sqrt, 4 runs, 0 skips
380681745 dezicycles in sqrtf, 4 runs, 0 skips
322720327 dezicycles in sqrt alex, 4 runs, 0 skips
393035408 dezicycles in ff_sqrt, 8 runs, 0 skips
380493788 dezicycles in sqrtf, 8 runs, 0 skips
322382240 dezicycles in sqrt alex, 8 runs, 0 skips
393197536 dezicycles in ff_sqrt, 16 runs, 0 skips
380411030 dezicycles in sqrtf, 16 runs, 0 skips
322413626 dezicycles in sqrt alex, 16 runs, 0 skips
393182715 dezicycles in ff_sqrt, 32 runs, 0 skips
380354178 dezicycles in sqrtf, 32 runs, 0 skips
322441543 dezicycles in sqrt alex, 32 runs, 0 skips
393194414 dezicycles in ff_sqrt, 64 runs, 0 skips
380323271 dezicycles in sqrtf, 64 runs, 0 skips
322219995 dezicycles in sqrt alex, 64 runs, 0 skips
3318.429688 3318.004883 3314.035156

Result with double sum variables (but still sqrtf, not sqrt):

312860180 dezicycles in ff_sqrt, 1 runs, 0 skips
130667270 dezicycles in sqrtf, 1 runs, 0 skips
327645670 dezicycles in sqrt alex, 1 runs, 0 skips
313082415 dezicycles in ff_sqrt, 2 runs, 0 skips
130357360 dezicycles in sqrtf, 2 runs, 0 skips
327654430 dezicycles in sqrt alex, 2 runs, 0 skips
312881092 dezicycles in ff_sqrt, 4 runs, 0 skips
130205297 dezicycles in sqrtf, 4 runs, 0 skips
327785202 dezicycles in sqrt alex, 4 runs, 0 skips
312870742 dezicycles in ff_sqrt, 8 runs, 0 skips
130198391 dezicycles in sqrtf, 8 runs, 0 skips
327706620 dezicycles in sqrt alex, 8 runs, 0 skips
312766372 dezicycles in ff_sqrt, 16 runs, 0 skips
130203895 dezicycles in sqrtf, 16 runs, 0 skips
327761492 dezicycles in sqrt alex, 16 runs, 0 skips
312736582 dezicycles in ff_sqrt, 32 runs, 0 skips
130194519 dezicycles in sqrtf, 32 runs, 0 skips
327788920 dezicycles in sqrt alex, 32 runs, 0 skips
312743900 dezicycles in ff_sqrt, 64 runs, 0 skips
130205690 dezicycles in sqrtf, 64 runs, 0 skips
327773023 dezicycles in sqrt alex, 64 runs, 0 skips
6420.692419 6419.931566 6413.858814

Two out of the three versions are actually faster when using doubles for
the sums, the sqrtf version several times so! GCC optimization seems to
succeed better in this case for some reason: with float sums it
calculates one square root at a time with the sqrtss instruction, but
with double sums it parallelizes the loop and calculates 4 square roots
with one sqrtps instruction.

-ffast-math makes quite a big difference for this program. Without it
the double version gets these times (float sums are still slower):

330678020 dezicycles in ff_sqrt, 64 runs, 0 skips
450401903 dezicycles in sqrtf, 64 runs, 0 skips
250199530 dezicycles in sqrt alex, 64 runs, 0 skips
6420.692419 6419.931566 6413.858813

ff_sqrt gets a bit slower, sqrtf gets a lot slower so that it's now
slower than ff_sqrt, but sqrt_alex actually gets over 30% faster when
NOT using -ffast-math.




