[Ffmpeg-devel] Fixed vs. Floating Point AAC

Thu Mar 9 17:13:51 CET 2006

Hi

On Thu, Mar 09, 2006 at 09:30:11AM -0500, Rich Felker wrote:
[...]
> Why not compare Athlon? P4 is known to suck...it takes multiple cycles
> just for a bitshift. Even my K6 has 1-cycle MUL/IMUL.

unlikely

> 
> > for the athlon the timings aren't clear from the docs i have, only that
> > 32*32->32 seems 1/4 and 32*32->64 worse than 1/6 if the high value is used
> > and FMUL >=1/4, also note fmul is direct path and imul is vector path, so
> > imul cannot execute together with anything else while fmul can
> 
> Could you explain the p/q notation you're using for throughput?

1/q means that code which does nothing but independent (I)MULs would need
q cycles per (I)MUL (so 1/4 is 4 cycles each), but the docs weren't clear,
they don't contain throughput values, just latency, which is meaningless for
us as there's plenty of independent stuff, so i did some guessing and it
seems i misguessed a little, see the benchmarks at the end
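
To make the latency/throughput distinction concrete, here is a minimal sketch in C. The function names are hypothetical and not part of the attached benchmark. The first loop is one dependent chain, so its time per multiply is bounded by latency; the second keeps four chains that never read each other's results, so a pipelined multiplier may overlap them and the time per multiply approaches the 1/q throughput figure. A compiler is free to rearrange either loop, so the generated code should be checked before timing it.

```c
#include <assert.h>
#include <stdint.h>

/* Latency-bound: every multiply needs the previous result, so the next
 * one cannot start early. Returns x * m^n modulo 2^32. */
uint32_t mul_chain_dependent(uint32_t x, uint32_t m, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        x *= m;
    return x;
}

/* Throughput-bound: four chains that never read each other's results.
 * Each a[k] becomes a[k] * m^n modulo 2^32. */
void mul_chains_independent(uint32_t a[4], uint32_t m, int n)
{
    assert(n >= 0);
    uint32_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
    for (int i = 0; i < n; i++) {
        a0 *= m;
        a1 *= m;
        a2 *= m;
        a3 *= m;
    }
    a[0] = a0; a[1] = a1; a[2] = a2; a[3] = a3;
}
```

Both functions do the same arithmetic per chain; only the dependency structure differs, which is why latency alone says little about code with plenty of independent work.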

> 
> > so i think i provided enough "proof", your only argument seems to be that
> > low precision integer tremor is faster than libvorbis, now AFAIK these are
> > 2 different implementations, i don't see how a comparison between them has
> > any meaning, i can also compare libavcodec's mp3 decoder which uses integers
> > mostly against the one in mplayer which is mostly floats, you know
> > which is faster ...
> 
> Yes. And no one's ever been able to explain why. But clearly it's

i just explained it, you don't want this explanation but it's still the
most likely reason

> unrelated to floats, since MPlayer's version is even faster on my K6
> with very slow float.

3dnow ...

argh, why am i wasting my time with this discussion
IIRC we had this silly integer vs. float discussion already at least once

so reusing some benchmark proggy, here are the results, nicely written and
source attached, feel free to design your own cpu which can do
integer multiplies faster than floating point ones

(latency in cycles, throughput in operations per cycle)
                        latency throughput
P3
int     32*32    ->32   4       1
int     32*32>>32->32   5.5     1/4.5
float   32*32    ->32   5       1/2

Duron
int     32*32    ->32   4       1/2
int     32*32>>32->32   6       1/4.5
float   32*32    ->32   3.5     1

Athlon
int     32*32    ->32   4       1/2
int     32*32>>32->32   6.5     1/5
float   32*32    ->32   3.5     1
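
For readers unsure what the row labels mean, the following is a minimal sketch of the two multiply forms being compared, matching the expressions used in the attached source. The helper names mul_hi32 and mul_f32 are hypothetical, and the restriction to non-negative operands is an assumption added here, because right-shifting a negative signed value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* "int 32*32>>32->32": widen to 64 bits, multiply, keep the high word.
 * With Q31 fixed-point inputs the result is in Q30. */
int32_t mul_hi32(int32_t a, int32_t b)
{
    assert(a >= 0 && b >= 0);   /* assumption of this sketch, see above */
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* "float 32*32->32": a plain single-precision multiply. */
float mul_f32(float a, float b)
{
    return a * b;
}
```

For example, 0.5 in Q31 is 1<<30, and mul_hi32(1<<30, 1<<30) gives 1<<28, which is 0.25 in Q30; the float version simply computes 0.5f * 0.5f.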

[...]
-- 
Michael
-------------- next part --------------
#include <stdio.h>
#include <asm/timex.h>
#include <inttypes.h>

#define x10(code) code code code code code code code code code code
#define VARS 16

volatile int v[VARS], w[2*VARS];

#define BENCH(code)\
for(i=0; i<10; i++){\
    for(j=0; j<VARS; j++) iv[j]=fv[j]=v[j];\
    t= get_cycles();\
    x10(x10(code))\
    t= get_cycles() - t;\
    for(j=0; j<VARS; j++) {\
        w[j     ]= iv[j]; \
        w[j+VARS]= fv[j];\
    }\
    if(i==9)\
        printf("100 " #code " %5lld cycles, %5.2f cycles/op\n", t, (t-overhead)/100.0/count);\
}

int main(){
    long long t, overhead=0;
    int i, j;
    int iv[VARS];
    float fv[VARS];
    int count=1;

    BENCH(;)
    overhead= t;
    count=2;
    BENCH(iv[0]+=iv[1];iv[1]+=iv[0];)
    BENCH(iv[0]*=iv[1];iv[1]*=iv[0];)
    BENCH(iv[0]=(iv[0]*(int64_t)iv[1])>>32;iv[1]=(iv[0]*(int64_t)iv[1])>>32;)
    BENCH(fv[0]+=fv[1];fv[1]+=fv[0];)
    BENCH(fv[0]*=fv[1];fv[1]*=fv[0];)
    count=5;
    BENCH(iv[0]+=iv[1];iv[1]+=iv[2];iv[2]+=iv[3];iv[3]+=iv[4];iv[4]+=iv[0];)
    BENCH(iv[0]*=iv[1];iv[1]*=iv[2];iv[2]*=iv[3];iv[3]*=iv[4];iv[4]*=iv[0];)
    BENCH(iv[0]=(iv[0]*(int64_t)iv[1])>>32;iv[1]=(iv[2]*(int64_t)iv[1])>>32;iv[2]=(iv[2]*(int64_t)iv[3])>>32;iv[3]=(iv[3]*(int64_t)iv[4])>>32;iv[4]=(iv[4]*(int64_t)iv[0])>>32;)
    BENCH(fv[0]+=fv[1];fv[1]+=fv[2];fv[2]+=fv[3];fv[3]+=fv[4];fv[4]+=fv[0];)
    BENCH(fv[0]*=fv[1];fv[1]*=fv[2];fv[2]*=fv[3];fv[3]*=fv[4];fv[4]*=fv[0];)

    return 0;
}
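
The cycles/op figure printed by BENCH can be checked by hand. The sketch below isolates the two pieces of arithmetic: x10(x10(code)) expands to 100 copies of the statement group, and the reported value subtracts the empty-loop overhead, then divides by those 100 copies and by count, the number of operations per copy. The function names are hypothetical, and the cycle counts in the usage note are made-up illustrative numbers, not measurements.

```c
#include <assert.h>

#define x10(code) code code code code code code code code code code

/* Counts how many copies of a statement x10(x10(...)) expands to. */
int unrolled_copies(void)
{
    int n = 0;
    x10(x10(n++;))
    return n;
}

/* Same formula as the printf in BENCH: remove the overhead measured with
 * an empty body, then divide by 100 copies and by operations per copy. */
double cycles_per_op(long long t, long long overhead, int count)
{
    assert(count > 0);
    return (t - overhead) / 100.0 / count;
}
```

With an illustrative t of 1100 cycles, an overhead of 100 cycles, and count set to 2, cycles_per_op returns 5.0, that is, 1000 cycles spread over 200 operations.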