[FFmpeg-devel] [PATCH] unscaled float 2 int conversion

Fri May 16 13:02:35 CEST 2008

On Fri, May 16, 2008 at 12:34:01PM +0200, Michael Niedermayer wrote:
> On Fri, May 16, 2008 at 09:48:11AM +0300, Siarhei Siamashka wrote:
> > On Friday 16 May 2008, Michael Niedermayer wrote:
> > 
> > [...]
> > 
> > > 2nd try, now it is a P3
> > >
> > > gcc-4.3 -O2 -fno-math-errno
> > > 221951 dezicycles in conv_cast, 16254 runs, 130 skips
> > > 107203 dezicycles in conv_lrint, 16291 runs, 93 skips
> > > 103967 dezicycles in conv_bias, 16286 runs, 98 skips
> > >
> > > gcc-4.2 -O2 -fno-math-errno -lm
> > > 214423 dezicycles in conv_cast, 16250 runs, 134 skips
> > > 114627 dezicycles in conv_lrint, 16325 runs, 59 skips
> > > 53196 dezicycles in conv_bias, 16334 runs, 50 skips
> > >
> > > gcc-4.1 -O2 -fno-math-errno -lm
> > > 212703 dezicycles in conv_cast, 16258 runs, 126 skips
> > > 111271 dezicycles in conv_lrint, 16318 runs, 66 skips
> > > 84831 dezicycles in conv_bias, 16316 runs, 68 skips
> > >
> > > gcc-4.0 -O2 -fno-math-errno -lm
> > > 215119 dezicycles in conv_cast, 16274 runs, 110 skips
> > > 169588 dezicycles in conv_lrint, 16282 runs, 102 skips
> > > 53398 dezicycles in conv_bias, 16338 runs, 46 skips
> > >
> > > gcc-3.4 -O2 -fno-math-errno -lm
> > > 215642 dezicycles in conv_cast, 16221 runs, 163 skips
> > > 105947 dezicycles in conv_lrint, 16318 runs, 66 skips
> > > 48505 dezicycles in conv_bias, 16338 runs, 46 skips
> > >
> > > after a little bit hacking on the code:
> > > 65010 dezicycles in conv_lrint, 16321 runs, 63 skips
> > >
> > > but this is still quite a bit slower
> > >
> > > So it seems the bias code is faster on P3 (P2/PPro) CPUs,
> > > which also means I won't approve its removal unless someone
> > > beats gcc-3.4 -O2 conv_bias on a P3/P2/PPro
> > >
> > > [...]
> > 
> > Please also try to benchmark this alternative code (use of 16-bit FISTP) 
> > on P2/P3/PPro. I did not run extensive tests, but it was even slower 
> > than lrintf with gcc 4.1 on Pentium-M:
> > 
> > 242987 dezicycles in conv_cast, 16378 runs, 6 skips
> > 40055 dezicycles in conv_lrint, 16382 runs, 2 skips
> > 47085 dezicycles in conv_x87_asm, 16380 runs, 4 skips
> > 866920 dezicycles in conv_x87_asm_ex, 16380 runs, 4 skips
> > 43762 dezicycles in conv_bias, 16376 runs, 8 skips
> 
> P3 gcc-3.4
> 215036 dezicycles in conv_cast, 16363 runs, 21 skips
> 115577 dezicycles in conv_lrint, 16361 runs, 23 skips
> 63010 dezicycles in conv_x87_asm, 16350 runs, 34 skips
> 664136 dezicycles in conv_x87_asm_ex, 16380 runs, 4 skips
> 48501 dezicycles in conv_bias, 16357 runs, 27 skips
> 
> And at that point I found a little bug in the benchmark; it should have been
> in[i]= i + i*i*0.3 - 32780;
> 
> with that it's:
> 228574 dezicycles in conv_cast, 16363 runs, 21 skips
> 107110 dezicycles in conv_lrint, 16359 runs, 25 skips
> 62921 dezicycles in conv_x87_asm, 16357 runs, 27 skips
> 58373 dezicycles in conv_x87_asm_ex, 16355 runs, 29 skips
> 43850 dezicycles in conv_bias, 16352 runs, 32 skips

    src += len;
    dst += len;
    len= - 2*len;
    __asm__ __volatile__(
        "finit\n\t" /* dirty hack to disable floating point exceptions */
        "flds    f32767\n\t"
        "flds    fminus32768\n\t"
    "1:\n\t"
        "flds     -4(%[src],%[len],2)\n\t"
        "flds       (%[src],%[len],2)\n\t"
        "flds      4(%[src],%[len],2)\n\t"
        "flds      8(%[src],%[len],2)\n\t"
        "fcomi    %%st(5), %%st(0)\n\t"
        "fcmovnbe %%st(5), %%st(0)\n\t"
        "fxch %%st(2)\n\t"
        "fcomi    %%st(5), %%st(0)\n\t"
        "fcmovnbe %%st(5), %%st(0)\n\t"
        "fxch %%st(1)\n\t"
        "fcomi    %%st(5), %%st(0)\n\t"
        "fcmovnbe %%st(5), %%st(0)\n\t"
        "fxch %%st(3)\n\t"
        "fcomi    %%st(5), %%st(0)\n\t"
        "fcmovnbe %%st(5), %%st(0)\n\t"
        "fistps   -2(%[dst],%[len])\n\t"
        "fistps    0(%[dst],%[len])\n\t"
        "fxch %%st(1)\n\t"
        "fistps    2(%[dst],%[len])\n\t"
        "fistps    4(%[dst],%[len])\n\t"
        "add      $8, %[len]\n\t"
        "jnz      1b\n\t"
        "ffree    %%st(0)\n\t"
        "fincstp\n\t"
        "ffree    %%st(0)\n\t"
        "fincstp\n\t"
        : [dst] "+&r" (dst), [src] "+&r" (src), [len] "+&r" (len)
        :
        : "cc", "memory");
51606 dezicycles in conv_x87_asm_ex, 16354 runs, 30 skips

but that's still quite a bit behind the bias code (which we did not try to 
optimize at all ...)
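
For readers unfamiliar with the "bias" approach, here is a minimal sketch of
one common way to do an unscaled float to int16 conversion with it. The
source of the benchmarked conv_bias is not shown in this message, so the
function names, the magic constant 12582912.0f (1.5 * 2^23) and the clipping
helper below are assumptions for illustration, not the code that was timed.
Adding the constant pushes the value into a range where a float has a
resolution of exactly 1, so the FPU rounds to the nearest integer and the
result can be read back from the mantissa bits with integer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit integer to the int16_t range. */
static int16_t clip_int16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Baseline: a plain C cast, which truncates toward zero. */
static int16_t conv_cast_sketch(float x)
{
    assert(x > -4194304.0f && x < 4194304.0f);
    return clip_int16((int32_t)x);
}

/* Bias trick (hypothetical variant, valid only for |x| < 2^22):
 * x + 1.5*2^23 lies in [2^23, 2^24), where one float ulp equals 1,
 * so the addition itself rounds x to the nearest integer. The bit
 * pattern of the sum is then 0x4B400000 + round(x). */
static int16_t conv_bias_sketch(float x)
{
    union { float f; uint32_t u; } v;   /* store forces single precision */

    assert(x > -4194304.0f && x < 4194304.0f);
    v.f = x + 12582912.0f;              /* 1.5 * 2^23 == 0x4B400000 */
    return clip_int16((int32_t)v.u - 0x4B400000);
}
```

In the default round-to-nearest mode this should agree with lrintf() for
in-range inputs, whereas the cast version truncates (2.9 becomes 2, not 3).
Whether it is actually faster depends on the CPU and compiler, as the
numbers above show.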

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I hate to see young programmers poisoned by the kind of thinking
Ulrich Drepper puts forward since it is simply too narrow -- Roman Shaposhnik