[FFmpeg-devel] what is h264_idct_add8()?

Tue Sep 14 00:03:23 CEST 2010

Hi,

On Mon, Sep 13, 2010 at 5:26 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Sun, Sep 12, 2010 at 08:24:45PM -0400, Ronald S. Bultje wrote:
>> Hi,
>>
>> On Sun, Sep 12, 2010 at 8:26 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Fri, Sep 10, 2010 at 09:48:53PM -0400, Ronald S. Bultje wrote:
>> >> On Mon, Sep 6, 2010 at 4:32 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> > On Mon, Sep 06, 2010 at 12:33:13PM -0400, Ronald S. Bultje wrote:
>> >> > [...]
>> >> >> Michael, do you still have the patch that enables using idct_add8()
>> >> >> for chroma (probably in h264.c) so I can test the performance of
>> >> >> yasmified idct_add8 against the current code that doesn't use
>> >> >> idct_add8()?
>> >> >
>> >> > i tried a bit of find and grep but it seems iam not looking at the right
>> >> > place or not searching for the right thing
>> >>
>> >> So what do you suggest we do?
>> >> a) remove the idct_add8() functions from H264DSPContext
>> >> b) leave as-is (because I can't test that my yasm conversion is correct)
>> >> c) convert it to yasm along with the rest, hope that it is correct
>> >> without testing (?)
>> >> d) something else?
>> >>
>> >> (A) is easiest, but (C) may have some benefit if I decide to test the
>> >> performance benefit in the future with the yasmified version. (B)
>> >> means duplication of code and thus sounds like a bad plan...
>> >
>> > iam against a, i dont care about the rest, mans suggestion is possible too but
>> > seems much work
>>
>> I appear to have wasted too much time on this already, so let's get this
>> over with. I only did a single measurement because the difference is quite
>> strong (the reason is obviously MMX vs SSE2, along with what you did
>> earlier to not have to call a vfunc 8 times)
>>
>> Current SVN:
>> 1838 dezicycles in chroma idct add8, 262111 runs, 33 skips
>>
>> Using add8 (see attached patch):
>> 1745 dezicycles in chroma idct add8, 262124 runs, 20 skips
>>
>> add8, SSE2:
>> 1264 dezicycles in chroma idct add8, 262106 runs, 38 skips
>>
>> My recommendation: we should apply this (along with the rest of my
>> yasmification).
>>
>> The rest of the yasmification patch is attached and will have to be
>> applied with it. I can in all honesty (I measured them all, bleh) say
>> that no single function is slower in yasm at this point, although that
>> took a good hack in h264_idct_add16_sse2() (somehow the unroll of the
>> loop plus inlining of scan8[] makes it a good 20% faster - right now
>> it's 10 cycles faster than the gcc one, but the not-unrolled one was
>> 20-25% slower than gcc (which unrolls it too)).
>>
>> Many (+/- half of the) functions are a few (5-30) cycles faster in
>> yasm, the other half is approximately equal speed. The speedups are
>> generally in functions where gcc screws up loop conditionals (e.g. for
>> (x=0;x<16;x++) { if (a || b) { .. } }, which it performs horribly at by
>> creating something like if (!a1) goto end1; { yes1: .. } if (!a2) goto
>> end2; { yes2: .. } [.. and so on until 16 ..] end1: if (b1) goto yes1;
>> if (b2) goto yes2; [.. and so on ..]). It's quite hilarious.
>>
>> Ronald
>
>>  h264.c |    8 ++++++++
>>  1 file changed, 8 insertions(+)
>> b89da7914f847f12bbd9c9ca547deedafe4f6326  h264_use_add8.patch
>
> if its faster (also time ./ffmpeg) and someone looked over the code
> then ive no objections

Everything below was measured on a Core i7 running OS X 10.6, using the
cathedral sample:

time ffmpeg (x86-64) after:
9.393
9.468
9.353

time ffmpeg (x86-64) before:
9.411
9.537
9.649

time ffmpeg (x86-32) after:
10.110
10.143
10.098

time ffmpeg (x86-32) before:
10.161
10.154
10.210

decode_mb START/STOP_TIMER before x86-32:
8453 dezicycles in decode_mb, 4192657 runs, 1647 skips
8462 dezicycles in decode_mb, 4192564 runs, 1740 skips
8439 dezicycles in decode_mb, 4192540 runs, 1764 skips

after x86-32:
8371 dezicycles in decode_mb, 4192574 runs, 1730 skips
8384 dezicycles in decode_mb, 4192549 runs, 1755 skips
8375 dezicycles in decode_mb, 4192546 runs, 1758 skips

decode_mb START/STOP_TIMER before x86-64:
7617 dezicycles in decode_mb, 4192592 runs, 1712 skips
7594 dezicycles in decode_mb, 4192654 runs, 1650 skips
7610 dezicycles in decode_mb, 4192527 runs, 1777 skips

after x86-64:
7524 dezicycles in decode_mb, 4192683 runs, 1621 skips
7587 dezicycles in decode_mb, 4192043 runs, 2261 skips
7528 dezicycles in decode_mb, 4192627 runs, 1677 skips
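
To compare the runs above at a glance, here is a minimal sketch in C that
averages the three runs of each configuration and reports the relative
change from before to after. The helper names (mean3, pct_change,
print_summary) and the array names are made up for this illustration and
are not part of FFmpeg; only the numbers are copied from the measurements
above. Averaging three runs is an assumption about how to summarize them,
not a statistical claim.

```c
#include <assert.h>
#include <stdio.h>

/* Numbers copied from the measurements in this mail. */
static const double time64_before[3] = {  9.411,  9.537,  9.649 };
static const double time64_after[3]  = {  9.393,  9.468,  9.353 };
static const double time32_before[3] = { 10.161, 10.154, 10.210 };
static const double time32_after[3]  = { 10.110, 10.143, 10.098 };
static const double mb64_before[3]   = { 7617, 7594, 7610 }; /* dezicycles */
static const double mb64_after[3]    = { 7524, 7587, 7528 };
static const double mb32_before[3]   = { 8453, 8462, 8439 };
static const double mb32_after[3]    = { 8371, 8384, 8375 };

/* Arithmetic mean of three samples (hypothetical helper). */
static double mean3(const double v[3])
{
    return (v[0] + v[1] + v[2]) / 3.0;
}

/* Relative change in percent from "before" to "after";
 * a negative result means "after" is smaller, i.e. faster. */
static double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Print one summary line for a before/after pair of measurements. */
static void print_row(const char *name,
                      const double before[3], const double after[3])
{
    double b = mean3(before), a = mean3(after);
    printf("%-24s before %9.3f  after %9.3f  change %+6.2f%%\n",
           name, b, a, pct_change(b, a));
}

/* Print the summary for all four comparisons. */
void print_summary(void)
{
    print_row("time ffmpeg (x86-64)", time64_before, time64_after);
    print_row("time ffmpeg (x86-32)", time32_before, time32_after);
    print_row("decode_mb (x86-64)",   mb64_before,   mb64_after);
    print_row("decode_mb (x86-32)",   mb32_before,   mb32_after);
}
```

Calling print_summary() from a main() prints one line per comparison. With
only three runs each, small differences should be read with the run-to-run
spread in mind.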

Will apply tomorrow if nobody objects.

Ronald