[FFmpeg-devel] what is h264_idct_add8()?

Mon Sep 13 02:24:45 CEST 2010

Hi,

On Sun, Sep 12, 2010 at 8:26 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Fri, Sep 10, 2010 at 09:48:53PM -0400, Ronald S. Bultje wrote:
>> On Mon, Sep 6, 2010 at 4:32 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Mon, Sep 06, 2010 at 12:33:13PM -0400, Ronald S. Bultje wrote:
>> > [...]
>> >> Michael, do you still have the patch that enables using idct_add8()
>> >> for chroma (probably in h264.c) so I can test the performance of
>> >> yasmified idct_add8 against the current code that doesn't use
>> >> idct_add8()?
>> >
>> > I tried a bit of find and grep, but it seems I am not looking in the right
>> > place or not searching for the right thing
>>
>> So what do you suggest we do?
>> a) remove the idct_add8() functions from H264DSPContext
>> b) leave as-is (because I can't test that my yasm conversion is correct)
>> c) convert it to yasm along with the rest, hope that it is correct
>> without testing (?)
>> d) something else?
>>
>> (A) is easiest, but (C) may have some benefit if I decide to test the
>> performance benefit in the future with the yasmified version. (B)
>> means duplication of code and thus sounds like a bad plan...
>
> I am against (a); I don't care about the rest. Mans' suggestion is possible too but
> seems like a lot of work

I appear to have wasted too much time on this already, so let's get this
over with. I only took a single measurement because the difference is quite
large (the reason is obviously MMX vs. SSE2, along with what you did
earlier to avoid having to call a vfunc 8 times).
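
As a rough illustration of the vfunc remark above, the sketch below compares
calling a per-block function pointer up to 8 times with making one indirect
call to a batched routine. It is a sketch only: SketchDSP, sketch_idct_add and
sketch_idct_add8 are made-up names, not the real H264DSPContext members or
signatures, and the per-block routine is a placeholder rather than a real IDCT.

```c
#include <assert.h>
#include <stdint.h>

enum { CHROMA_BLOCKS = 8, COEFFS_PER_BLOCK = 16 };

/* Counts calls made through a function pointer (for illustration only). */
static int indirect_calls;

/* Placeholder per-block routine: adds only the first coefficient to one
 * pixel. A real IDCT would transform all 16 coefficients of the block. */
static void sketch_idct_add(uint8_t *dst, const int16_t *block)
{
    dst[0] = (uint8_t)(dst[0] + block[0]);
}

/* Batched routine: a single entry point that loops over all 8 chroma
 * blocks itself and calls the per-block code directly. */
static void sketch_idct_add8(uint8_t *dst, const int16_t *blocks,
                             const uint8_t *nnz)
{
    for (int i = 0; i < CHROMA_BLOCKS; i++)
        if (nnz[i])
            sketch_idct_add(dst + i, blocks + i * COEFFS_PER_BLOCK);
}

/* Hypothetical function-pointer table, NOT the real H264DSPContext. */
typedef struct SketchDSP {
    void (*idct_add)(uint8_t *dst, const int16_t *block);
    void (*idct_add8)(uint8_t *dst, const int16_t *blocks,
                      const uint8_t *nnz);
} SketchDSP;

/* Variant A: the caller loops, so there are up to 8 indirect calls. */
void chroma_per_block(const SketchDSP *c, uint8_t *dst,
                      const int16_t *blocks, const uint8_t *nnz)
{
    assert(c->idct_add);
    for (int i = 0; i < CHROMA_BLOCKS; i++) {
        if (nnz[i]) {
            indirect_calls++;
            c->idct_add(dst + i, blocks + i * COEFFS_PER_BLOCK);
        }
    }
}

/* Variant B: exactly one indirect call; the loop lives in the callee. */
void chroma_batched(const SketchDSP *c, uint8_t *dst,
                    const int16_t *blocks, const uint8_t *nnz)
{
    assert(c->idct_add8);
    indirect_calls++;
    c->idct_add8(dst, blocks, nnz);
}
```

Both variants produce the same output; the difference is only in how many
times the caller goes through a function pointer.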

Current SVN:
1838 dezicycles in chroma idct add8, 262111 runs, 33 skips

Using add8 (see attached patch):
1745 dezicycles in chroma idct add8, 262124 runs, 20 skips

add8, SSE2:
1264 dezicycles in chroma idct add8, 262106 runs, 38 skips
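
For reference, the relative differences between these three numbers can be
worked out with a small helper. This is only arithmetic on the figures quoted
above; percent_faster is a hypothetical name, not part of FFmpeg.

```c
#include <assert.h>

/* Cycle counts copied from the three measurements quoted above. */
#define CYCLES_CURRENT_SVN 1838.0
#define CYCLES_ADD8        1745.0
#define CYCLES_ADD8_SSE2   1264.0

/* Percentage reduction in cycle count going from `before` to `after`. */
double percent_faster(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

With these inputs, add8 is about 5% faster than current SVN, and the SSE2
add8 is about 31% faster than current SVN (about 28% faster than plain add8).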

My recommendation: we should apply this (along with the rest of my
yasmification).

The rest of the yasmification patch is attached and will have to be
applied with it. I can in all honesty (I measured them all, bleh) say
that no single function is slower in yasm at this point, although that
took a good hack in h264_idct_add16_sse2() (somehow unrolling the
loop plus inlining scan8[] makes it a good 20% faster - right now
it's 10 cycles faster than the gcc one, but the non-unrolled one was
20-25% slower than gcc (which unrolls it too)).

Roughly half of the functions are a few (5-30) cycles faster in
yasm; the other half run at approximately equal speed. The speedups are
generally in functions where gcc screws up loop conditionals, e.g.
for (x=0;x<16;x++) { if (a || b) { .. } }, which it handles horribly by
creating something like if (!a1) goto end1; { yes1: .. } if (!a2) goto
end2; { yes2: .. } [.. and so on until 16 ..] end1: if (b1) goto yes1;
if (b2) goto yes2; [.. and so on ..]. It's quite hilarious.
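
To make the loop pattern concrete, here is a minimal sketch. The first
function is the source-level form; the second is a hand-written, rolled-up
paraphrase of the control-flow shape described above (test on a inline, test
on b moved out of line, jumping back into the body). It is not actual gcc
output, and the array names and the counter standing in for the per-block
work are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_BLOCKS = 16 };

/* Source-level form: one loop with one combined condition. */
int count_processed_loop(const uint8_t *a, const int16_t *b)
{
    int processed = 0;
    assert(a && b);
    for (int x = 0; x < NUM_BLOCKS; x++) {
        if (a[x] || b[x])
            processed++;        /* stands in for the real per-block work */
    }
    return processed;
}

/* Same logic with the two tests split apart: the a[x] test sits on the
 * main path, the b[x] test is out of line and jumps back to the body. */
int count_processed_split(const uint8_t *a, const int16_t *b)
{
    int processed = 0;
    int x = 0;
    assert(a && b);
next:
    if (x == NUM_BLOCKS)
        return processed;
    if (!a[x])
        goto check_b;
yes:
    processed++;                /* stands in for the real per-block work */
    x++;
    goto next;
check_b:
    if (b[x])
        goto yes;
    x++;
    goto next;
}
```

Both functions return the same count for any input; the second one only shows
how many extra jumps the split form introduces per iteration.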

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264_use_add8.patch
Type: application/octet-stream
Size: 1255 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100912/869f6c68/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: yamsify-h264_idct.patch
Type: application/octet-stream
Size: 46039 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100912/869f6c68/attachment-0001.obj>