[FFmpeg-devel] what is h264_idct_add8()?

Mon Sep 13 23:26:22 CEST 2010

On Sun, Sep 12, 2010 at 08:24:45PM -0400, Ronald S. Bultje wrote:
> Hi,
> 
> On Sun, Sep 12, 2010 at 8:26 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > On Fri, Sep 10, 2010 at 09:48:53PM -0400, Ronald S. Bultje wrote:
> >> On Mon, Sep 6, 2010 at 4:32 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >> > On Mon, Sep 06, 2010 at 12:33:13PM -0400, Ronald S. Bultje wrote:
> >> > [...]
> >> >> Michael, do you still have the patch that enables using idct_add8()
> >> >> for chroma (probably in h264.c) so I can test the performance of the
> >> >> yasmified idct_add8() against the current code that doesn't use
> >> >> idct_add8()?
> >> >
> >> > I tried a bit of find and grep, but it seems I am not looking in the right
> >> > place or not searching for the right thing.
> >>
> >> So what do you suggest we do?
> >> a) remove the idct_add8() functions from H264DSPContext
> >> b) leave as-is (because I can't test that my yasm conversion is correct)
> >> c) convert it to yasm along with the rest, hope that it is correct
> >> without testing (?)
> >> d) something else?
> >>
> >> (A) is easiest, but (C) may have some benefit if I decide to test the
> >> performance benefit in the future with the yasmified version. (B)
> >> means duplication of code and thus sounds like a bad plan...
> >
> > I am against (a); I don't care about the rest. Mans' suggestion is possible too but
> > seems like much work.
> 
> I appear to have wasted too much time on this already, so let's get this
> over with. I only did a single measurement because the difference is quite
> strong (the reason is obviously MMX vs SSE2, along with what you did
> earlier to not have to call a vfunc 8 times).
> 
> Current SVN:
> 1838 dezicycles in chroma idct add8, 262111 runs, 33 skips
> 
> Using add8 (see attached patch):
> 1745 dezicycles in chroma idct add8, 262124 runs, 20 skips
> 
> add8, SSE2:
> 1264 dezicycles in chroma idct add8, 262106 runs, 38 skips
> 
> My recommendation: we should apply this (along with the rest of my
> yasmification).
> 
> The rest of the yasmification patch is attached and will have to be
> applied with it. I can in all honesty (I measured them all, bleh) say
> that no single function is slower in yasm at this point, although that
> took a good hack in h264_idct_add16_sse2() (somehow the unroll of the
> loop plus inlining of scan8[] makes it a good 20% faster - right now
> it's 10 cycles faster than the gcc one, but the not-unrolled one was
> 20-25% slower than gcc (which unrolls it too)).
> 
> Many (+/- half of the) functions are a few (5-30) cycles faster in
> yasm, the other half is approximately equal speed. The speedups are
> generally in functions where gcc screws up loop conditionals (e.g. for
> (x=0;x<16;x++) { if (a || b) { .. } }, which it performs horribly at by
> creating something like if (!a1) goto end1; { yes1: .. } if (!a2) goto
> end2; { yes2: .. } [.. and so on until 16 ..] end1: if (b1) goto yes1;
> if (b2) goto yes2; [.. and so on ..]). It's quite hilarious.
> 
> Ronald

>  h264.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> b89da7914f847f12bbd9c9ca547deedafe4f6326  h264_use_add8.patch

If it's faster (also check with time ./ffmpeg) and someone has looked over the code,
then I have no objections.
This also applies to other h264 asm optimizations.

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In fact, the RIAA has been known to suggest that students drop out
of college or go to community college in order to be able to afford
settlements. -- The RIAA