[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#8

Balatoni Denes dbalatoni
Thu Aug 30 20:42:51 CEST 2007


Hi!

New patch attached.

On Thursday 30 August 2007 01:25, Michael Niedermayer wrote:
> > @@ -4045,6 +4049,13 @@
> >    int accel = vis_level ();
> >
> >    if (accel & ACCEL_SPARC_VIS) {
> > +      if(avctx->idct_algo==FF_IDCT_SIMPLEVIS){
> > +                c->idct_put = ff_simple_idct_put_vis;
> > +                c->idct_add = ff_simple_idct_add_vis;
> > +                c->idct     = ff_simple_idct_vis;
> > +                c->idct_permutation_type = FF_TRANSPOSE_IDCT_PERM;
> > +      }
> > +
>
> this should be 4 spaces indented

Yes, sorry about that.


> > +        "fbe 3f                        \n\t"\
> > +        "nop                           \n\t"\
>
> you can move an instruction into the nop slot; it's always executed if the
> annul bit is not set, according to the docs, so the fpadd16 %%f26, %%f2, %%f26
> from above would be a choice
> this applies to all the other nops as well

Ok, I did this.

> > +    /* 2. column */\
> > +        "for %%f4, %%f6, %%f60         \n\t"\
> > +        "fcmpd %%fcc0, %%f62, %%f60    \n\t"\
>
> the for and fcmpd can be moved up (with some distance from each other,
> so as to avoid the 10 cycle stall; you said all instructions have a latency
> of 6 on the US T2). There's nothing touching any of
> f4,f6,f60,f62,fcc above, so this should work
[...]
> > +    /* 3. column */\
> > +        "3:                             \n\t"\
> > +        "for %%f8, %%f10, %%f60         \n\t"\
> > +        "fcmpd %%fcc0, %%f62, %%f60     \n\t"\
>
> the for and fcmpd can similarly be moved up, you have to switch to fcc1
> though to avoid a conflict with the above ones
> this applies to the other for/fcmpd as well

You were right, all four floating point condition registers can be used - I 
misunderstood the documentation. Now everything is moved up, and this did 
lead to a measurable 3% speedup (as it should have) on "my" UltraSPARC IIIi!
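
For readers without a SPARC at hand, here is a rough C sketch of the idea
(an analogy only, not the VIS code from the patch). It assumes, from the
quoted fragments, that each "for"/"fcmpd" pair ORs two coefficient registers
and compares the result with a zero register so that an all-zero group can be
skipped. The names col_pair, pair_is_zero and nonzero_groups are hypothetical.
Each test writes its own flag, the way each fcmpd can write its own
fcc0..fcc3, so all tests can be issued early and the branches taken later.

```c
#include <stdint.h>

/* Hypothetical layout: one group is two 64-bit words of four 16-bit
 * coefficients each (like the f4/f6 or f8/f10 register pairs). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} col_pair;

/* Models the intent of "for" + "fcmpd": OR the two words, compare with 0.
 * This is an integer compare, unlike fcmpd; it is only an illustration. */
int pair_is_zero(col_pair p)
{
    return (p.lo | p.hi) == 0;
}

/* Run all four tests up front into separate flags (the analogue of
 * fcc0..fcc3), then consume the flags afterwards. Returns a bitmask of
 * the groups that would still need processing. */
unsigned nonzero_groups(const col_pair g[4])
{
    int skip[4];
    unsigned mask = 0;
    int i;

    for (i = 0; i < 4; i++)   /* hoisted compares, one flag per group */
        skip[i] = pair_is_zero(g[i]);

    for (i = 0; i < 4; i++)   /* later branches, no flag is overwritten */
        if (!skip[i])
            mask |= 1u << i;

    return mask;
}
```

With a single shared flag the second loop could not be separated from the
first, which is the C-level counterpart of every compare having to sit right
before its branch.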

> [...]
>
> > +        TRANSPOSE
> > +        IDCT4ROWS
> > +        SCALEROWS
> > +        PUTPIXELSCLAMPED("0")
> > +        LOAD("%2+64")
> > +        TRANSPOSE
> > +        IDCT4ROWS
> > +        SCALEROWS
> > +        PUTPIXELSCLAMPED("4")
>
> the SCALEROWS is unneeded, the fpack16 can do the downshift and a single
> addition to the 0,0 coefficient before the idct or first column after the
> transpose can compensate for the rounding difference
>
>
> [...]
>
> > +        TRANSPOSE
> > +        IDCT4ROWS
> > +        SCALEROWS
> > +        ADDPIXELSCLAMPED("0")
> > +        LOAD("%2+64")
> > +        TRANSPOSE
> > +        IDCT4ROWS
> > +        SCALEROWS
> > +        ADDPIXELSCLAMPED("4")
>
> same here, the SCALEROWS can be avoided by changing the shift used in
> fpack16 and the expansion value for the added pixels as well as adding a
> bias with a single instruction further above

Ok, I did this too. I missed this before somehow.
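
A small C sketch of why a separate scaling pass is not needed (toy numbers,
not the constants from the patch). It assumes the DC coefficient reaches
every output with the same weight W, so a bias added once to the DC input
shows up as a rounding constant in all outputs. W, SHIFT and the function
names are made up for illustration, and the plain right shift only stands in
for the downshift done while packing.

```c
#include <stdint.h>

/* Toy values, chosen so that the rounding constant 1 << (SHIFT - 1)
 * is an exact multiple of the DC weight W. */
enum { W = 4, SHIFT = 5 };

/* Reference: add the rounding constant to each output, then shift.
 * "rest" stands for the contribution of all non-DC coefficients. */
int32_t out_rounded(int32_t dc, int32_t rest)
{
    return (W * dc + rest + (1 << (SHIFT - 1))) >> SHIFT;
}

/* Folded: bias the DC coefficient once, then use a truncating shift. */
int32_t out_biased(int32_t dc, int32_t rest)
{
    int32_t bias = (1 << (SHIFT - 1)) / W;
    return (W * (dc + bias) + rest) >> SHIFT;
}

/* Compare both variants over a small input range; returns the number of
 * mismatches. Both sides shift the same value, so the result does not
 * depend on how the compiler shifts negative numbers. */
int count_mismatches(void)
{
    int bad = 0;
    int32_t dc, rest;

    for (dc = -64; dc <= 64; dc++)
        for (rest = -200; rest <= 200; rest++)
            if (out_rounded(dc, rest) != out_biased(dc, rest))
                bad++;
    return bad;
}
```

When the rounding constant is not an exact multiple of the DC weight, the
folded bias is only approximate, which is the rounding difference the review
comment talks about compensating for.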

> [...]

bye
Denes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_idct_vis_try8.diff
Type: text/x-diff
Size: 21431 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070830/e0184384/attachment.diff>


