[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations

Thu Jul 31 18:56:18 CEST 2008

On Thu, Jul 31, 2008 at 02:50:44PM +0200, Victor Pollex wrote:
> Michael Niedermayer schrieb:
> 
> [...]
> 
> >   
> >> @@ -467,7 +469,256 @@
> >>  DECLARE_FUNCTION(3, 2)
> >>  DECLARE_FUNCTION(3, 3)
> >>  
> >> +static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
> >> +{
> >> +    DECLARE_ALIGNED_16(int16_t, temp[64]);
> >> +    asm volatile(
> >> +    LOAD4(q,0x10,0x00(%0),%%mm5,%%mm1,%%mm0,%%mm3)
> >> +    TRANSPOSE4(%%mm5,%%mm1,%%mm0,%%mm3,%%mm4)
> >> +    STORE4(q,0x10,0x00(%0),%%mm5,%%mm3,%%mm4,%%mm0)
> >> +
> >> +    LOAD4(q,0x10,0x08(%0),%%mm6,%%mm5,%%mm7,%%mm1)
> >> +    TRANSPOSE4(%%mm6,%%mm5,%%mm7,%%mm1,%%mm2)
> >> +    STORE4(q,0x10,0x08(%0),%%mm6,%%mm1,%%mm2,%%mm7)
> >>     
> >
> > it is still transposing the data at the begin of functions.
> > I thought you transposed the scantables ...
> >   
> 
> I did transpose the scantables except the one for the 8x8 
> transformation, as it is used in several places and a lot more code has 
> to be changed to accomodate the scantable change.

Iam ok with you ommiting the 8x8 transform (so only the other 3 are optimized)
but iam not ok with them being implemented suboptimally.

> 
> >
> > [...]
> >   
> >> +    :
> >> +    : "r"(block), "m"(temp[0])
> >> +    : "memory"
> >> +    );
> >> +
> >> +    asm volatile(
> >>     
> >
> > why is this asm () block splited?
> >   
> 
> for some stupid reason my gcc adds a "push ebx" and "pop ebx" to the 
> start and the end of the function if I use more than 3 general purpose 

:/ ok then leave them splited

> register in an asm block. I'm using gcc 4.3.1, is this some sort of bug, 
> perhaps a known bug?

as mans said gcc is a bug.
Ive not seen this one previously, but ive seen gcc put really silly things
between asm() blocks when it could, thats why i suggested to merge them.
But if that causes worse code to be generated, then it misses the point.

> 
> >
> > [...]
> >   
> >> +    STORE4(q,0x10,0x40%1,%%mm4,%%mm7,%%mm0,%%mm6)
> >> +    :
> >> +    : "r"(block), "m"(temp[0]), "m"(ff_pw_4)
> >> +    : "memory"
> >> +    );
> >> +
> >> +    asm volatile(
> >> +    "movq 0x30%3,  %%mm1\n\t" /* b[3] */
> >> +    TRANSFORM_4X8_COL_H1
> >> +    (
> >> +        q,q,
> >> +        0x00%3,0x10%3,0x20%3,0x40%3,0x70%3,
> >>     
> >
> > this store and later load seems redundant
> >   
> 
> I need them later in the second half of the 4x8 column transformation 
> and for the first half I need b[3], b[5] and b[6] of which only b[5] and 
> b[6] are already in the registers so I need to load b[3] and before I 
> use any further data I use all of the remaining registers so I have to 
> load them.

Well, i dont care enough about VC1 to check if there maybe is a way to
reorder things and avoid them so iam fine with that part

[...]
> >
> > [...]
> >   
> >> +void ff_vc1dsp_init_sse2(DSPContext* dsp, AVCodecContext *avctx) {
> >> +    if(!(mm_flags & MM_SSE2))
> >> +        return;
> >> +
> >> +    dsp->vc1_inv_trans_8x8 = vc1_inv_trans_8x8_sse2;
> >> +    dsp->vc1_inv_trans_4x8 = vc1_inv_trans_4x8_sse2;
> >> +    dsp->vc1_inv_trans_8x4 = vc1_inv_trans_8x4_sse2;
> >> +}
> >>     
> >
> > are all of the SSE2 variants faste than mmx?
> >   
> For me the 8x8 sse2 variant is faster than the mmx one, but as I 
> metioned in an earlier post, the 4x8 isn't and the 8x4 is only a bit 
> faster, that is why I asked if someone else could benchmark them, to see 
> if they behave like that just for me.

In the anbsense of anyone posting benchmark we only have yours that says
that one isnt faster so it should be disabled, it could be kept in the file
under #if 0 for others to test ...

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Concerning the gods, I have no means of knowing whether they exist or not
or of what sort they may be, because of the obscurity of the subject, and
the brevity of human life -- Protagoras
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080731/3b595cd9/attachment.pgp>