[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations

Thu Jul 31 14:50:44 CEST 2008

Michael Niedermayer wrote:

[...]

>   
>> @@ -467,7 +469,256 @@
>>  DECLARE_FUNCTION(3, 2)
>>  DECLARE_FUNCTION(3, 3)
>>  
>> +static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
>> +{
>> +    DECLARE_ALIGNED_16(int16_t, temp[64]);
>> +    asm volatile(
>> +    LOAD4(q,0x10,0x00(%0),%%mm5,%%mm1,%%mm0,%%mm3)
>> +    TRANSPOSE4(%%mm5,%%mm1,%%mm0,%%mm3,%%mm4)
>> +    STORE4(q,0x10,0x00(%0),%%mm5,%%mm3,%%mm4,%%mm0)
>> +
>> +    LOAD4(q,0x10,0x08(%0),%%mm6,%%mm5,%%mm7,%%mm1)
>> +    TRANSPOSE4(%%mm6,%%mm5,%%mm7,%%mm1,%%mm2)
>> +    STORE4(q,0x10,0x08(%0),%%mm6,%%mm1,%%mm2,%%mm7)
>>     
>
> it is still transposing the data at the beginning of the functions.
> I thought you transposed the scantables ...
>   

I did transpose the scantables, except the one for the 8x8
transformation: it is used in several places, and a lot more code would
have to be changed to accommodate the scantable change.
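
To illustrate what transposing a scantable buys, here is a minimal
sketch. It is not code from the patch: transpose_scantable() and
place_coeffs() are hypothetical helpers, and the stride-8 block layout is
an assumption. Writing the coefficients through the transposed table
yields the transposed block directly, so the transform itself no longer
needs a TRANSPOSE4 step at its start.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: map every scan position
 * row*8+col of a stride-8 block to its transposed position col*8+row. */
static void transpose_scantable(uint8_t *dst, const uint8_t *src, int n)
{
    for (int i = 0; i < n; i++) {
        int row = src[i] >> 3;
        int col = src[i] & 7;
        assert(src[i] < 64);          /* positions must lie inside 8x8 */
        dst[i] = (uint8_t)(col * 8 + row);
    }
}

/* Hypothetical helper: place n coefficients, given in scan order,
 * into a zeroed 8x8 block using the supplied scantable. */
static void place_coeffs(int16_t block[64], const uint8_t *scan,
                         const int16_t *coeffs, int n)
{
    for (int i = 0; i < 64; i++)
        block[i] = 0;
    for (int i = 0; i < n; i++)
        block[scan[i]] = coeffs[i];
}
```

A block filled through the transposed table is the transpose of the block
filled through the original table, which is why every user of a shared
table has to be adapted when the table changes.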

>
> [...]
>   
>> +    :
>> +    : "r"(block), "m"(temp[0])
>> +    : "memory"
>> +    );
>> +
>> +    asm volatile(
>>     
>
> why is this asm () block split?
>   

For some stupid reason my gcc adds a "push ebx" and a "pop ebx" at the
start and the end of the function if I use more than 3 general-purpose
registers in an asm block. I'm using gcc 4.3.1. Is this some sort of bug,
perhaps a known one?

>
> [...]
>   
>> +    STORE4(q,0x10,0x40%1,%%mm4,%%mm7,%%mm0,%%mm6)
>> +    :
>> +    : "r"(block), "m"(temp[0]), "m"(ff_pw_4)
>> +    : "memory"
>> +    );
>> +
>> +    asm volatile(
>> +    "movq 0x30%3,  %%mm1\n\t" /* b[3] */
>> +    TRANSFORM_4X8_COL_H1
>> +    (
>> +        q,q,
>> +        0x00%3,0x10%3,0x20%3,0x40%3,0x70%3,
>>     
>
> this store and later load seems redundant
>   

I need them later, in the second half of the 4x8 column transformation.
For the first half I need b[3], b[5] and b[6], of which only b[5] and
b[6] are already in registers, so I have to load b[3]. By the time I
need any further data, all of the remaining registers are in use, so
those values have to be loaded again as well.

> and the asm should not be split
>   

see above.

>
> [...]
>   
>> +    STORE4(dqa,0x10,0x00(%0),%%xmm0,%%xmm5,%%xmm7,%%xmm3)
>> +    STORE4(dqa,0x10,0x40(%0),%%xmm6,%%xmm4,%%xmm2,%%xmm1)
>> +    TRANSFORM_8X4_ROW_H1
>> +    (
>> +        dqa,dqa,
>> +        0x00(%0),0x20(%0),0x40(%0),0x70(%0),
>>     
>
> some of these stores and loads seem redundant
>   

I need them for the second half of the 8x4 row transformation, and
again, by the time I need any further data, all of the remaining
registers are in use.

>
> [...]
>   
>> +void ff_vc1dsp_init_sse2(DSPContext* dsp, AVCodecContext *avctx) {
>> +    if(!(mm_flags & MM_SSE2))
>> +        return;
>> +
>> +    dsp->vc1_inv_trans_8x8 = vc1_inv_trans_8x8_sse2;
>> +    dsp->vc1_inv_trans_4x8 = vc1_inv_trans_4x8_sse2;
>> +    dsp->vc1_inv_trans_8x4 = vc1_inv_trans_8x4_sse2;
>> +}
>>     
>
> are all of the SSE2 variants faster than mmx?
>   
For me the 8x8 sse2 variant is faster than the mmx one, but as I
mentioned in an earlier post, the 4x8 one isn't, and the 8x4 one is only
a bit faster. That is why I asked whether someone else could benchmark
them, to see if they behave like that only on my machine.
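
For anyone willing to compare two variants outside the decoder, a rough
sketch of a harness follows. Everything in it is an assumption made for
illustration: variant_a() and variant_b() are stand-ins rather than the
real vc1_inv_trans_* functions, and timing with clock() is much coarser
than a cycle counter, so it needs a large iteration count to be
meaningful.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*transform_fn)(int16_t block[64]);

/* Stand-in for one implementation (hypothetical, not the real transform). */
static void variant_a(int16_t block[64])
{
    for (int i = 0; i < 64; i++)
        block[i] = (int16_t)(block[i] * 3 + 1);
}

/* Stand-in for a second implementation with the same result. */
static void variant_b(int16_t block[64])
{
    for (int i = 0; i < 64; i += 4) {
        block[i]     = (int16_t)(block[i]     * 3 + 1);
        block[i + 1] = (int16_t)(block[i + 1] * 3 + 1);
        block[i + 2] = (int16_t)(block[i + 2] * 3 + 1);
        block[i + 3] = (int16_t)(block[i + 3] * 3 + 1);
    }
}

/* Returns 1 if both variants produce identical output for one test block. */
static int same_output(transform_fn a, transform_fn b)
{
    int16_t x[64], y[64];
    for (int i = 0; i < 64; i++)
        x[i] = y[i] = (int16_t)(i - 32);
    a(x);
    b(y);
    return memcmp(x, y, sizeof(x)) == 0;
}

/* CPU time in seconds spent running fn over freshly filled blocks. */
static double bench_seconds(transform_fn fn, long iterations)
{
    int16_t block[64];
    volatile int16_t sink = 0;   /* keeps the calls from being optimized out */
    clock_t start = clock();
    for (long n = 0; n < iterations; n++) {
        for (int i = 0; i < 64; i++)
            block[i] = (int16_t)(i + (n & 7));
        fn(block);
        sink ^= block[n & 63];
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Checking same_output() first and then calling bench_seconds() for each
variant with the same iteration count, repeated a few times, gives
numbers that can be compared across machines.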