[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Alexander Strange astrange
Sun Apr 13 11:35:01 CEST 2008


On Apr 12, 2008, at 8:15 AM, Michael Niedermayer wrote:
> On Sat, Apr 12, 2008 at 04:29:28AM -0400, Alexander Strange wrote:
>>
>> On Apr 10, 2008, at 7:53 PM, Michael Niedermayer wrote:
>>> On Thu, Apr 10, 2008 at 06:42:40PM -0400, Alexander Strange wrote:
>>>>
>>>> On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
>>>>> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>>>>>> This adds skal's sse2 idct and uses it as the xvid idct when  
>>>>>> available.
>>>>>>
>>>>>> I merged two shuffles into the permutation and changed the
>>>>>> zero-skipping
>>>>>> some - it's fastest in MMX and not really worth doing for the  
>>>>>> first
>>>>>> three
>>>>>> rows. Their right halfs are still usually all zero, but adding  
>>>>>> the
>>>>>> branch
>>>>>> to check for it is a net loss. The best thing for speed would be
>>>>>> switching
>>>>>> IDCTs by counting the last nonzero coefficient position, but  
>>>>>> that's
>>>>>> something for later.
>>>>>>
>>>>>> xvididctheader - makes a new header so I don't add any more  
>>>>>> extern
>>>>>> declarations in .c files.
>>>>>> sse2-permute - the new permutation; it might not have a  
>>>>>> specific enough
>>>>>> name, but it should work as well for simpleidct as this if I  
>>>>>> can get
>>>>>> back
>>>>>> to that.
>>>>>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>>>>>
>>>>>> The URLs in the header (copied from idct_mmx_xvid and the  
>>>>>> original nasm
>>>>>> source) are broken at the moment, but archive.org URLs are  
>>>>>> longer than
>>>>>> 80
>>>>>> characters, so I left them like they are.
>>>>>>
>>>>>> skal agreed it could be under LGPL in the last thread.
>>>>> [...]
>>>>>> #define SKIP_ROW_CHECK(src)                 \
>>>>>>  "movq     "src", %%mm0            \n\t" \
>>>>>>  "por    8+"src", %%mm0            \n\t" \
>>>>>>  "packssdw %%mm0, %%mm0            \n\t" \
>>>>>>  "movd     %%mm0, %%eax            \n\t" \
>>>>>>  "testl    %%eax, %%eax            \n\t" \
>>>>>>  "jz 1f                            \n\t"
>>>>>
>>>>> You could try to check pairs of rows, this might be faster for  
>>>>> some
>>>>> rows.
>>>>> Also the code should be interleaved not form such nasty dependancy
>>>>> chains
>>>>> you do have enogh mmx registers.
>>>>
>>>> I think the movq breaks the dependence chain, at least on my CPU.  
>>>> But
>>>> moving stuff above the branch is good - changed to check two rows  
>>>> at once
>>>> for 3-6 and use MMX pmovmskb.
>>>
>>> Good though not exactly what i meant.
>>> What i meant was
>>>
>>> "movq     "row1", %%mm1           \n\t" \
>>> "por    8+"row1", %%mm1           \n\t" \
>>> "movq     "row2", %%mm2           \n\t" \
>>> "por    8+"row2", %%mm2           \n\t" \
>>> "por       %%mm1, %%mm2           \n\t" \
>>> "paddusb   %%mm0, %%mm2           \n\t" \
>>> "pmovmskb  %%mm2, %%eax           \n\t" \
>>> "testl     %%eax, %%eax           \n\t" \
>>> "jz 123f                            \n\t"
>>>
>>> for example maybe for rows 5 and 6
>>>
>>> of course this is just a random idea and must be tested if its any  
>>> good
>>> or not.
>>>
>>> Also maybe some speed could be gained by writing a few custom  
>>> iLLM_PASS()
>>> for some patterns of zero rows. Note, this does not need any checks
>>> anymore
>>> as the rows have already been checked. But care must be taken here  
>>> not
>>> to "use" to much code cache. (and like everything else ignore it  
>>> if its
>>> slower)
>>
>> I had looked into that, but it didn't seem to help - no change for  
>> xvid and
>> it got a bit worse for mpeg2 (which I'm trying not to pessimize  
>> since the
>> idct is already quite good for it). Rows 0-2 are always nonzero due  
>> to the
>> rounder, and 3 and 7 seem to be set often, so I added one for 4-6  
>> being
>> zero, which seems like it worked. The structure is somewhat strange  
>> now,
>
> 7 is a special case for mpeg2 which should disapear with  
> CODEC_FLAG2_FAST.
> So iam tempted to suggest to also add 7 into the mix and just tell  
> users
> to use CODEC_FLAG2_FAST. (of course only if the non zero 7 disapears  
> with
> it)

Done. It is still somewhat slower with mpeg2+fast, maybe because the  
lower chance of zeros makes the branches mispredict more, but as it  
works better for xvid I'll stick with it. (tried some branch merging  
and didn't see any improvement)


>> but it gets rid of one branch and it's faster than the other ways I  
>> tried.
>> (for instance, replacing two of the three test/jnz with an or/jnz, or
>> removing the last zero skip for row 6)
>>
>> The added align helps some rule about branches not crossing 16-byte  
>> blocks
>> - it seemed a little bit positive and only adds one nop instruction.
>>
>> I guess there might be some small advantage from adding ones for  
>> 5+6 being
>> zero and 3-6 being zero, but probably not enough to be worth it,  
>> since the
>> LLM pass is already quite fast compared to the dot products in the  
>> rows.
>
> [...]
>
>> Index: libavcodec/i386/dsputil_mmx.c
>> ===================================================================
>> --- libavcodec/i386/dsputil_mmx.c	(revision 12670)
>> +++ libavcodec/i386/dsputil_mmx.c	(working copy)
>> @@ -2126,7 +2127,12 @@
>>             }else if(idct_algo==FF_IDCT_CAVS){
>>                     c->idct_permutation_type= FF_TRANSPOSE_IDCT_PERM;
>>             }else if(idct_algo==FF_IDCT_XVIDMMX){
>> -                if(mm_flags & MM_MMXEXT){
>> +                if (mm_flags & MM_SSE2){
>> +                    c->idct_put= ff_idct_xvid_sse2_put;
>> +                    c->idct_add= ff_idct_xvid_sse2_add;
>> +                    c->idct    = ff_idct_xvid_sse2;
>> +                    c->idct_permutation_type= FF_SSE2_IDCT_PERM;
>> +                }else if(mm_flags & MM_MMXEXT){
>
> if( vs. if (

changed


>>    "psubsw   %%xmm6, %%xmm5          \n\t" \
>>    "movdqa   "ROW0", %%xmm4          \n\t" \
>>    "movdqa   "ROW4", %%xmm6          \n\t" \
>>    "movdqa   %%xmm2, "spill"         \n\t" \
>>    "movdqa   %%xmm4, %%xmm2          \n\t" \
>>    "psubsw   %%xmm6, %%xmm4          \n\t" \
>>    "paddsw   %%xmm2, %%xmm6          \n\t" \
>>    "movdqa   %%xmm6, %%xmm2          \n\t" \
>>    "psubsw   %%xmm7, %%xmm6          \n\t" \
>>    "paddsw   %%xmm2, %%xmm7          \n\t" \
>>    "movdqa   %%xmm4, %%xmm2          \n\t" \
>>    "psubsw   %%xmm5, %%xmm4          \n\t" \
>>    "paddsw   %%xmm2, %%xmm5          \n\t" \
>>    "movdqa   %%xmm5, %%xmm2          \n\t" \
>>    "psubsw   %%xmm0, %%xmm5          \n\t" \
>>    "paddsw   %%xmm2, %%xmm0          \n\t" \
>>    "movdqa   %%xmm4, %%xmm2          \n\t" \
>>    "psubsw   %%xmm3, %%xmm4          \n\t" \
>>    "paddsw   %%xmm2, %%xmm3          \n\t" \
>>    "movdqa  "spill", %%xmm2          \n\t" \
>
> #ifdef ARCH_X86_64
> # define XMMS   "%%xmm12"
> #else
> # define XMMS   "%%xmm2"
> #endif
> s/%%xmm2/XMMS/
>
> #ifndef ARCH_X86_64
> "movdqa   %%xmm2, "spill"         \n\t" \
> #endif
> ...
> #ifndef ARCH_X86_64
> "movdqa  "spill", %%xmm2          \n\t" \
> #endif
>
> or a
> MOV_ONLY_ON32" %%xmm2, ...
>
>
> And i think something similar can be don with ROW*

Done. The row part is already optimal on 64 since pshufhw handles it.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sse2-xvid-idct.diff
Type: application/octet-stream
Size: 1826 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080413/5a5cce5d/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: idct_sse2_xvid.c
Type: application/octet-stream
Size: 15002 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080413/5a5cce5d/attachment-0001.obj>



More information about the ffmpeg-devel mailing list