[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Michael Niedermayer michaelni
Fri Apr 11 01:53:38 CEST 2008


On Thu, Apr 10, 2008 at 06:42:40PM -0400, Alexander Strange wrote:
>
> On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
>> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>>> This adds skal's sse2 idct and uses it as the xvid idct when available.
>>>
>>> I merged two shuffles into the permutation and changed the zero-skipping
>>> some - it's fastest in MMX and not really worth doing for the first three
>>> rows. Their right halfs are still usually all zero, but adding the branch
>>> to check for it is a net loss. The best thing for speed would be 
>>> switching
>>> IDCTs by counting the last nonzero coefficient position, but that's
>>> something for later.
>>>
>>> xvididctheader - makes a new header so I don't add any more extern
>>> declarations in .c files.
>>> sse2-permute - the new permutation; it might not have a specific enough
>>> name, but it should work as well for simpleidct as this if I can get back
>>> to that.
>>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>>
>>> The URLs in the header (copied from idct_mmx_xvid and the original nasm
>>> source) are broken at the moment, but archive.org URLs are longer than 80
>>> characters, so I left them like they are.
>>>
>>> skal agreed it could be under LGPL in the last thread.
>> [...]
>>> #define SKIP_ROW_CHECK(src)                 \
>>>    "movq     "src", %%mm0            \n\t" \
>>>    "por    8+"src", %%mm0            \n\t" \
>>>    "packssdw %%mm0, %%mm0            \n\t" \
>>>    "movd     %%mm0, %%eax            \n\t" \
>>>    "testl    %%eax, %%eax            \n\t" \
>>>    "jz 1f                            \n\t"
>>
>> You could try to check pairs of rows, this might be faster for some rows.
>> Also the code should be interleaved not form such nasty dependancy chains
>> you do have enogh mmx registers.
>
> I think the movq breaks the dependence chain, at least on my CPU. But 
> moving stuff above the branch is good - changed to check two rows at once 
> for 3-6 and use MMX pmovmskb.

Good though not exactly what i meant.
What i meant was

"movq     "row1", %%mm1           \n\t" \
"por    8+"row1", %%mm1           \n\t" \
"movq     "row2", %%mm2           \n\t" \
"por    8+"row2", %%mm2           \n\t" \
"por       %%mm1, %%mm2           \n\t" \
"paddusb   %%mm0, %%mm2           \n\t" \
"pmovmskb  %%mm2, %%eax           \n\t" \
"testl     %%eax, %%eax           \n\t" \
"jz 123f                            \n\t"

for example maybe for rows 5 and 6

of course this is just a random idea and must be tested if its any good
or not.

Also maybe some speed could be gained by writing a few custom iLLM_PASS()
for some patterns of zero rows. Note, this does not need any checks anymore
as the rows have already been checked. But care must be taken here not
to "use" to much code cache. (and like everything else ignore it if its
slower)


[...]
> Index: libavcodec/dsputil.c
> ===================================================================
> --- libavcodec/dsputil.c	(revision 12786)
> +++ libavcodec/dsputil.c	(working copy)
> @@ -151,6 +151,8 @@
>          0x32, 0x3A, 0x36, 0x3B, 0x33, 0x3E, 0x37, 0x3F,
>  };
>  
> +static uint8_t idct_sse2_row_perm[8] = {0, 4, 1, 5, 2, 6, 3, 7};

const


[...]
> @@ -50,8 +51,6 @@
>  /* reference fdct/idct */
>  extern void fdct(DCTELEM *block);
>  extern void idct(DCTELEM *block);
> -extern void ff_idct_xvid_mmx(DCTELEM *block);
> -extern void ff_idct_xvid_mmx2(DCTELEM *block);
>  extern void init_fdct();
>  
>  extern void ff_mmx_idct(DCTELEM *data);

unrelated?


[...]
> @@ -158,6 +158,8 @@
>          0x32, 0x3A, 0x36, 0x3B, 0x33, 0x3E, 0x37, 0x3F,
>  };
>  
> +static uint8_t idct_sse2_row_perm[8] = {0, 4, 1, 5, 2, 6, 3, 7};
> +
>  void idct_mmx_init(void)
>  {
>      int i;

Duplicate, couldnt we use dsputil somehow? Not overly important ...


[...]
> 
> DECLARE_ASM_CONST(16, int32_t, walkenIdctRounders[]) = {
>  65536, 65536, 65536, 65536,
>   3597,  3597,  3597,  3597,
>   2260,  2260,  2260,  2260,
>   1203,  1203,  1203,  1203,
>      0,     0,     0,     0,
>    120,   120,   120,   120,
>    512,   512,   512,   512,
>    512,   512,   512,   512
> };

The bottom 2 look redundant and the 0 looks unneeded


[...]
> #define iMTX_MULT(src, table, rounder, put) \
>     "movdqa        "src", %%xmm3      \n\t" \
>     "movdqa       %%xmm3, %%xmm0      \n\t" \

try 
"src", %%xmm0
%%xmm0, %%xmm3
that way the next punpcklqdq does not depend on the movdqa
just an idea ...
or try to move the punpcklqdq after the pshufd


>     "punpcklqdq   %%xmm0, %%xmm0      \n\t" /* 0246 */\
>     "pshufd   $0x11, %%xmm3, %%xmm1   \n\t" /* 4602 */\
>     "pshufd   $0xBB, %%xmm3, %%xmm2   \n\t" /* 5713 */\
>     "punpckhqdq   %%xmm3, %%xmm3      \n\t" /* 1357 */\
>     "pmaddwd     "table", %%xmm0      \n\t" \
>     "pmaddwd  16+"table", %%xmm1      \n\t" \
>     "pmaddwd  32+"table", %%xmm2      \n\t" \
>     "pmaddwd  48+"table", %%xmm3      \n\t" \
>     "paddd        %%xmm3, %%xmm2      \n\t" \

>     "paddd        %%xmm1, %%xmm0      \n\t" \
>     rounder",     %%xmm0              \n\t" \

maybe it is faster to move rounder before the paddd, maybe not ...


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

There will always be a question for which you do not know the correct awnser.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080411/821efd81/attachment.pgp>



More information about the ffmpeg-devel mailing list