[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Michael Niedermayer michaelni
Sun Apr 6 18:14:48 CEST 2008


On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
> This adds skal's sse2 idct and uses it as the xvid idct when available.
>
> I merged two shuffles into the permutation and changed the zero-skipping 
> some - it's fastest in MMX and not really worth doing for the first three 
> rows. Their right halves are still usually all zero, but adding the branch 
> to check for it is a net loss. The best thing for speed would be switching 
> IDCTs by counting the last nonzero coefficient position, but that's 
> something for later.
>
> xvididctheader - makes a new header so I don't add any more extern 
> declarations in .c files.
> sse2-permute - the new permutation; it might not have a specific enough 
> name, but it should work as well for simpleidct as this if I can get back 
> to that.
> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>
> The URLs in the header (copied from idct_mmx_xvid and the original nasm 
> source) are broken at the moment, but archive.org URLs are longer than 80 
> characters, so I left them like they are.
>
> skal agreed it could be under LGPL in the last thread.
[...]
> #define SKIP_ROW_CHECK(src)                 \
>     "movq     "src", %%mm0            \n\t" \
>     "por    8+"src", %%mm0            \n\t" \
>     "packssdw %%mm0, %%mm0            \n\t" \
>     "movd     %%mm0, %%eax            \n\t" \
>     "testl    %%eax, %%eax            \n\t" \
>     "jz 1f                            \n\t"

You could try checking pairs of rows; this might be faster for some rows.
Also, the code should be interleaved rather than forming such nasty dependency
chains; you do have enough MMX registers.
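
As a rough illustration of the pair-of-rows idea, here is a minimal sketch in
plain C rather than MMX. The helper name row_pair_is_zero and the assumption
that the block is a contiguous array of 64 int16_t coefficients are
illustrative only and are not taken from the patch. The ORs are grouped so
that two of them are independent of each other, which is the kind of
interleaving suggested above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the patch: returns 1 when rows `row` and
 * `row + 1` of an 8x8 block of 16-bit coefficients are both entirely zero. */
static int row_pair_is_zero(const int16_t block[64], int row)
{
    uint64_t q[4];

    /* Two adjacent rows are 2 * 8 * 2 = 32 contiguous bytes; memcpy keeps
     * the loads free of alignment and aliasing problems. */
    memcpy(q, block + 8 * row, sizeof(q));

    /* The two ORs in parentheses do not depend on each other, so they can
     * be issued in parallel instead of forming one serial chain. */
    return ((q[0] | q[1]) | (q[2] | q[3])) == 0;
}
```

One test and branch then covers two rows instead of one; whether that is a
net win presumably depends on how often both rows of a pair are zero.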


> 
> #define iMTX_MULT(src, table, rounder)      \
>     "movdqa   "src", %%xmm0         \n\t"   \

>     "pshufd      $0, %%xmm0, %%xmm4 \n\t"   \
>     "pshufd   $0x55, %%xmm0, %%xmm6 \n\t"   \
>     "pshufd   $0xAA, %%xmm0, %%xmm5 \n\t"   \
>     "pshufd   $0xFF, %%xmm0, %%xmm7 \n\t"   \

You can replace 2 of the pshufd by 1 movdqa, 1 unpckldqd and 1 unpckhdqd;
considering that pshufd seems to be slower, this _could_ be faster.
Here are my notes about it:
02461357
02461357 mov
02460246 unpck
13571357 unpck
46024602 shufld
57135713 shufld
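
The notes can be checked with a small scalar model. This is only a sketch
under several assumptions that are not stated in the mail: each digit is read
as the index of a 16-bit coefficient in one lane of the register, the two
"unpck" steps are read as quadword unpacks of a register with itself
(punpcklqdq / punpckhqdq), and "shufld" is read as pshufd with immediate 0xB1
(swap the dwords within each pair). The type xmm_t and the function split_row
are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint16_t w[8]; } xmm_t;   /* eight 16-bit lanes, lane 0 first */

/* punpcklqdq dst, src: low qword of dst is kept, high qword = low qword of src */
static xmm_t punpcklqdq(xmm_t dst, xmm_t src)
{
    memcpy(&dst.w[4], &src.w[0], 8);
    return dst;
}

/* punpckhqdq dst, src: low qword = high qword of dst, high qword = high qword of src */
static xmm_t punpckhqdq(xmm_t dst, xmm_t src)
{
    memcpy(&dst.w[0], &dst.w[4], 8);
    memcpy(&dst.w[4], &src.w[4], 8);
    return dst;
}

/* pshufd imm, src, dst: dword i of dst = dword ((imm >> 2i) & 3) of src */
static xmm_t pshufd(xmm_t src, unsigned imm)
{
    xmm_t dst;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm >> (2 * i)) & 3;
        dst.w[2 * i]     = src.w[2 * sel];
        dst.w[2 * i + 1] = src.w[2 * sel + 1];
    }
    return dst;
}

/* Assumed reading of the notes: one copy, two unpacks, two shuffles. */
static void split_row(xmm_t in, xmm_t out[4])
{
    xmm_t copy = in;                   /* movdqa:  02461357 */
    out[0] = punpcklqdq(in, in);       /* unpck:   02460246 */
    out[1] = punpckhqdq(copy, copy);   /* unpck:   13571357 */
    out[2] = pshufd(out[0], 0xB1);     /* shuffle: 46024602 */
    out[3] = pshufd(out[1], 0xB1);     /* shuffle: 57135713 */
}
```

Note that these four results are not the same lane patterns as the four
broadcasting pshufd in the quoted macro (which would give 02020202, 46464646,
13131313 and 57575757 for the same input), so under this reading the
multiplier tables would presumably have to be reordered to match.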


[...]
> #define iLLM_PASS(dct)                      \
>     "movdqa   "MANGLE(tan3)", %%xmm0  \n\t" \
>     "movdqa      3*16("dct"), %%xmm3  \n\t" \
>     "movdqa           %%xmm0, %%xmm1  \n\t" \
>     "movdqa      5*16("dct"), %%xmm5  \n\t" \
>     "movdqa   "MANGLE(tan1)", %%xmm4  \n\t" \
>     "movdqa        16("dct"), %%xmm6  \n\t" \
>     "movdqa      7*16("dct"), %%xmm7  \n\t" \

If I didn't miscalculate, you can keep 4 of the above in registers
from the row transform (and all 8 dct values on x86_64).


[...]
>     "movdqa   %%xmm2, ("dct")         \n\t" \
>     "movdqa   %%xmm3, %%xmm2          \n\t" \
>     "psubsw   %%xmm6, %%xmm3          \n\t" \
>     "paddsw   %%xmm2, %%xmm6          \n\t" \
>     "movdqa   %%xmm6, %%xmm2          \n\t" \
>     "psubsw   %%xmm7, %%xmm6          \n\t" \
>     "paddsw   %%xmm2, %%xmm7          \n\t" \
>     "movdqa   %%xmm3, %%xmm2          \n\t" \
>     "psubsw   %%xmm5, %%xmm3          \n\t" \
>     "paddsw   %%xmm2, %%xmm5          \n\t" \
>     "movdqa   %%xmm5, %%xmm2          \n\t" \
>     "psubsw   %%xmm0, %%xmm5          \n\t" \
>     "paddsw   %%xmm2, %%xmm0          \n\t" \
>     "movdqa   %%xmm3, %%xmm2          \n\t" \
>     "psubsw   %%xmm4, %%xmm3          \n\t" \
>     "paddsw   %%xmm2, %%xmm4          \n\t" \
>     "movdqa  ("dct"), %%xmm2          \n\t" \

I suspect this can be written without the load/store by using
add, add, sub butterflies (of course, only if it is faster).
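
One possible reading of an add, add, sub butterfly, sketched in scalar C: the
sum and difference are formed in place, so no scratch register (and hence no
spill of xmm2 to memory) is needed. The function name butterfly_inplace is
hypothetical, and the sketch uses plain wrapping-free arithmetic on small
values; the quoted code uses the saturating paddsw/psubsw, for which this
rewrite is only equivalent while 2*a stays within 16-bit range.

```c
#include <stdint.h>

/* Hypothetical sketch: on return *b holds a + b and *a holds a - b,
 * computed without a temporary copy of either input. */
static void butterfly_inplace(int16_t *a, int16_t *b)
{
    *b = (int16_t)(*b + *a);   /* add: b = a + b                 */
    *a = (int16_t)(*a + *a);   /* add: a = 2a                    */
    *a = (int16_t)(*a - *b);   /* sub: a = 2a - (a + b) = a - b  */
}
```

The trade-off is that the three operations depend on each other, whereas the
copy-based form has an independent add and sub, which is likely why this only
helps if it actually measures faster.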


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No great genius has ever existed without some touch of madness. -- Aristotle