[FFmpeg-devel] [PATCH] avcodec/vp9: add vp9_idct_idct_4x4_add_ssse3

Tue Oct 29 11:36:16 CET 2013

Hi,

Nice work overall. Some suggestions for testing:

On Mon, Oct 28, 2013 at 3:56 PM, Clément Bœsch <u at pkh.me> wrote:
>
> +; (a*x + b*y + round) >> shift
> +%macro VP9_MULSUB_2W_2X 6 ; dst1, dst2, src (unchanged), round, coefs1, coefs2
> +    movq                m%1, [%5]
> +    movq                m%2, [%6]
> +    pmaddwd             m%1, m%3
> +    pmaddwd             m%2, m%3
> +    paddd               m%1, m%4
> +    paddd               m%2, m%4
> +    psrad               m%1, 14
> +    psrad               m%2, 14
> +%endmacro
> +
> +%macro VP9_IDCT4_1D 0
> +    SUMSUB_BA           w, 2, 0, 4
> +    movq                m4, [pw_11585x2]
> +    pmulhrsw            m0, m4                              ; m0=t1
> +    pmulhrsw            m2, m4                              ; m2=t0
> +    movq                m6, m3
> +    punpckhwd           m3, m1
> +    VP9_MULSUB_2W_2X     4, 5, 3, 7, pw_t2_coef, pw_t3_coef
> +    punpcklwd           m6, m1
> +    VP9_MULSUB_2W_2X     1, 3, 6, 7, pw_t2_coef, pw_t3_coef

> +    packssdw            m1, m4                              ; m1=t2
> +    packssdw            m3, m5                              ; m3=t3
>

So what you're doing here is splitting 8 words over 2 registers so we can
do paired multiplications etc. I wonder whether it'd be faster if (at least
for the full idct) we moved to XMM registers, so this would all be a single
register and the 2 halves could both be done in a single VP9_MULSUB_2W_2X.
You can do INIT_XMM ssse3 and INIT_MMX ssse3 inside functions to switch
between the two. Just make sure you manually back up xmm6-7 for Win64
(there's a utility macro for that in x86inc.asm, ask if you need help).
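
If you try the XMM variant, a scalar reference for one lane of
VP9_MULSUB_2W_2X is handy to compare against. This is only a sketch: the
name mulsub_ref is made up, and the rounding constant (1 << 13) is my
assumption based on the shift of 14 in the macro, not something shown in
the quoted hunk.

```c
#include <stdint.h>

/* Scalar model of one pmaddwd + paddd + psrad lane:
 * (a*x + b*y + round) >> 14.
 * round = 1 << 13 is an assumption (half of 1 << 14).
 * Relies on >> of a negative int32_t being an arithmetic shift,
 * which matches psrad and holds for gcc/clang/msvc. */
static int32_t mulsub_ref(int16_t x, int16_t y, int16_t a, int16_t b)
{
    const int32_t round = 1 << 13;
    return ((int32_t)a * x + (int32_t)b * y + round) >> 14;
}
```

Feeding the same x/y pairs and coefficients to both the MMX and XMM paths
and checking each output word against this should catch any lane
mix-ups after the punpck/packssdw steps.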

> +%macro VP9_STORE_2X 2
> +    movd                m6, [dstq]
> +    movd                m7, [dstq+strideq]
> +    punpcklbw           m6, m4
> +    punpcklbw           m7, m4
> +    paddw               m6, %1
> +    paddw               m7, %2
> +    packuswb            m6, m4
> +    packuswb            m7, m4
> +    movd            [dstq], m6
> +    movd    [dstq+strideq], m7
> +%endmacro
>

Here too, using XMM could save you work. Each register can hold 2 4-byte
rows, so you'd actually cover 4 rows at once if you pair them like this.
And, as Kieran mentioned, the zeroing itself could be 2 movdqa stores
instead of 4 movq. So perhaps for the full IDCT, XMM does make sense?
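
For testing the store path, something like the following scalar model
could serve as the reference. Again just a sketch: store_4x4_ref and
clip_u8 are hypothetical names, and I'm assuming the residual passed in is
already rounded and shifted, as the quoted macro seems to expect.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp to [0, 255], mirroring the unsigned saturation of packuswb. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Add a 4x4 block of 16-bit residuals to dst, one row per stride.
 * Models punpcklbw + paddw + packuswb for values that stay within
 * 16-bit range (paddw itself wraps, this model does not). */
static void store_4x4_ref(uint8_t *dst, ptrdiff_t stride, const int16_t res[16])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] = clip_u8(dst[y * stride + x] + res[y * 4 + x]);
}
```

Comparing dst against this after the asm runs, with residuals that push
pixels past both 0 and 255, exercises the saturation in both directions.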

Ronald