[FFmpeg-devel] [PATCH 4/4] avcodec/h264: sse2, avx h luma mbaff deblock/loop filter

Wed Feb 15 18:55:51 EET 2017

On 2/13/2017 9:44 AM, James Darnley wrote:
> x86-64 only
> 
> Yorkfield:
> - sse2: 2.16x (434 vs. 201 cycles)
> 
> Skylake:
> - sse2: 3.04x (378 vs. 124 cycles)
> - avx:  3.29x (378 vs. 115 cycles)
> ---
>  libavcodec/x86/h264_deblock.asm | 119 ++++++++++++++++++++++++++++++++++++++++
>  libavcodec/x86/h264dsp_init.c   |  10 ++++
>  2 files changed, 129 insertions(+)
> 
> diff --git a/libavcodec/x86/h264_deblock.asm b/libavcodec/x86/h264_deblock.asm
> index 509a0dbe0c..f47a199e8f 100644
> --- a/libavcodec/x86/h264_deblock.asm
> +++ b/libavcodec/x86/h264_deblock.asm
> @@ -377,10 +377,129 @@ cglobal deblock_h_luma_8, 5,9,0,0x60+16*WIN64
>      RET
>  %endmacro
>  
> +; TODO: use macro arguments
> +%macro TRANSPOSE_8X8B_XMM 8

Why not put this in x86util.asm? And make it use its macro arguments, of course.
Also, just call it TRANSPOSE_8X8B.

> +    punpcklbw m0, m1
> +    punpcklbw m2, m3
> +    punpcklbw m4, m5
> +    punpcklbw m6, m7
> +
> +    punpckhwd m1, m0, m2
> +    punpcklwd m0, m2

Use SBUTTERFLY here and below.

> +
> +    punpckhwd m5, m4, m6
> +    punpcklwd m4, m6
> +
> +    punpckhdq m2, m0, m4
> +    punpckldq m0, m4
> +
> +    punpckhdq m6, m1, m5
> +    punpckldq m1, m5
> +
> +    MOVHL     m4, m0
> +    MOVHL     m3, m2
> +    MOVHL     m7, m6
> +    MOVHL     m5, m1
> +    SWAP 1, 4
> +%endmacro
> +
> +%macro TRANSPOSE_8X8B_XMM 0
> +    TRANSPOSE_8X8B_XMM 0, 1, 2, 3, 4, 5, 6, 7

This zero-argument overload seems wrong, or at least superfluous: the
8-argument macro above hard-codes m0-m7 and never uses its arguments.
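
For reference, the data flow of the quoted unpack sequence (bytes, then
words, then dwords, then splitting each register into its two 8-byte halves)
can be sketched in C with SSE2 intrinsics. This is only an illustration, not
the patch's code: the function name transpose_8x8b and the pointer/stride
interface are made up for the example.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: transpose an 8x8 block of bytes.
 * Mirrors the punpck{l,h}{bw,wd,dq} steps of the quoted macro. */
void transpose_8x8b(uint8_t *dst, ptrdiff_t dst_stride,
                    const uint8_t *src, ptrdiff_t src_stride)
{
    __m128i r[8];

    /* Load each 8-byte row into the low half of a register. */
    for (int i = 0; i < 8; i++)
        r[i] = _mm_loadl_epi64((const __m128i *)(src + i * src_stride));

    /* Interleave bytes of row pairs: a0 b0 a1 b1 ... a7 b7 */
    __m128i ab = _mm_unpacklo_epi8(r[0], r[1]);
    __m128i cd = _mm_unpacklo_epi8(r[2], r[3]);
    __m128i ef = _mm_unpacklo_epi8(r[4], r[5]);
    __m128i gh = _mm_unpacklo_epi8(r[6], r[7]);

    /* Interleave words: a0 b0 c0 d0 a1 b1 c1 d1 ... */
    __m128i abcd_lo = _mm_unpacklo_epi16(ab, cd); /* columns 0..3 */
    __m128i abcd_hi = _mm_unpackhi_epi16(ab, cd); /* columns 4..7 */
    __m128i efgh_lo = _mm_unpacklo_epi16(ef, gh);
    __m128i efgh_hi = _mm_unpackhi_epi16(ef, gh);

    /* Interleave dwords: each register now holds two full columns. */
    __m128i c[4];
    c[0] = _mm_unpacklo_epi32(abcd_lo, efgh_lo); /* columns 0, 1 */
    c[1] = _mm_unpackhi_epi32(abcd_lo, efgh_lo); /* columns 2, 3 */
    c[2] = _mm_unpacklo_epi32(abcd_hi, efgh_hi); /* columns 4, 5 */
    c[3] = _mm_unpackhi_epi32(abcd_hi, efgh_hi); /* columns 6, 7 */

    /* Store the low and high 8-byte halves as output rows
     * (the high half plays the role of the MOVHL step). */
    for (int i = 0; i < 4; i++) {
        _mm_storel_epi64((__m128i *)(dst + (2 * i) * dst_stride), c[i]);
        _mm_storel_epi64((__m128i *)(dst + (2 * i + 1) * dst_stride),
                         _mm_srli_si128(c[i], 8));
    }
}
```

Input column j ends up as output row j, which is the register order the
quoted macro restores with its final SWAP.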

> +%endmacro
> +
> +%macro DEBLOCK_H_LUMA_MBAFF 0
> +
> +cglobal deblock_h_luma_mbaff_8, 5, 9, 10, 8*16, pix_, stride_, alpha_, beta_, tc0_

Why the underscores?

> +    movsxd stride_q,  stride_d
> +    dec    alpha_d
> +    dec    beta_d
> +    mov    r5,        pix_q
> +    lea    r6,       [3*stride_q]

Call r6 stride3.