[FFmpeg-devel] [Patch][OpenHEVC]added ASM DBF functions

Thu May 15 19:29:55 CEST 2014

On 15/05/14 11:40 AM, Pierre Edouard Lepere wrote:
> Hi,
> Here is a patch adding Seppo Tomperi's ASM functions for HEVC loop filters with some quick fixes and cosmetic changes.
> 
> Regards,
> Pierre-Edouard Lepere

A couple of comments below.

> +SECTION_RODATA
> +
> +pw_pixel_max: times 8 dw ((1 << 10)-1)
> +
> +SECTION .text
> +INIT_XMM sse2
> +
> +; expands to [base],...,[base+7*stride]
> +%define PASS8ROWS(base, base3, stride, stride3) \
> +    [base], [base+stride], [base+stride*2], [base3], \
> +    [base3+stride], [base3+stride*2], [base3+stride3], [base3+stride*4]
> +
> +; in: 8 rows of 4 bytes in %4..%11
> +; out: 4 rows of 8 words in m0..m3
> +%macro TRANSPOSE4x8B_LOAD 8
> +    movd             m0, %1
> +    movd             m2, %2
> +    movd             m1, %3
> +    movd             m3, %4
> +
> +    punpcklbw        m0, m2
> +    punpcklbw        m1, m3
> +    punpcklwd        m0, m1
> +
> +    movd             m4, %5
> +    movd             m6, %6
> +    movd             m5, %7
> +    movd             m7, %8
> +
> +    punpcklbw        m4, m6
> +    punpcklbw        m5, m7
> +    punpcklwd        m4, m5
> +
> +    movdqa           m2, m0
> +    punpckldq        m0, m4
> +    punpckhdq        m2, m4

There are many cases like this where you should use the 3-operand form instead
and let x86inc emit the copy instruction when one is needed.
That will let you add xmm AVX versions of the functions, which will be faster
than their SSE2/SSSE3 counterparts because all these movdqa copies go away.
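
A rough, untested sketch of the idea for the sequence quoted above, assuming
x86inc's usual emulation of 3-operand syntax on pre-AVX targets:

```nasm
    ; 3-operand form: under INIT_XMM sse2, x86inc inserts the register copy
    ; itself; under INIT_XMM avx this is a single VEX-encoded instruction.
    punpckhdq        m2, m0, m4   ; must come first, it still reads the old m0
    punpckldq        m0, m4       ; overwrites m0 in place, as before
```

Note that the two unpacks are swapped relative to the original, because
punpckldq clobbers m0.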

Also a nit: In general we use mova instead of movdqa/movaps. x86inc expands it 
to the correct instruction depending on what you used for INIT_[XY]MM.

[...]

> +; input in m0 ... m7, betas in r2 tcs in r3. Output in m1...m6
> +%macro LUMA_DEBLOCK_BODY 2
> +    movdqa           m9, m2
> +    psllw            m9, 1; *2
> +    movdqa          m10, m1
> +    psubw           m10, m9
> +    paddw           m10, m3
> +    pabsw           m10, m10 ; 0dp0, 0dp3 , 1dp0, 1dp3

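Use the ABS1 macro here, or PABSW where pabsw is used with two different
registers. pabsw itself requires SSSE3, while the macros emit an equivalent
sequence on older instruction sets, so you can then add an SSE2 version of the
luma functions as well (Phenom users will thank you).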
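
For illustration, an untested sketch of the kind of SSE2-only sequence that can
stand in for "pabsw m10, m10". It is hand-written rather than copied from the
macros, and m11 is assumed to be free as a scratch register at this point,
which the patch may not guarantee:

```nasm
    ; SSE2 replacement for: pabsw m10, m10
    ; m11 is a hypothetical scratch register, assumed unused here
    pxor            m11, m11      ; m11 = 0
    psubw           m11, m10      ; m11 = -m10 (per 16-bit word)
    pmaxsw          m10, m11      ; m10 = max(m10, -m10) = |m10|
```

Like pabsw, this leaves a word of -32768 unchanged.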