[FFmpeg-devel] [Patch][OpenHEVC]added ASM DBF functions
James Almer
jamrial at gmail.com
Thu May 15 19:29:55 CEST 2014
On 15/05/14 11:40 AM, Pierre Edouard Lepere wrote:
> Hi,
> Here is a patch adding Seppo Tomperi's ASM functions for HEVC loop filters with some quick fixes and cosmetic changes.
>
> Regards,
> Pierre-Edouard Lepere
A couple comments below.
> +SECTION_RODATA
> +
> +pw_pixel_max: times 8 dw ((1 << 10)-1)
> +
> +SECTION .text
> +INIT_XMM sse2
> +
> +; expands to [base],...,[base+7*stride]
> +%define PASS8ROWS(base, base3, stride, stride3) \
> + [base], [base+stride], [base+stride*2], [base3], \
> + [base3+stride], [base3+stride*2], [base3+stride3], [base3+stride*4]
> +
> +; in: 8 rows of 4 bytes in %4..%11
> +; out: 4 rows of 8 words in m0..m3
> +%macro TRANSPOSE4x8B_LOAD 8
> + movd m0, %1
> + movd m2, %2
> + movd m1, %3
> + movd m3, %4
> +
> + punpcklbw m0, m2
> + punpcklbw m1, m3
> + punpcklwd m0, m1
> +
> + movd m4, %5
> + movd m6, %6
> + movd m5, %7
> + movd m7, %8
> +
> + punpcklbw m4, m6
> + punpcklbw m5, m7
> + punpcklwd m4, m5
> +
> + movdqa m2, m0
> + punpckldq m0, m4
> + punpckhdq m2, m4
There are tons of cases like this where you should instead use a 3-operand form,
and let x86inc take care of the copy instruction if needed.
This will let you add xmm AVX versions of the functions that will be faster than
their SSE2/SSSE3 counterparts because all these movdqa will be removed.
Also a nit: In general we use mova instead of movdqa/movaps. x86inc expands it
to the correct instruction depending on what you used for INIT_[XY]MM.
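For illustration, the last three instructions quoted above could be written
roughly as below. This is only a sketch: it assumes the file is built with
x86inc as usual, and it has not been assembled or tested.

```
    ; High unpack first, since the low unpack overwrites m0.
    ; For an SSE2 build x86inc emits the copy into m2 itself;
    ; for an AVX build this becomes a single vpunpckhdq with no copy.
    punpckhdq m2, m0, m4
    punpckldq m0, m4
```

The explicit movdqa disappears from the source, and the same macro body can
then be instantiated for both SSE2 and AVX.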
[...]
> +; input in m0 ... m7, betas in r2 tcs in r3. Output in m1...m6
> +%macro LUMA_DEBLOCK_BODY 2
> + movdqa m9, m2
> + psllw m9, 1; *2
> + movdqa m10, m1
> + psubw m10, m9
> + paddw m10, m3
> + pabsw m10, m10 ; 0dp0, 0dp3 , 1dp0, 1dp3
Use ABS1 here, or PABSW when the source and destination registers differ.
pabsw itself needs SSSE3, but these macros fall back to an emulation on older
CPUs, so you can then add an SSE2 version of the luma functions as well (Phenom
users will thank you).
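For the pabsw quoted above, the change might look like the sketch below. It
assumes the "ABS1 a, tmp" signature from x86util.asm and that m11 is free as a
scratch register at this point in LUMA_DEBLOCK_BODY. Both are assumptions to
check against the actual file.

```
    ; In-place absolute value of the words in m10.
    ; m11 is a hypothetical scratch register, only touched on the
    ; pre-SSSE3 path; with SSSE3 this is expected to be a plain pabsw.
    ABS1      m10, m11
```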