[FFmpeg-devel] [PATCH 5/5] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_hv
Martin Storsjö
martin at martin.st
Sun Jul 2 00:28:17 EEST 2023
On Sun, 18 Jun 2023, Logan.Lyu wrote:
> Hi, Martin,
>
> I modified it according to your comments. Please review again.
> From 47b7f7af634add7680b56a216fff7dbe1f08cd11 Mon Sep 17 00:00:00 2001
> From: Logan Lyu <Logan.Lyu at myais.com.cn>
> Date: Sun, 28 May 2023 10:35:43 +0800
> Subject: [PATCH 5/5] lavc/aarch64: new optimization for 8-bit
> hevc_epel_uni_w_hv
>
> Signed-off-by: Logan Lyu <Logan.Lyu at myais.com.cn>
> ---
> libavcodec/aarch64/hevcdsp_epel_neon.S | 694 ++++++++++++++++++++++
> libavcodec/aarch64/hevcdsp_init_aarch64.c | 6 +
> 2 files changed, 700 insertions(+)
>
> diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
> index 8b6f396a0b..355679af29 100644
> --- a/libavcodec/aarch64/hevcdsp_epel_neon.S
> +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
> @@ -717,6 +717,700 @@ function ff_hevc_put_hevc_epel_uni_w_h64_8_neon_i8mm, export=1
> ret
> endfunc
>
> +.macro epel_uni_w_hv_start
> + mov x15, x5 //denom
> + mov x16, x6 //wx
> + mov x17, x7 //ox
> + add w15, w15, #6 //shift = denom+6
> +
> +
> + ldp x5, x6, [sp]
> + ldr x7, [sp, #16]
> +
> + stp q12, q13, [sp, #-128]!
> + stp q14, q15, [sp, #32]
> + stp q8, q9, [sp, #64]
> + stp q10, q11, [sp, #96]
Only need to back up 64 bytes, by backing up d8-d15. Also, the order
is quite weird here, why not keep them in e.g. linear order?
> +function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
> + epel_uni_w_hv_start
> + sxtw x4, w4
> +
> + add x10, x4, #3
> + lsl x10, x10, #7
> + sub sp, sp, x10 // tmp_array
> + stp xzr, x30, [sp, #-48]!
As mentioned already in the previous review - why do you back up and
restore xzr here? That's not necessary. Yes, you should keep the stack
16 byte aligned, but you can just leave an empty slot, and just do
"str x30, [sp, #-48]!" here, and vice versa with "ldr" instead of ldp
when restoring.
The same goes in all functions here.
> +2:
> + ldp q14, q15, [sp, #32]
> + ldp q8, q9, [sp, #64]
> + ldp q10, q11, [sp, #96]
> + ldp q12, q13, [sp], #128
Only need d8-d15, and weird register order here, and elsewhere.
> +function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
> + epel_uni_w_hv_start
> + sxtw x4, w4
FWIW, it's unusual to need an explicit sxtw instruction, but I guess
if you use it in the form "add x10, x4, #3" it might be needed.
> +function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
> + ldp x15, x16, [sp]
> + stp x0, x30, [sp, #-16]!
> + stp x1, x2, [sp, #-16]!
> + stp x3, x4, [sp, #-16]!
> + stp x5, x6, [sp, #-16]!
Don't do consecutive stack pointer updates like this, but merge it
into one large stack decrement followed by positive offsets, like in
all the other cases of stp/ldp.
> + mov x17, #16
> + stp x17, x7, [sp, #-16]!
> + stp x15, x16, [sp, #-16]!
> + bl X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm)
> + ldp x15, x16, [sp], #16
> + ldp x17, x7, [sp], #16
> + ldp x5, x6, [sp], #16
> + ldp x3, x4, [sp], #16
> + ldp x1, x2, [sp], #16
> + ldr x0, [sp]
> + add x0, x0, #16
> + add x2, x2, #16
> + mov x17, #16
> + stp x17, xzr, [sp, #-16]!
> + stp x15, x16, [sp, #-16]!
Don't do multiple stack decrements, don't needlessly store xzr here.
The same goes for all the other functions in this patch.
// Martin
More information about the ffmpeg-devel
mailing list