[FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths
Martin Storsjö
martin at martin.st
Wed Mar 30 15:35:51 EEST 2022
On Fri, 25 Mar 2022, Ben Avison wrote:
> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
> version can still outperform the NEON version in specific cases. The balance
> between different code paths is stream-dependent, but in practice the best
> case happens about 5% of the time, the worst case happens about 40% of the
> time, and the complexity of the remaining cases fall somewhere in between.
> Therefore, taking the average of the best and worst case timings is
> probably a conservative estimate of the degree by which the NEON code
> improves performance.
>
> vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
> vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
> vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
> vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
> vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
> vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
> vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
> vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
> vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
> vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
> vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
> vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
> vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
> vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
> vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
> vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
> vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
> vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
> vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
> vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
> vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
> vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
> vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
> vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2
>
> Signed-off-by: Ben Avison <bavison at riscosopen.org>
> ---
> libavcodec/aarch64/Makefile | 1 +
> libavcodec/aarch64/vc1dsp_init_aarch64.c | 14 +
> libavcodec/aarch64/vc1dsp_neon.S | 698 +++++++++++++++++++++++
> 3 files changed, 713 insertions(+)
> create mode 100644 libavcodec/aarch64/vc1dsp_neon.S
>
> diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
> new file mode 100644
> index 0000000000..70391b4179
> --- /dev/null
> +++ b/libavcodec/aarch64/vc1dsp_neon.S
> @@ -0,0 +1,698 @@
> +/*
> + * VC1 AArch64 NEON optimisations
> + *
> + * Copyright (c) 2022 Ben Avison <bavison at riscosopen.org>
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +.align 5
> +.Lcoeffs:
> +.quad 0x00050002
> +
This constant is problematic when building with MSVC/armasm64.exe (via
gas-preprocessor). (gas-preprocessor, processing assembly for use with
armasm, can't handle any completely general gas assembly, but works fine
as long as one sticks to the general code patterns/macros we use.)
The issue here is that this is a naked label before the first
function/endfunc/const/endconst block. In practice it works fine if this
follows after an existing function though. (gas-preprocessor currently
needs an explicit .text directive, to set it up to emit code to the .text
section. I guess I could look into handling that implicitly too.)
In practice, this issue disappears further ahead in the patch stack when
other functions are added above this in the source file though, so it's
not really an issue - I just thought I'd mention it.
> +// VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of vertically-neighbouring blocks
> +// On entry:
> +// x0 -> top-left pel of lower block
> +// w1 = row stride, bytes
> +// w2 = PQUANT bitstream parameter
> +function ff_vc1_v_loop_filter4_neon, export=1
> + sub x3, x0, w1, sxtw #2
> + sxtw x1, w1 // technically, stride is signed int
> + ldr d0, .Lcoeffs
> + ld1 {v1.s}[0], [x0], x1 // P5
> + ld1 {v2.s}[0], [x3], x1 // P1
> + ld1 {v3.s}[0], [x3], x1 // P2
> + ld1 {v4.s}[0], [x0], x1 // P6
> + ld1 {v5.s}[0], [x3], x1 // P3
> + ld1 {v6.s}[0], [x0], x1 // P7
> + ld1 {v7.s}[0], [x3] // P4
> + ld1 {v16.s}[0], [x0] // P8
> + ushll v17.8h, v1.8b, #1 // 2*P5
> + dup v18.8h, w2 // pq
> + ushll v2.8h, v2.8b, #1 // 2*P1
> + uxtl v3.8h, v3.8b // P2
> + uxtl v4.8h, v4.8b // P6
> + uxtl v19.8h, v5.8b // P3
Overall, the code looks sensible to me.
Would it make sense to share the
core of the filter between the horizontal/vertical cases with e.g. a
macro? (I didn't check in detail if there's much differences in the core
of the filter. At most some differences in condition registers for partial
writeout in the horizontal forms?)
If it's shareable, I guess the core of the filter even could be factorized
to a separate sub-function to avoid duplicating it across the functions?
(For the smaller filters it's probably no big deal, but for the bigger
filters it could maybe be worthwhile?) It probably costs a couple cycles
extra though, so if that's a too costly here, just macroing it to avoid
duplication is fine too.
// Martin
More information about the ffmpeg-devel
mailing list