[FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

Wed Mar 30 15:35:51 EEST 2022

On Fri, 25 Mar 2022, Ben Avison wrote:

> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
> version can still outperform the NEON version in specific cases. The balance
> between different code paths is stream-dependent, but in practice the best
> case happens about 5% of the time, the worst case happens about 40% of the
> time, and the complexity of the remaining cases fall somewhere in between.
> Therefore, taking the average of the best and worst case timings is
> probably a conservative estimate of the degree by which the NEON code
> improves performance.
>
> vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
> vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
> vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
> vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
> vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
> vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
> vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
> vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
> vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
> vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
> vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
> vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
> vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
> vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
> vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
> vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
> vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
> vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
> vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
> vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
> vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
> vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
> vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
> vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2
>
> Signed-off-by: Ben Avison <bavison at riscosopen.org>
> ---
> libavcodec/aarch64/Makefile              |   1 +
> libavcodec/aarch64/vc1dsp_init_aarch64.c |  14 +
> libavcodec/aarch64/vc1dsp_neon.S         | 698 +++++++++++++++++++++++
> 3 files changed, 713 insertions(+)
> create mode 100644 libavcodec/aarch64/vc1dsp_neon.S
>
> diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
> new file mode 100644
> index 0000000000..70391b4179
> --- /dev/null
> +++ b/libavcodec/aarch64/vc1dsp_neon.S
> @@ -0,0 +1,698 @@
> +/*
> + * VC1 AArch64 NEON optimisations
> + *
> + * Copyright (c) 2022 Ben Avison <bavison at riscosopen.org>
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +.align  5
> +.Lcoeffs:
> +.quad   0x00050002
> +

This constant is problematic when building with MSVC/armasm64.exe (via 
gas-preprocessor). (gas-preprocessor, processing assembly for use with 
armasm, can't handle any completely general gas assembly, but works fine 
as long as one sticks to the general code patterns/macros we use.)

The issue here is that this is a naked label before the first 
function/endfunc/const/endconst block. In practice it works fine if this 
follows after an existing function though. (gas-preprocessor currently 
needs an explicit .text directive, to set it up to emit code to the .text 
section. I guess I could look into handling that implicitly too.)

In practice, this issue disappears further ahead in the patch stack when 
other functions are added above this in the source file though, so it's 
not really an issue - I just thought I'd mention it.

> +// VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of vertically-neighbouring blocks
> +// On entry:
> +//   x0 -> top-left pel of lower block
> +//   w1 = row stride, bytes
> +//   w2 = PQUANT bitstream parameter
> +function ff_vc1_v_loop_filter4_neon, export=1
> +        sub             x3, x0, w1, sxtw #2
> +        sxtw            x1, w1                  // technically, stride is signed int
> +        ldr             d0, .Lcoeffs
> +        ld1             {v1.s}[0], [x0], x1     // P5
> +        ld1             {v2.s}[0], [x3], x1     // P1
> +        ld1             {v3.s}[0], [x3], x1     // P2
> +        ld1             {v4.s}[0], [x0], x1     // P6
> +        ld1             {v5.s}[0], [x3], x1     // P3
> +        ld1             {v6.s}[0], [x0], x1     // P7
> +        ld1             {v7.s}[0], [x3]         // P4
> +        ld1             {v16.s}[0], [x0]        // P8
> +        ushll           v17.8h, v1.8b, #1       // 2*P5
> +        dup             v18.8h, w2              // pq
> +        ushll           v2.8h, v2.8b, #1        // 2*P1
> +        uxtl            v3.8h, v3.8b            // P2
> +        uxtl            v4.8h, v4.8b            // P6
> +        uxtl            v19.8h, v5.8b           // P3

Overall, the code looks sensible to me.

Would it make sense to share the 
core of the filter between the horizontal/vertical cases with e.g. a 
macro? (I didn't check in detail if there's much differences in the core 
of the filter. At most some differences in condition registers for partial 
writeout in the horizontal forms?)

If it's shareable, I guess the core of the filter even could be factorized 
to a separate sub-function to avoid duplicating it across the functions? 
(For the smaller filters it's probably no big deal, but for the bigger 
filters it could maybe be worthwhile?) It probably costs a couple cycles 
extra though, so if that's a too costly here, just macroing it to avoid 
duplication is fine too.

// Martin