[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Sun Aug 18 23:43:23 EEST 2024

On Sun, 18 Aug 2024, Ramiro Polla wrote:

>                   A53             A76
> pix_norm1_c:     519.2           231.5
> pix_norm1_neon:  195.0 ( 2.66x)   44.2 ( 5.24x)
> pix_sum_c:       344.5           242.2
> pix_sum_neon:    119.0 ( 2.89x)   41.7 ( 5.81x)
> ---

Hmm, those speedups on the A53 look quite small. I guess that's because 
this isn't unrolled at all, as you mention. Especially for A53, I would 
expect unrolling to have a very large effect here. But it sounds weird if 
you say perf indicates that it is slower in real world use. Yes, unrolling 
does make the code use more space and makes the I-cache less efficient, 
but in this case it would only be a difference of like 2 instructions?

> diff --git a/libavcodec/aarch64/mpegvideoencdsp_neon.S b/libavcodec/aarch64/mpegvideoencdsp_neon.S
> new file mode 100644
> index 0000000000..89e50e29b3
> --- /dev/null
> +++ b/libavcodec/aarch64/mpegvideoencdsp_neon.S
> @@ -0,0 +1,67 @@
> +/*
> + * Copyright (c) 2024 Ramiro Polla
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +function ff_pix_sum16_neon, export=1
> +// x0  const uint8_t *pix
> +// x1  int line_size
> +
> +        sxtw            x1, w1
> +        movi            v0.16b, #0
> +        mov             w2, #16
> +
> +1:
> +        ld1             { v1.16b }, [x0], x1

Nit; we usually don't have these {} written with spaces inside of the 
braces, same below.

> +        subs            w2, w2, #1
> +        uadalp          v0.8h, v1.16b
> +        b.ne            1b
> +
> +        uaddlp          v0.4s, v0.8h
> +        uaddlv          d0, v0.4s

Couldn't this be aggregated with just one instruction, "uaddlv s0, v0.8h"? 
There's no need to widen it to 64 bit as we're truncating the returned 
value to 32 bit anyway.

// Martin