[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Mon Aug 19 00:58:12 EEST 2024

On Sun, Aug 18, 2024 at 10:43 PM Martin Storsjö <martin at martin.st> wrote:
> On Sun, 18 Aug 2024, Ramiro Polla wrote:
>
> >                   A53             A76
> > pix_norm1_c:     519.2           231.5
> > pix_norm1_neon:  195.0 ( 2.66x)   44.2 ( 5.24x)
> > pix_sum_c:       344.5           242.2
> > pix_sum_neon:    119.0 ( 2.89x)   41.7 ( 5.81x)
> > ---
>
> Hmm, those speedups on the A53 look quite small. I guess that's because
> this isn't unrolled at all, as you mention. Especially for A53, I would
> expect unrolling to have a very large effect here. But it sounds weird if
> you say perf indicates that it is slower in real world use. Yes, unrolling
> does make the code use more space and makes the I-cache less efficient,
> but in this case it would only be a difference of like 2 instructions?

These are the checkasm benchmarks I got for the unrolled version with
manual instruction ordering to give better results on the A53 (patch
attached for reference):
                      A53             A76
pix_norm1_c:        519.0           231.7
pix_norm1_neon:     140.0 ( 3.71x)   41.5 ( 5.58x)
pix_norm1_dotprod:                   17.2 (13.47x)
pix_sum_c:          347.2           242.0
pix_sum_neon:        72.0 ( 4.82x)   21.0 (11.52x)
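
To make "unrolled" concrete: it means handling more than one row per loop iteration, which halves the number of branches and gives the core two independent dependency chains to overlap. Below is a minimal scalar C sketch of that idea for pix_norm1 (sum of squares over a 16x16 block). It is not the attached patch and not FFmpeg's C reference; the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of squares over one 16-pixel row. */
static uint32_t row_norm1(const uint8_t *pix)
{
    uint32_t s = 0;
    for (int x = 0; x < 16; x++)
        s += (uint32_t)pix[x] * pix[x];
    return s;
}

/* Rolled: one row per iteration, so the loop branch is taken for all 16 rows. */
static uint32_t pix_norm1_rolled(const uint8_t *pix, ptrdiff_t line_size)
{
    uint32_t sum = 0;
    for (int y = 0; y < 16; y++) {
        sum += row_norm1(pix);
        pix += line_size;
    }
    return sum;
}

/* Unrolled by two: 8 iterations, each handling two rows with separate
 * accumulators so neither addition has to wait for the other. */
static uint32_t pix_norm1_unrolled2(const uint8_t *pix, ptrdiff_t line_size)
{
    uint32_t sum0 = 0, sum1 = 0;
    for (int y = 0; y < 16; y += 2) {
        sum0 += row_norm1(pix);
        sum1 += row_norm1(pix + line_size);
        pix += 2 * line_size;
    }
    return sum0 + sum1;
}
```

Both variants return the same value; only the loop structure differs, which is why any difference between them is purely a matter of timing.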

I had tested the real-world case on the A76, but not on the A53. I
spent a couple of hours with perf trying to find the source of the
discrepancy, but I couldn't find anything conclusive. I need to learn
more about how to measure cache misses.

I just tested again with the following command:
$ taskset -c 2 ./ffmpeg_g -benchmark -f lavfi -i "testsrc2=size=1920x1080" \
    -vcodec mpeg4 -q 31 -vframes 100 -f rawvideo -y /dev/null

The entire test was about 1% faster unrolled on the A53, but about 1%
slower unrolled on the A76. I had the Raspberry Pi 5 in mind for these
optimizations, so I preferred the version that was faster on the A76.
I wonder if there is any way we could select between the two versions
at runtime. Part of the problem is that I don't know for certain what
is causing the difference.
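
One possible shape for such a runtime check on Linux, as a rough sketch only: read the "CPU part" field from /proc/cpuinfo (it mirrors the PartNum field of MIDR_EL1, 0xd03 for Cortex-A53 and 0xd0b for Cortex-A76) and pick a variant from it. This is not something FFmpeg's CPU detection provides as far as this thread shows; parse_cpu_part() and prefer_unrolled() are hypothetical names, and the choice they encode rests only on the 1% figures above.

```c
#include <stdlib.h>
#include <string.h>

/* PartNum values from MIDR_EL1 for the two cores discussed here. */
#define PART_CORTEX_A53 0xd03
#define PART_CORTEX_A76 0xd0b

/* Hypothetical helper: return the first "CPU part" value found in
 * /proc/cpuinfo-style text, or -1 if the field is missing. */
static long parse_cpu_part(const char *cpuinfo)
{
    const char *p = strstr(cpuinfo, "CPU part");
    if (!p)
        return -1;
    p = strchr(p, ':');
    if (!p)
        return -1;
    /* strtol skips the leading whitespace and accepts the "0x" prefix. */
    return strtol(p + 1, NULL, 16);
}

/* Hypothetical policy: unrolled only where it measured faster (A53). */
static int prefer_unrolled(long part)
{
    return part == PART_CORTEX_A53;
}
```

The obvious weakness is that this only looks at the first core listed, so on a system that mixes core types it says nothing about where the thread actually runs.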

> > diff --git a/libavcodec/aarch64/mpegvideoencdsp_neon.S b/libavcodec/aarch64/mpegvideoencdsp_neon.S
> > new file mode 100644
> > index 0000000000..89e50e29b3
> > --- /dev/null
> > +++ b/libavcodec/aarch64/mpegvideoencdsp_neon.S
> > @@ -0,0 +1,67 @@
> > +/*
> > + * Copyright (c) 2024 Ramiro Polla
> > + *
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> > + */
> > +
> > +#include "libavutil/aarch64/asm.S"
> > +
> > +function ff_pix_sum16_neon, export=1
> > +// x0  const uint8_t *pix
> > +// x1  int line_size
> > +
> > +        sxtw            x1, w1
> > +        movi            v0.16b, #0
> > +        mov             w2, #16
> > +
> > +1:
> > +        ld1             { v1.16b }, [x0], x1
>
> Nit; we usually don't have these {} written with spaces inside of the
> braces, same below.

Oops, I should check my other neon code then...

> > +        subs            w2, w2, #1
> > +        uadalp          v0.8h, v1.16b
> > +        b.ne            1b
> > +
> > +        uaddlp          v0.4s, v0.8h
> > +        uaddlv          d0, v0.4s
>
> Couldn't this be aggregated with just one instruction, "uaddlv s0, v0.8h"?
> There's no need to widen it to 64 bit as we're truncating the returned
> value to 32 bit anyway.

Yes, that works. I'll fix it in the next iteration.
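
For reference, the ranges work out as follows: each 16-bit lane accumulates two bytes per row over 16 rows, so at most 16 * 2 * 255 = 8160, and the total over the block is at most 16 * 16 * 255 = 65280, which fits comfortably in 32 bits. A scalar C model of the reduction, written only to illustrate that arithmetic (pix_sum16_model is a made-up name, not code from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the accumulation in ff_pix_sum16_neon.
 * lanes[] stands in for v0.8h: uadalp adds each adjacent pair of bytes
 * from the loaded row into the corresponding 16-bit lane. */
static uint32_t pix_sum16_model(const uint8_t *pix, ptrdiff_t line_size)
{
    uint16_t lanes[8] = { 0 };

    for (int y = 0; y < 16; y++) {
        for (int i = 0; i < 8; i++) {
            uint32_t next = (uint32_t)lanes[i] + pix[2 * i] + pix[2 * i + 1];
            assert(next <= UINT16_MAX); /* worst case is 8160 */
            lanes[i] = (uint16_t)next;
        }
        pix += line_size;
    }

    /* Counterpart of "uaddlv s0, v0.8h": widen the eight lanes and add
     * them into a single 32-bit result. */
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += lanes[i];
    return sum; /* at most 65280 */
}
```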
-------------- next part --------------
A non-text attachment was scrubbed...
Name: aarch64_mpegvideoencdsp_unrolled.patch
Type: text/x-patch
Size: 11076 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20240818/364d61e8/attachment.bin>