[FFmpeg-devel] [PATCH v2 3/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Mon Aug 26 13:55:51 EEST 2024

On Thu, Aug 22, 2024 at 1:29 PM Ramiro Polla <ramiro.polla at gmail.com> wrote:
> On Wed, Aug 21, 2024 at 9:44 PM Martin Storsjö <martin at martin.st> wrote:
> > On Wed, 21 Aug 2024, Ramiro Polla wrote:
> > >> BTW, this instruction is kinda exotic and the docs aren't super clear, so
> > >> it'd be good to test manually that it really does what we want, for
> > >> negative numbers and numbers close to the ends of the value range; I
> > >> didn't do that manually yet.
> > >
> > > I prefer just sticking to sxtw + lsl then. When we move to ptrdiff_t
> > > the sxtw will be gone anyway.
> >
> > This sounds like a very reasonable choice indeed, especially if it's
> > somewhat plausible that we'll get rid of it at some point in the future.
> >
> > >>> +        movi            v0.16b, #0
> > >>> +        mov             w3, #16
> > >>> +
> > >>> +1:
> > >>> +        ld1             {v1.16b}, [x0], x1
> > >>> +        ld1             {v2.16b}, [x2], x1
> > >>> +        subs            w3, w3, #2
> > >>> +        uadalp          v0.8h, v1.16b
> > >>> +        uadalp          v0.8h, v2.16b
> > >>> +        b.ne            1b
> > >>> +
> > >>> +        uaddlv          s0, v0.8h
> > >>> +        fmov            w0, s0
> > >>> +
> > >>> +        ret
> > >>> +endfunc
> > >>> +
> > >>> +function ff_pix_norm1_neon, export=1
> > >>> +// x0  const uint8_t *pix
> > >>> +// x1  int line_size
> > >>> +
> > >>> +        sxtw            x1, w1
> > >>> +        movi            v4.16b, #0
> > >>> +        movi            v5.16b, #0
> > >>> +        mov             w2, #16
> > >>> +
> > >>> +1:
> > >>> +        ld1             {v1.16b}, [x0], x1
> > >>> +        subs            w2, w2, #1
> > >>> +        umull           v2.8h, v1.8b,  v1.8b
> > >>> +        umull2          v3.8h, v1.16b, v1.16b
> > >>> +        uadalp          v4.4s, v2.8h
> > >>> +        uadalp          v5.4s, v3.8h
> > >>
> > >> From my earlier testing on A53, it seemed (surprisingly) to be equally
> > >> fast to accumulate into the same register for both instructions - but I
> > >> only tested that on A53. So we could change that here, getting rid of the
> > >> add at the end (and one movi). Or if it does help on some other core,
> > >> perhaps we should do the same for the function above too?
> > >
> > > Indeed, it is equally fast to accumulate into the same register on the
> > > A55 and A76 as well.
> > >
> > > New patches attached (patch 3/7 has functional changes, but patch 4/7
> > > only changes the commit message to reflect the new test run).
> >
> > LGTM very much now, thanks! And thanks for your patience through all the
> > iterations on such trivial patches as these.
>
> And thank you for your patience through the reviews :). I'm slowly
> getting up to speed with aarch64 and neon.
>
> I'll apply the pix_sum and pix_norm1 patches, and I'll wait a few days
> for any comments on the draw_edges patches.

Applied.
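
For anyone reading the archive later, here is a rough scalar sketch of what
the two NEON routines quoted above compute, assuming the usual 16x16 block
that the loop counters in the assembly imply. The names pix_sum_ref and
pix_norm1_ref are made up for illustration and are not FFmpeg symbols. The
ptrdiff_t conversion plays the role of the sxtw in the assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference, not the FFmpeg implementation:
 * sum of all pixels in a 16x16 block. */
int pix_sum_ref(const uint8_t *pix, int line_size)
{
    /* Widen the int stride to pointer width (what sxtw does in the asm),
     * so that negative strides also work. */
    ptrdiff_t stride = line_size;
    int sum = 0;

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[x];
        pix += stride;
    }
    return sum;
}

/* Hypothetical scalar reference, not the FFmpeg implementation:
 * sum of squared pixels in a 16x16 block. */
int pix_norm1_ref(const uint8_t *pix, int line_size)
{
    ptrdiff_t stride = line_size;
    int sum = 0;

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[x] * pix[x];
        pix += stride;
    }
    return sum;
}
```

On the accumulator width: in the pix_sum loop each 16-bit lane of v0.8h
receives two bytes per row, so at most 32 * 255 = 8160 over 16 rows, which
fits comfortably. With a single 32-bit accumulator in pix_norm1, each lane
receives four products per row, so at most 64 * 65025 = 4161600.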