[FFmpeg-devel] [PATCH] NEON code for basic scalar ops

Thu Aug 13 06:44:56 CEST 2009

On Thu, Aug 13, 2009 at 12:33:07AM +0100, Måns Rullgård wrote:
> Kostya <kostya.shishkov at gmail.com> writes:
> 
> > On Tue, Jul 21, 2009 at 03:23:58PM +0100, Måns Rullgård wrote:
> >> Kostya <kostya.shishkov at gmail.com> writes:
> >> 
> >> > While waiting for RTMP patch review, here's a bit of NEON code to speed
> >> > up int16 array addition/subtraction and scalar product calculation.
> >> >
> >> > This about halves decoding time for APE compressed at insane level
> >> > (so it's only 7 times slower than realtime on my BeagleBoard).
> >> 
> >> These functions are far from optimal.
> >
> > Since I won't be able to work on it for some time, I am posting a version
> > here that is a few cycles closer to optimal (but still far from it).
> >
> > +function ff_scalarproduct_int16_neon, export=1
> > +        vmov.i16        q0,  #0
> > +        vmov.i16        q1,  #0
> > +        vmov.i16        q2,  #0
> > +        vmov.i16        q3,  #0
> > +1:      vld1.16         {d16-d17}, [r0]!
> > +        vld1.16         {d20-d21}, [r1,:128]!
> > +        vmlal.s16       q0,  d16,  d20
> > +        vld1.16         {d18-d19}, [r0]!
> > +        vmlal.s16       q1,  d17,  d21
> > +        vld1.16         {d22-d23}, [r1,:128]!
> > +        vmlal.s16       q2,  d18,  d22
> > +        vmlal.s16       q3,  d19,  d23
> > +        subs            r2,  r2,   #16
> > +        bne             1b
> > +        vpadd.s32       d8,  d0,   d1
> > +        vpadd.s32       d9,  d2,   d3
> > +        vpadd.s32       d10, d4,   d5
> > +        vpadd.s32       d11, d6,   d7
> > +        vpadd.s32       d0,  d8,   d9
> > +        vpadd.s32       d1,  d10,  d11
> > +        vpadd.s32       d2,  d0,   d1
> > +        vpaddl.s32      d3,  d2
> > +        vmov.32         r0,  d3[0]
> > +        asr             r0,  r3
> > +        bx              lr
> > +        .endfunc
> 
> This doesn't do exactly the same thing as the C version, which shifts
> immediately after the multiplication, before accumulating.  However,
> all calls to DSPContext.scalarproduct_int16 have a zero shift.
> 
> Since shifting at the end is both more accurate and faster, maybe we
> should change it.  Someone would have to update the sse and altivec
> versions of course.

The intent was to speed up scalar product calculation for Monkey's
Audio, but with CELP filters in mind too. Since those use fixed-point
values, shifting right immediately after each multiplication is logical
there (and will prevent overflows).
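
To make the difference concrete, here is a minimal C sketch of the two
orderings. The function names are made up for illustration and are not
the actual DSPContext code; the sketch also assumes non-negative products
and sums that fit in 32 bits, so it says nothing about overflow or about
how negative values round.

```c
#include <stdint.h>

/* Hypothetical name: shift every product right before accumulating,
 * which is the ordering described for the C version. */
static int32_t scalarproduct_shift_each(const int16_t *v1, const int16_t *v2,
                                        int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += (v1[i] * v2[i]) >> shift;   /* low bits dropped per product */
    return res;
}

/* Hypothetical name: accumulate full products and shift once at the end,
 * which is the ordering used in the NEON code above. */
static int32_t scalarproduct_shift_last(const int16_t *v1, const int16_t *v2,
                                        int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += v1[i] * v2[i];              /* no bits dropped here */
    return res >> shift;                   /* low bits dropped once */
}
```

With four products of 3 each and shift = 1, the first function returns 4
(1+1+1+1) and the second returns 6 (12 >> 1). With shift = 0 both return
12, which is why the current callers do not notice the difference.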

As for this version, I seem unable to find an instruction for a vector
right shift by a register value, only by an immediate (which looks like
discrimination against right shifts).

> -- 
> Måns Rullgård