[FFmpeg-devel] [PATCH] NEON code for basic scalar ops

Thu Aug 13 11:51:36 CEST 2009

Kostya <kostya.shishkov at gmail.com> writes:

> On Thu, Aug 13, 2009 at 12:33:07AM +0100, Måns Rullgård wrote:
>> Kostya <kostya.shishkov at gmail.com> writes:
>> 
>> > On Tue, Jul 21, 2009 at 03:23:58PM +0100, Måns Rullgård wrote:
>> >> Kostya <kostya.shishkov at gmail.com> writes:
>> >> 
>> >> > While waiting for RTMP patch review, here's a bit of NEON code to speed
>> >> > up int16 array addition/subtraction and scalar product calculation.
>> >> >
>> >> > This about halves decoding time for APE compressed at insane level
>> >> > (so it's only 7 times slower than realtime on my BeagleBoard).
>> >> 
>> >> These functions are far from optimal.
>> >
>> > Since I won't be able to work on it for some time, I'm posting here a
>> > version that is a few cycles closer to optimal (but still far away).
>> >
>> > +function ff_scalarproduct_int16_neon, export=1
>> > +        vmov.i16        q0,  #0
>> > +        vmov.i16        q1,  #0
>> > +        vmov.i16        q2,  #0
>> > +        vmov.i16        q3,  #0
>> > +1:      vld1.16         {d16-d17}, [r0]!
>> > +        vld1.16         {d20-d21}, [r1,:128]!
>> > +        vmlal.s16       q0,  d16,  d20
>> > +        vld1.16         {d18-d19}, [r0]!
>> > +        vmlal.s16       q1,  d17,  d21
>> > +        vld1.16         {d22-d23}, [r1,:128]!
>> > +        vmlal.s16       q2,  d18,  d22
>> > +        vmlal.s16       q3,  d19,  d23
>> > +        subs            r2,  r2,   #16
>> > +        bne             1b
>> > +        vpadd.s32       d8,  d0,   d1
>> > +        vpadd.s32       d9,  d2,   d3
>> > +        vpadd.s32       d10, d4,   d5
>> > +        vpadd.s32       d11, d6,   d7
>> > +        vpadd.s32       d0,  d8,   d9
>> > +        vpadd.s32       d1,  d10,  d11
>> > +        vpadd.s32       d2,  d0,   d1
>> > +        vpaddl.s32      d3,  d2
>> > +        vmov.32         r0,  d3[0]
>> > +        asr             r0,  r3
>> > +        bx              lr
>> > +        .endfunc
>> 
>> This doesn't do exactly the same thing as the C version, which shifts
>> immediately after the multiplication, before accumulating.  However,
>> all calls to DSPContext.scalarproduct_int16 have a zero shift.
>> 
>> Since shifting at the end is both more accurate and faster, maybe we
>> should change it.  Someone would have to update the sse and altivec
>> versions of course.
>
> The intent was to speed up scalar product calculation for Monkey
> Audio, but with CELP filters in mind too. Since those use fixed-point
> values, shifting right immediately after the multiplication is logical
> there (and will prevent overflows).

If you shift after multiplying, you can't use multiply-accumulate
instructions.
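
To illustrate the difference between the two orderings, here is a rough
C sketch. It is not the actual dsputil code; the function names are
made up for this example, and it assumes the products fit in 32 bits.
The first function shifts every product before adding it, which forces
a separate multiply, shift and add per element. The second accumulates
the raw products, which maps directly onto a multiply-accumulate such
as vmlal.s16, and shifts only once at the end.

```c
#include <stdint.h>

/* Hypothetical name: shift each product, then accumulate.
 * The low bits of every product are discarded before the add.
 * (Right-shifting a negative value is implementation-defined in C;
 * common compilers do an arithmetic shift.) */
static int32_t scalarproduct_shift_each(const int16_t *v1, const int16_t *v2,
                                        int len, int shift)
{
    int32_t sum = 0;
    for (int i = 0; i < len; i++)
        sum += (v1[i] * v2[i]) >> shift;
    return sum;
}

/* Hypothetical name: accumulate the full products, shift once at the end.
 * The loop body is a plain multiply-accumulate. */
static int32_t scalarproduct_shift_last(const int16_t *v1, const int16_t *v2,
                                        int len, int shift)
{
    int32_t sum = 0;
    for (int i = 0; i < len; i++)
        sum += v1[i] * v2[i];
    return sum >> shift;
}
```

With a zero shift the two give the same result. With a non-zero shift
they can differ, because the first variant rounds down once per element
rather than once overall.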

-- 
Måns Rullgård
mans at mansr.com