[FFmpeg-devel] [PATCH] NEON code for basic scalar ops

Thu Aug 13 01:33:07 CEST 2009

Kostya <kostya.shishkov at gmail.com> writes:

> On Tue, Jul 21, 2009 at 03:23:58PM +0100, Måns Rullgård wrote:
>> Kostya <kostya.shishkov at gmail.com> writes:
>> 
>> > While waiting for RTMP patch review, here's a bit of NEON code to speed
>> > up int16 array addition/subtraction and scalar product calculation.
>> >
>> > This about halves decoding time for APE compressed at insane level
>> > (so it's only 7 times slower than realtime on my BeagleBoard).
>> 
>> These functions are far from optimal.
>
> Since I won't be able to work on it for some time, I'm posting a version
> that is a few cycles closer to optimal (but still far away).
>
> +function ff_scalarproduct_int16_neon, export=1
> +        vmov.i16        q0,  #0
> +        vmov.i16        q1,  #0
> +        vmov.i16        q2,  #0
> +        vmov.i16        q3,  #0
> +1:      vld1.16         {d16-d17}, [r0]!
> +        vld1.16         {d20-d21}, [r1,:128]!
> +        vmlal.s16       q0,  d16,  d20
> +        vld1.16         {d18-d19}, [r0]!
> +        vmlal.s16       q1,  d17,  d21
> +        vld1.16         {d22-d23}, [r1,:128]!
> +        vmlal.s16       q2,  d18,  d22
> +        vmlal.s16       q3,  d19,  d23
> +        subs            r2,  r2,   #16
> +        bne             1b
> +        vpadd.s32       d8,  d0,   d1
> +        vpadd.s32       d9,  d2,   d3
> +        vpadd.s32       d10, d4,   d5
> +        vpadd.s32       d11, d6,   d7
> +        vpadd.s32       d0,  d8,   d9
> +        vpadd.s32       d1,  d10,  d11
> +        vpadd.s32       d2,  d0,   d1
> +        vpaddl.s32      d3,  d2
> +        vmov.32         r0,  d3[0]
> +        asr             r0,  r3
> +        bx              lr
> +        .endfunc

This doesn't do exactly the same thing as the C version, which shifts
immediately after the multiplication, before accumulating.  However,
all calls to DSPContext.scalarproduct_int16 have a zero shift.

Since shifting at the end is both more accurate and faster, maybe we
should change the C version to do the same.  Someone would have to
update the SSE and AltiVec versions too, of course.
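
To illustrate the difference, here is a minimal C sketch.  It is not
the actual FFmpeg code: the function names scalarproduct_shift_each
and scalarproduct_shift_end are made up for this example, and the
inputs are assumed to be non-negative, since right-shifting a negative
value is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical illustration: shift every product before accumulating,
 * as the C version is described as doing.  The low bits of each
 * product are discarded before the sum is formed. */
static int32_t scalarproduct_shift_each(const int16_t *v1, const int16_t *v2,
                                        int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += (v1[i] * v2[i]) >> shift;
    return res;
}

/* Hypothetical illustration: accumulate the full products and shift
 * once at the end, as the NEON code above does.  Only the low bits of
 * the final sum are discarded. */
static int32_t scalarproduct_shift_end(const int16_t *v1, const int16_t *v2,
                                       int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += v1[i] * v2[i];
    return res >> shift;
}
```

With v1 = {3, 3, 3, 3}, v2 = {1, 1, 1, 1} and a shift of 1, the first
function returns 4 (each product 3 is truncated to 1) while the second
returns 6 (12 >> 1).  With a shift of 0 both return 12, so callers
that pass a zero shift cannot tell the two apart.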

-- 
Måns Rullgård
mans at mansr.com