[FFmpeg-devel] [PATCH] NEON code for basic scalar ops
Måns Rullgård
mans
Thu Aug 13 01:33:07 CEST 2009
Kostya <kostya.shishkov at gmail.com> writes:
> On Tue, Jul 21, 2009 at 03:23:58PM +0100, Måns Rullgård wrote:
>> Kostya <kostya.shishkov at gmail.com> writes:
>>
>> > While waiting for RTMP patch review, here's a bit of NEON code to speed
>> > up int16 array addition/subtraction and scalar product calculation.
>> >
>> > This about halves decoding time for APE compressed at insane level
>> > (so it's only 7 times slower than realtime on my BeagleBoard).
>>
>> These functions are far from optimal.
>
> Since I won't be able to work on it for some time, I'm posting a version
> here that is a few cycles closer to optimal (but still far from it).
>
> +function ff_scalarproduct_int16_neon, export=1
> + vmov.i16 q0, #0
> + vmov.i16 q1, #0
> + vmov.i16 q2, #0
> + vmov.i16 q3, #0
> +1: vld1.16 {d16-d17}, [r0]!
> + vld1.16 {d20-d21}, [r1,:128]!
> + vmlal.s16 q0, d16, d20
> + vld1.16 {d18-d19}, [r0]!
> + vmlal.s16 q1, d17, d21
> + vld1.16 {d22-d23}, [r1,:128]!
> + vmlal.s16 q2, d18, d22
> + vmlal.s16 q3, d19, d23
> + subs r2, r2, #16
> + bne 1b
> + vpadd.s32 d8, d0, d1
> + vpadd.s32 d9, d2, d3
> + vpadd.s32 d10, d4, d5
> + vpadd.s32 d11, d6, d7
> + vpadd.s32 d0, d8, d9
> + vpadd.s32 d1, d10, d11
> + vpadd.s32 d2, d0, d1
> + vpaddl.s32 d3, d2
> + vmov.32 r0, d3[0]
> + asr r0, r3
> + bx lr
> + .endfunc
This doesn't do exactly the same thing as the C version, which shifts
each product immediately after the multiplication, before accumulating.
However, all calls to DSPContext.scalarproduct_int16 have a zero shift,
so the difference never shows up in practice.
Since shifting once at the end is both more accurate and faster, maybe
we should change the C version to match. Someone would have to update
the SSE and AltiVec versions too, of course.
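
To make the difference concrete, here is a minimal C sketch of the two
orderings. The function names are made up for illustration and are not
the actual FFmpeg code; the first only follows the behaviour described
above for the C version, the second follows what the NEON code does.
Non-negative inputs are assumed, so the right shift is well defined.

```c
#include <stdint.h>

/* Hypothetical name: shift each product before accumulating,
 * as described above for the C version. */
int32_t scalarproduct_shift_each(const int16_t *v1, const int16_t *v2,
                                 int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += (v1[i] * v2[i]) >> shift;  /* low bits dropped per product */
    return res;
}

/* Hypothetical name: accumulate the full products and shift once
 * at the end, as the NEON version does. */
int32_t scalarproduct_shift_end(const int16_t *v1, const int16_t *v2,
                                int len, int shift)
{
    int32_t res = 0;
    for (int i = 0; i < len; i++)
        res += v1[i] * v2[i];             /* full-precision accumulation */
    return res >> shift;                  /* low bits dropped only once */
}
```

With v1 = {3, 3, 3, 3}, v2 = {1, 1, 1, 1} and shift = 2, the first
variant truncates every product (3 >> 2 is 0) and returns 0, while the
second returns 12 >> 2 = 3. With shift = 0 both return 12, which is
why the current callers see no difference.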
--
Måns Rullgård
mans at mansr.com