[FFmpeg-devel] [PATCH] lavc/aarch64: add a few SIMD function for AAC PS

James Almer jamrial at gmail.com
Thu May 25 19:22:22 EEST 2017


On 5/25/2017 12:50 PM, Clément Bœsch wrote:
> ---
> 
> This is still not benchmarked (written and verified with qemu).
> 
> For instance, I wrote an alternative implementation of
> stereo_interpolate[0], which needs to be compared with the current one:
> 
> function ff_ps_stereo_interpolate_neon, export=1
>         ld1         {v0.4S}, [x2]
>         ld1         {v1.4S}, [x3]
> 1:
>         ld1         {v2.2S}, [x0]
>         ld1         {v3.2S}, [x1]
>         fadd        v0.4S, v0.4S, v1.4S
>         fmul        v4.2S, v2.2S, v0.S[0]
>         fmul        v5.2S, v2.2S, v0.S[1]
>         fmla        v4.2S, v3.2S, v0.S[2]
>         fmla        v5.2S, v3.2S, v0.S[3]
>         st1         {v4.2S}, [x0], #8
>         st1         {v5.2S}, [x1], #8
>         subs        w4, w4, #1
>         b.gt        1b
>         ret
> endfunc
> 
> I don't know which is faster. For now, the current version follows the
> logic I used in stereo_interpolate[1] (the ipdopd one). It does fewer
> multiply operations, but more shuffling.
> 
> A 3rd alternative would be possible if it were safe to process entries in
> pairs, that is, if len were always even or if overreading and overwriting
> by one more entry were allowed. Currently, this is not the case.
> 
> Speaking of ipdopd, the factors table and the use of ext may be clumsy.
> ---
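
For reference when comparing the two variants, the quoted alternative loop
can be read as the scalar C model below. This is only a sketch derived from
the assembly above: the name stereo_interpolate_model and the flat h/h_step
arrays are assumptions for illustration, not code from the patch. It also
uses a plain for loop, whereas the assembly always runs at least one
iteration, and it does not model the fused rounding of fmla.

```c
/* Hypothetical scalar model of the quoted NEON loop (not from the patch).
 * l, r:       len complex samples stored as {re, im} pairs (x0, x1).
 * h, h_step:  four coefficients and their per-sample increments (x2, x3). */
static void stereo_interpolate_model(float (*l)[2], float (*r)[2],
                                     const float h[4], const float h_step[4],
                                     int len)
{
    float c[4] = { h[0], h[1], h[2], h[3] };

    for (int n = 0; n < len; n++) {
        float l_re = l[n][0], l_im = l[n][1];
        float r_re = r[n][0], r_im = r[n][1];

        /* fadd v0.4S, v0.4S, v1.4S: step the coefficients first */
        for (int i = 0; i < 4; i++)
            c[i] += h_step[i];

        /* fmul/fmla into v4: new left sample */
        l[n][0] = c[0] * l_re + c[2] * r_re;
        l[n][1] = c[0] * l_im + c[2] * r_im;

        /* fmul/fmla into v5: new right sample */
        r[n][0] = c[1] * l_re + c[3] * r_re;
        r[n][1] = c[1] * l_im + c[3] * r_im;
    }
}
```

A model like this can serve as the reference output when checking either
NEON variant under qemu.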

[...]

> +function ff_ps_stereo_interpolate_ipdopd_neon, export=1
> +        movrel      x5, ipdopd_factors
> +        ld1         {v20.4S}, [x5]
> +        ld1         {v0.4S,v1.4S}, [x2]
> +        ld1         {v6.4S,v7.4S}, [x3]
> +1:
> +        ld1         {v2.2S}, [x0]
> +        ld1         {v3.2S}, [x1]
> +        dup         v2.2D, v2.D[0]
> +        dup         v3.2D, v3.D[0]
> +        fadd        v0.4S, v0.4S, v6.4S
> +        fadd        v1.4S, v1.4S, v7.4S
> +        zip1        v16.4S, v0.4S, v0.4S
> +        zip2        v17.4S, v0.4S, v0.4S
> +        zip1        v18.4S, v1.4S, v1.4S
> +        zip2        v19.4S, v1.4S, v1.4S
> +        fmul        v4.4S, v2.4S, v16.4S
> +        fmla        v4.4S, v3.4S, v17.4S
> +        ext         v2.16B, v2.16B, v2.16B, #4
> +        ext         v3.16B, v3.16B, v3.16B, #4

> +        fmul        v5.4S, v2.4S, v18.4S
> +        fmla        v5.4S, v3.4S, v19.4S
> +        fmla        v4.4S, v5.4S, v20.4S

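You could make ipdopd_factors be 0, INT32_MIN, 0, INT32_MIN, then replace
the fmla with eor + fadd. The eor with that table flips the sign bit of
lanes 1 and 3 and leaves lanes 0 and 2 untouched, so no multiply is needed.
No idea if that will actually be faster, though.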
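
To illustrate the idea per lane, here is a small scalar C sketch of the
sign-bit trick. It assumes the current factors are only +1.0 and -1.0 (the
table itself is not shown in the quoted part of the patch), and the helper
names xor_sign and add_signed are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Flip the sign of x when mask is 0x80000000 (INT32_MIN as a bit pattern),
 * leave x unchanged when mask is 0. This is what a vector eor against a
 * {0, INT32_MIN, 0, INT32_MIN} table would do in each lane. */
static float xor_sign(float x, uint32_t mask)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof(bits));   /* reinterpret float as raw bits */
    bits ^= mask;                      /* eor */
    memcpy(&x, &bits, sizeof(x));
    return x;
}

/* Hypothetical stand-in for "acc + x * factor" with factor in {+1, -1},
 * computed as eor + fadd instead of fmla. */
static float add_signed(float acc, float x, uint32_t mask)
{
    return acc + xor_sign(x, mask);    /* fadd */
}
```

For example, add_signed(1.0f, 2.0f, 0x80000000u) gives -1.0f and
add_signed(1.0f, 2.0f, 0) gives 3.0f, the same values as multiplying x by
-1.0 or +1.0 before the add.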

