[FFmpeg-devel] [PATCH] swresample/arm: add ff_resample_common_apply_filter_{x4, x8}_{float, s16}_neon

Wed May 11 21:04:22 CEST 2016

On 11.05.2016, at 20:37, Michael Niedermayer <michael at niedermayer.cc> wrote:

> On Wed, May 11, 2016 at 06:39:20PM +0200, Matthieu Bouron wrote:
>> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>> 
>> ---
>> 
>> Hello,
>> 
>> Here are some benchmarks of the attached patch on an rpi2.
>> 
>> ./ffmpeg -f lavfi -i sine=440,aformat=sample_fmts=fltp,asetnsamples=4096,abench=start,aresample=48000,abench=stop -t 1000 -f null -
>> 
>> With patch:    avg=0.001159 speed=44,1x
>> Without patch: avg=0.001297 speed=40,8x
>> 
>> ./ffmpeg -f lavfi -i sine=440,aformat=sample_fmts=s16p,asetnsamples=4096,abench=start,aresample=48000,abench=stop -t 1000 -f null -
>> 
> 
>> With patch:    avg=0.001374 speed=45,6x
>> Without patch: avg=0.000782 speed=64,6x
> 
> So it's slower? Or am I misreading this?

Yes, that seems weird.
Also, what are common filter lengths?
For a length of 4, 8 or 16 I'd expect this to be much better fully unrolled, and for longer ones at least partially unrolled.
Also, having the if on the filter length inside the outer loop in the C code does not seem ideal either, even if the compiler might manage to hoist it.
There's also the problem that on simple CPUs, like most ARM cores, the jump overhead is likely to be significant. So this might be a case where inline assembly (or writing the whole function in assembly) provides significant benefits. Otherwise there's a risk that enabling the recently discussed -ftree-vectorize for that file specifically would give better results.
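
To illustrate what I mean by unrolling and by moving the filter-length check out of the outer loop, here is a rough sketch in plain C. It is not the swresample code: the names dot4, dot_n and resample_sketch are made up, and it applies one fixed filter to every output sample, whereas the real resampler picks a filter phase per sample. It only shows the loop structure.

```c
#include <assert.h>

/* Hypothetical helper: fully unrolled dot product for a 4-tap filter.
 * No loop counter and no branch per tap. */
static float dot4(const float *src, const float *filter)
{
    return src[0] * filter[0] + src[1] * filter[1]
         + src[2] * filter[2] + src[3] * filter[3];
}

/* Hypothetical generic fallback for any other filter length. */
static float dot_n(const float *src, const float *filter, int len)
{
    float acc = 0.0f;
    for (int i = 0; i < len; i++)
        acc += src[i] * filter[i];
    return acc;
}

/* Hypothetical outer loop: the filter length is tested once, before the
 * per-sample loop, instead of once per output sample.
 * src must hold at least n + filter_length - 1 samples. */
static void resample_sketch(float *dst, const float *src,
                            const float *filter, int filter_length, int n)
{
    assert(filter_length > 0);

    if (filter_length == 4) {
        for (int i = 0; i < n; i++)
            dst[i] = dot4(src + i, filter);
    } else {
        for (int i = 0; i < n; i++)
            dst[i] = dot_n(src + i, filter, filter_length);
    }
}
```

The same dispatch could be extended with 8-tap and 16-tap cases. Whether this beats the compiler's own hoisting and vectorization would have to be measured.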