[FFmpeg-devel] [PATCH] swresample/arm: add ff_resample_common_apply_filter_{x4, x8}_{float, s16}_neon

Fri May 13 21:32:25 CEST 2016

On Thu, May 12, 2016 at 3:50 PM, Benoit Fouet <benoit.fouet at free.fr> wrote:

> Hi,
>
>
> On 12/05/2016 15:22, Matthieu Bouron wrote:
>
>> On Thu, May 12, 2016 at 10:01 AM, Benoit Fouet <benoit.fouet at free.fr>
>> wrote:
>>
>> Hi,
>>>
>>> I mostly have nits remarks.
>>>
>>> On 11/05/2016 18:39, Matthieu Bouron wrote:
>>>
>>> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>>>>
>>>>
>>>> [...]
>>>
>>> diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S
>>>
>>>> new file mode 100644
>>>> index 0000000..13462e3
>>>> --- /dev/null
>>>> +++ b/libswresample/arm/resample.S
>>>> @@ -0,0 +1,77 @@
>>>>
>>>> [...]
>>>>
>>>> +function ff_resample_common_apply_filter_x4_float_neon, export=1
>>>> +    vmov.f32            q0, #0.0
>>>>  @
>>>> accumulator
>>>> +1:  vld1.32             {q1}, [r1]!
>>>> @
>>>> src
>>>> +    vld1.32             {q2}, [r2]!
>>>> @
>>>> filter
>>>> +    vmla.f32            q0, q1, q2
>>>>  @
>>>> src + {0..3} * filter + {0..3}
>>>>
>>>> nit: the comment could be "accu += src[0..3] . filter[0..3]"
>>> same for the other ones below
>>>
>>> [...]
>>>
>>> +    subs                r3, #4                                         @
>>>
>>>> filter_length -= 4
>>>> +    bgt                 1b
>>>>  @
>>>> loop until filter_length
>>>> +    vpadd.f32           d0, d0, d1
>>>>  @
>>>> pair adding of the 4x32-bit accumulated values
>>>> +    vpadd.f32           d0, d0, d0
>>>>  @
>>>> pair adding of the 4x32-bit accumulator values
>>>> +    vst1.32             {d0[0]}, [r0]
>>>> @
>>>> write accumulator
>>>> +    mov pc, lr
>>>> +endfunc
>>>> +
>>>> +function ff_resample_common_apply_filter_x8_float_neon, export=1
>>>> +    vmov.f32            q0, #0.0
>>>>  @
>>>> accumulator
>>>> +1:  vld1.32             {q1}, [r1]!
>>>> @
>>>> src1
>>>> +    vld1.32             {q2}, [r2]!
>>>> @
>>>> filter1
>>>> +    vld1.32             {q8}, [r1]!
>>>> @
>>>> src2
>>>> +    vld1.32             {q9}, [r2]!
>>>> @
>>>> filter2
>>>> +    vmla.f32            q0, q1, q2
>>>>  @
>>>> src1 + {0..3} * filter1 + {0..3}
>>>> +    vmla.f32            q0, q8, q9
>>>>  @
>>>> src2 + {0..3} * filter2 + {0..3}
>>>>
>>>> instead of using src1 and src2, you may want to use src[0..3] and
>>> src[4..7]
>>> so, if I reuse the formulation I proposed above:
>>> accu += src[0..3] . filter[0..3]
>>> accu += src[4..7] . filter[4..7]
>>>
>>> Fixed locally (as well as the other case you mentionned) with:
>> -    vmla.f32            q0, q1, q2                                     @
>> src1 + {0..3} * filter1 + {0..3}
>> -    vmla.f32            q0, q8, q9                                     @
>> src2 + {0..3} * filter2 + {0..3}
>> +    vmla.f32            q0, q1, q2                                     @
>> accumulator += src1 + {0..3} * filter1 + {0..3}
>> +    vmla.f32            q0, q8, q9                                     @
>> accumulator += src2 + {4..7} * filter2 + {4..7}
>>
>> I prefer to use + {0..3} instead of [0..3] to make the comments consistent
>> with what has been done in swscale/arm.
>>
>>
> Fine for me (I chose the "[]" notation to be consistent with the "."
> notation also, in order to do as if it were a dot product between two
> vectors).
>
>
> +    subs                r3, #8                                         @
>>>
>>>> filter_length -= 4
>>>>
>>>> -= 8
>>>
>>> Fixed locally.
>>
>>
>> [...]
>>>
>>> diff --git a/libswresample/arm/resample_init.c
>>>
>>>> b/libswresample/arm/resample_init.c
>>>> new file mode 100644
>>>> index 0000000..c817d03
>>>> --- /dev/null
>>>> +++ b/libswresample/arm/resample_init.c
>>>>
>>>> [...]
>>>>
>>>> +static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void
>>>> *dest, const void *source,   \
>>>> +                                            int n, int update_ctx)
>>>>                            \
>>>> +{
>>>>                           \
>>>> +    DELEM *dst = dest;
>>>>                            \
>>>> +    const DELEM *src = source;
>>>>                            \
>>>> +    int dst_index;
>>>>                            \
>>>> +    int index= c->index;
>>>>                            \
>>>> +    int frac= c->frac;
>>>>                            \
>>>> +    int sample_index = index >> c->phase_shift;
>>>>                           \
>>>> +    int x4_aligned_filter_length = c->filter_length & ~3;
>>>>                           \
>>>> +    int x8_aligned_filter_length = c->filter_length & ~7;
>>>>                           \
>>>> +
>>>>                            \
>>>> +    index &= c->phase_mask;
>>>>                           \
>>>> +    for (dst_index = 0; dst_index < n; dst_index++) {
>>>>                           \
>>>> +        FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc *
>>>> index;                     \
>>>> +
>>>>                            \
>>>> +        FELEM2 val=0;
>>>>                           \
>>>> +        int i = 0;
>>>>                            \
>>>> +        if (x8_aligned_filter_length >= 8) {
>>>>                            \
>>>> +            ff_resample_common_apply_filter_x8_##TYPE##_neon(&val,
>>>> &src[sample_index],            \
>>>> +                                                             filter,
>>>> x8_aligned_filter_length);   \
>>>> +            i += x8_aligned_filter_length;
>>>>                            \
>>>> +
>>>>                            \
>>>> +        } else if (x4_aligned_filter_length >= 4) {
>>>>                           \
>>>>
>>>> do you think there could be a gain processing the remainder of the
>>> 8-aligned part through the 4-aligned part of the code? e.g. for a filter
>>> length of 15, that would make:
>>>   - one run of the 8-aligned
>>>   - one run of the 4-aligned
>>>   - 3 C loops
>>> As you stated filter length seems to commonly be 32, I guess that
>>> wouldn't
>>> be easy to benchmark, though.
>>>
>>> I'll see if I could trigger a case where the filter_length is 15 and come
>> up with a benchmark. If there is a performance gain, would you be ok if it
>> is implemented as a separate patch ?
>>
>
> Sure, no problem for me.
>

Pushed. Thanks.

Matthieu

[...]