[FFmpeg-devel] [PATCH v2 2/3] avfilter/x86/vf_exposure: add ff_exposure_avx2

Sat Nov 20 23:28:26 EET 2021

On 11/20/2021 5:42 PM, Wu Jianhua wrote:
> James Almer<mailto:jamrial at gmail.com>:
> On 11/4/2021 1:18 AM, Wu Jianhua wrote:
>>> Performance data(Less is better):
>>>       exposure_sse:   500491
> 
>> You reported a better result in the first patch.
> 
> For they are tested on different baseline, I think it might be better to only compare these two values.
> 
>>>       exposure_avx2:  449122
> 
>> This looks like a really low speed up for a function that processes
>>   twice the amount of floats per loop.
> 
>>>
>>> Signed-off-by: Wu Jianhua <jianhua.wu at intel.com>
>>> ---
>>>    libavfilter/x86/vf_exposure.asm    | 15 +++++++++++++++
>>>    libavfilter/x86/vf_exposure_init.c |  6 ++++++
>>>    2 files changed, 21 insertions(+)
>>>
>>> diff --git a/libavfilter/x86/vf_exposure.asm b/libavfilter/x86/vf_exposure.asm
>>> index 3351c6fb3b..f271167805 100644
>>> --- a/libavfilter/x86/vf_exposure.asm
>>> +++ b/libavfilter/x86/vf_exposure.asm
>>> @@ -36,11 +36,21 @@ cglobal exposure, 2, 2, 4, ptr, length, black, scale
>>>        VBROADCASTSS m1, xmm1
>>>    %endif
>>>
>>> +%if cpuflag(fma3) || cpuflag(fma4)
> 
>> Remove the fma4 check if you're not using it.
> 
> No problem. Avx2 flag is only initialized with fma3, so the fma4 is redundant indeed.
> 
>>> +    mulps       m0, m0, m1 ; black * scale
>>> +%endif
>>> +
>>>    .loop:
>>> +%if cpuflag(fma3) || cpuflag(fma4)
>>> +    mova        m2, m0
>>> +    vfmsub231ps m2, m1, [ptrq]
>>> +    movu    [ptrq], m2
> 
>> Have you tried to not use FMA for this and just kept the sub + mul even
>> for AVX2 and see how it performs?
> 
> Yeah. Definitely. I have had sufficient tests before. The first version is kept sub + mul
> for AVX2. After that, I keep trying to find a way out to speed up it further. Using FMA
> here would be faster than sub + mul indeed, precisely, improving by 4%-10% approximately.
> Not that much better, but still an optimal way I found at the present.

I tried the checkasm test you wrote and when i made the AVX2 version use 
sub + mul instead of vfmsub231ps i noticed that i could change the 
epsilon value to FLT_EPSILON instead of 0.01f and the test would still 
succeed, meaning the output of the version using vfmsub231ps deviates a 
bit from the normal sub + mul one.

The speed up is pretty small, so it may be worth just using the sub + 
mul version instead.

> 
>>>   +%else
>>>        movu        m2, [ptrq]
>>>        subps       m2, m2, m0
>>>        mulps       m2, m2, m1
>>>        movu    [ptrq], m2
>>>   +%endif
>>>        add       ptrq, mmsize
>>>        sub    lengthq, mmsize/4
>>>
>>>   @@ -52,4 +62,9 @@ cglobal exposure, 2, 2, 4, ptr, length, black, scale
>>>    %if ARCH_X86_64
>>>    INIT_XMM sse
>>>    EXPOSURE
>>>   +
>>>   +%if HAVE_AVX2_EXTERNAL
>>>   +INIT_YMM avx2
>>>   +EXPOSURE
>>>   +%endif
>>>    %endif
>>>   diff --git a/libavfilter/x86/vf_exposure_init.c b/libavfilter/x86/vf_exposure_init.c
>>>   index de1b360f6c..80dae6164e 100644
>>>   --- a/libavfilter/x86/vf_exposure_init.c
>>>   +++ b/libavfilter/x86/vf_exposure_init.c
>>>   @@ -24,6 +24,7 @@
>>>    #include "libavfilter/exposure.h"
>>>
>>>    void ff_exposure_sse(float *ptr, int length, float black, float scale);
>>>   +void ff_exposure_avx2(float *ptr, int length, float black, float scale);
>>>
>>>    av_cold void ff_exposure_init_x86(ExposureContext *s)
>>>   {
>>>   @@ -32,5 +33,10 @@ av_cold void ff_exposure_init_x86(ExposureContext *s)
>>>    #if ARCH_X86_64
>>>        if (EXTERNAL_SSE(cpu_flags))
>>>            s->exposure_func = ff_exposure_sse;
>>>   +
>>>   +#if HAVE_AVX2_EXTERNAL
> 
>> No need for this preprocessor check.
> 
> Got it. I’ll update it.
> 
> Thanks for your review.
> Jianhua
> 
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request at ffmpeg.org with subject "unsubscribe".
>