[FFmpeg-devel] [PATCH] swr/resample: use fma when it is faster

James Almer jamrial at gmail.com
Mon Dec 14 00:55:12 CET 2015


On 12/13/2015 8:08 PM, Ganesh Ajjanagadde wrote:
> On Sun, Dec 13, 2015 at 5:55 PM, Ganesh Ajjanagadde
> <gajjanagadde at gmail.com> wrote:
>> On Sun, Dec 13, 2015 at 5:47 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>> Hi,
>>>
>>> On Sun, Dec 13, 2015 at 4:59 PM, Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>>> wrote:
>>>>
>>>> fma is a faster function on architectures supporting a native CPU
>>>> instruction for it.
>>>> This may be tested via the FP_FAST_FMA macro, which ISO C optionally
>>>> defines. Although in the x86 lineup this came fairly late (from Haswell
>>>> onwards, and hence is absent unless an appropriate -march is passed),
>>>> numerous other architectures support it:
>>>> https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation.
>>>>
>>>> Concretely, one can expect ~ 15-25% speedup that is of course heavily
>>>> architecture dependent.
>>>>
>>>> This patch also ensures that as people migrate to newer CPUs, the
>>>> benefit will slowly trickle in.
>>>>
>>>> I doubt this will cause build failures on broken libm's since I can't
>>>> imagine a platform where FP_FAST_FMA is defined but the function fma is
>>>> absent.
>>>>
>>>> Sample benchmark (x86-64, Haswell, GNU/Linux under -march=native)
>>>>
>>>> old:
>>>> 515828458 decicycles in build_filter (loop 1000),    1024 runs,      0 skips
>>>>
>>>> new (fma):
>>>> 435866377 decicycles in build_filter (loop 1000),    1024 runs,      0 skips
>>>>
>>>> Tested with FATE.
>>>>
>>>> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>>>> ---
>>>>  libswresample/resample.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/libswresample/resample.c b/libswresample/resample.c
>>>> index 34eb4c0..e61d4c5 100644
>>>> --- a/libswresample/resample.c
>>>> +++ b/libswresample/resample.c
>>>> @@ -33,8 +33,12 @@ static inline double eval_poly(const double *coeff, int size, double x) {
>>>>      double sum = coeff[size-1];
>>>>      int i;
>>>>      for (i = size-2; i >= 0; --i) {
>>>> +#ifdef FP_FAST_FMA
>>>> +        sum = fma(sum, x, coeff[i]);
>>>> +#else
>>>>          sum *= x;
>>>>          sum += coeff[i];
>>>> +#endif
>>>>      }
>>>>      return sum;
>>>>  }
>>>> --
>>>> 2.6.4
>>>
>>>
>>> Nope, this is not how we do CPU-specific optimizations. Check example
>>> implementations in libswresample/x86/*.asm and the related init functions
>>> plus macros to check for runtime cpu support in libswresample/x86/*_init.c.
>>> You want to follow that pattern.
>>
>> No, this is not x86 specific. This is generic code. If I did such a
>> maneuver, benefits would apply only to x86, an inferior outcome.
> 
> To clarify: yes, in theory one could dump such things into
> swresample/x86, swresample/aarch64, and a ton of other architectures
> (for which some arches are actually lacking). Such a diff is far
> larger and more brittle - I can't even test things like mips and the
> like, and looking up the manuals for each and every one of these to
> find out when/what is the fma equivalent is a pain in the neck.
> 
> ISO C provides a mechanism, albeit build-time and not runtime detection.
> 
> This patch is thus something that gives benefits at minimal scope for
> regressions. Unless others show where/how fma detection can be done
> for all arches (aarch64, arm, mips, powerpc, itanium, etc in addition
> to x86-64), I view your idea as future work.

FP_FAST_FMA is apparently not defined on mingw-w64 even though it supports
fma() and generates FMA3/4 instructions when targeting relevant CPUs.
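
To make that kind of check reproducible, here is a minimal sketch of a
build-time probe. It is not FFmpeg code: fma_macro_probe() and the PROBE_*
names are hypothetical. It only reports which fma-related macros a given
compiler and libm define, on the assumption that __FP_FAST_FMA is what GCC
predefines on targets with a native fma instruction and that __FMA__ is set
by GCC/Clang when FMA3 code generation is enabled.

```c
#include <math.h>

/* Hypothetical helper, not part of FFmpeg: each bit records whether one
 * fma-related macro is defined for this compiler/libm/flags combination. */
enum {
    PROBE_FP_FAST_FMA = 1, /* ISO C macro, optionally defined by <math.h> */
    PROBE_GNUC_FAST   = 2, /* __FP_FAST_FMA, assumed predefined by GCC */
    PROBE_X86_FMA3    = 4  /* __FMA__, set by GCC/Clang under -mfma */
};

static int fma_macro_probe(void)
{
    int flags = 0;
#ifdef FP_FAST_FMA
    flags |= PROBE_FP_FAST_FMA;
#endif
#ifdef __FP_FAST_FMA
    flags |= PROBE_GNUC_FAST;
#endif
#ifdef __FMA__
    flags |= PROBE_X86_FMA3;
#endif
    return flags;
}
```

A result with PROBE_X86_FMA3 set but PROBE_FP_FAST_FMA clear would match the
mingw-w64 behaviour described above: the compiler can emit the instruction,
yet the header does not advertise it.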

I also noticed that on x86_32 GCC generates a call to an external fma
function instead of inlining the relevant FMA3/4 instructions, just as it
does when the target lacks fast fma instructions, so simply checking the
target CPU is not enough. On such builds this patch will probably cause a
slowdown. No idea what GCC does on other arches.
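
For anyone who wants to inspect the generated code on their own target, the
guarded loop from the patch can be reduced to the standalone sketch below.
The function names eval_poly_plain() and eval_poly_guarded() and the assert
on size are my additions for illustration; the loop bodies follow the diff.
Compiling with -S for each target should show whether fma() becomes an
instruction or an external call.

```c
#include <assert.h>
#include <math.h>

/* Horner evaluation of coeff[0] + coeff[1]*x + ... + coeff[size-1]*x^(size-1),
 * using a separate multiply and add as in the unpatched code. */
static double eval_poly_plain(const double *coeff, int size, double x)
{
    double sum;
    int i;

    assert(size > 0); /* illustrative precondition, not in the original */
    sum = coeff[size - 1];
    for (i = size - 2; i >= 0; --i) {
        sum *= x;
        sum += coeff[i];
    }
    return sum;
}

/* Same evaluation, but uses fma() only when <math.h> defines FP_FAST_FMA,
 * mirroring the build-time guard proposed in the patch. */
static double eval_poly_guarded(const double *coeff, int size, double x)
{
    double sum;
    int i;

    assert(size > 0); /* illustrative precondition, not in the original */
    sum = coeff[size - 1];
    for (i = size - 2; i >= 0; --i) {
#ifdef FP_FAST_FMA
        sum = fma(sum, x, coeff[i]); /* single rounding per step */
#else
        sum *= x;
        sum += coeff[i];
#endif
    }
    return sum;
}
```

The two variants may differ in the last bits for general inputs, since fma
rounds once per step rather than twice; for small integer-valued inputs they
agree exactly.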

