[FFmpeg-devel] [PATCH] avfilter/avf_showcqt: cqt_calc optimization on x86

Wed Jun 8 18:34:48 CEST 2016

On 6/7/2016 6:18 AM, Muhammad Faiz wrote:
>>> +        sub     lend, 2
>>> >> +        lea     dstq, [dstq + 16]
>> >
>> > Use add
>> >
>>> >> +        lea     coeffsq, [coeffsq + 2*Coeffs.sizeof]
>> >
>> > Same, assuming sizeof is an immediate.
>> >
> This is optimization to separate sub and jnz with lea.

Does it make a noticeable difference? We don't really bother
keeping the instruction that sets the flag and the one that
checks it separated anywhere else, so i wonder.

> Using add will clobber flag register.

You could keep the coeffsq lea but change the dstq into an add
and move it above the sub instruction. No need for two slow leas
for the above purpose.

> Also lea does not need rex prefix

add dstq, 16 seems to be four bytes, same as lea dstq [dstq+16],
at least on win64.

>>> +    SELECT_CQT_CALC(sse,  SSE,  4, 0);
>>> >> +    SELECT_CQT_CALC(sse3, SSE3, 4, 0);
>>> >> +    SELECT_CQT_CALC(avx,  AVX,  8, 01452367);
>> >
>> > Use AVX_FAST, so this function will not be used on CPUs that set the
>> > AV_CPU_FLAG_AVXSLOW flag.
>> >
>>> >> +    SELECT_CQT_CALC(fma3, FMA3, 8, 01452367);
>> >
>> > Same, use FMA3_FAST. The result will then be the FMA3 version used by
>> > Intel CPUs and hopefully AMD Zen, and the FMA4 one by Bulldozer based
>> > CPUs.
>> >
>>> >> +    SELECT_CQT_CALC(fma4, FMA4, 8, 01452367);
>>> >> +}
>>> >>
>> >
> OK, also reorder (FMA4 before AVX because AVX/ymm without FMA4 is faster than
> FMA4/xmm)

Not in any machine supporting FMA4. And they will not use the
AVX or FMA3 functions here because they set the avxslow flag,
which the EXTERNAL_*_FAST macros check for.

You could have kept the FMA4 init at the end and it would be
functionally the same, but in any case it's purely cosmetic.