[FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

Ganesh Ajjanagadde gajjanag at mit.edu
Thu Jan 14 23:47:23 CET 2016

On Thu, Jan 14, 2016 at 11:48 AM, James Almer <jamrial at gmail.com> wrote:
> On 1/14/2016 1:26 PM, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 14, 2016 at 11:16 AM, James Almer <jamrial at gmail.com> wrote:
>>> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>>>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner <henrik at gramner.com> wrote:
>>>>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>>>>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>>>>> read, nobody ever remembers the ordering of operands.
>>>> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
>>>> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
>>>> beyond me; especially with some side cases that are undocumented, see
>>>> below.
>>> fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
>>> without having to think about what number belongs to what operation and what
>>> operand. And it will output either FMA4 or FMA3 depending on the value passed
>>> to INIT_[XY]MM.
>> The fma3/fma4 thing is the only benefit. Even that is generally not a
>> big deal; AMD quickly started supporting fma3.
> Nobody is asking you to write an FMA4 version of this function. We're asking
> you to use the x86inc FMA4-like macros for readability purposes.
>>>> 2. If anything, the macro is harder, since it is not Intel supported,
>>> Of course it won't be there, it's not defined by them. Non-destructive four
>>> operand fma is defined by AMD.
>> Of course I know this.
>>>> I can't look it up at
>>>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
>>> Neither are any of the dozens of other compat macros in x86utils. And many of
>>> them are also undocumented within x86utils. This point is absurd.
>> How is it absurd? You expect me to use something that lacks clear
>> documentation, and claim that it is "more readable". What other macros
>> have/lack is irrelevant to the point.
> If you want documentation for FMA4 look at AMD docs, just like you didn't
> hesitate to look at Intel's.
>>>> 3. The macro does not seem to take care of the mov's (if any), still
>>>> requiring explicit thought on the part of the programmer.
>>> Yes, and? It's not an emulation macro like the uppercase ones that become
>>> several instructions. It translates a single FMA4-like instruction into
>>> either an FMA4 or FMA3 one.
>>> fmaddps xmm0, xmm0, xmm1, xmm2
>>> becomes
>>> vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
>>> vfmadd132ps xmm0, xmm2, xmm1 if FMA3
>>> If you try to use it with four different operands, it will work with FMA4
>>> but not FMA3, since as I said it's not trying to emulate anything.
>> Thanks for mentioning the convention; but this is an important one and
>> AFAIK not mentioned in any documentation within FFmpeg.
>>>> 4. The macro lacks documentation. In particular, it is not a thorough
>>>> fma4 emulation in the spirit of
>>>> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>>>> Or put in other words, IMO not good.
>>> No, it's good and what's done in every other asm file precisely for being
>>> more flexible and readable.
>> Flexibility, yes, readability still no.
> dst = src1 * src2 + src3
> That's all you need to know to read an FMA4-like instruction. Are you going to
> tell me that the clusterfuck that's FMA3 with varying numbers that change the
> order of operations and meaning of operands is easier to read?
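
For reference, the two conventions being argued about can be put side by side. This is only a sketch of the arithmetic, not of any real assembler or of x86inc; the helper names `fma3` and `fma4` are made up for illustration. In the FMA3 mnemonics the first two digits name the operands that are multiplied and the third the operand that is added, with the result always written to operand 1. The FMA4-like form is dst = src1 * src2 + src3.

```python
def fma3(order, op1, op2, op3):
    """Value written to operand 1 by vfmadd<order>.

    The first two digits of `order` pick the operands that are multiplied,
    the third digit picks the operand that is added.
    """
    ops = {"1": op1, "2": op2, "3": op3}
    a, b, c = (ops[digit] for digit in order)
    return a * b + c


def fma4(src1, src2, src3):
    """Value written to dst by the FMA4-like form: dst = src1 * src2 + src3."""
    return src1 * src2 + src3


x, y, z = 2.0, 3.0, 5.0
print(fma3("132", x, y, z))  # x = x*z + y
print(fma3("213", x, y, z))  # x = y*x + z
print(fma3("231", x, y, z))  # x = y*z + x
print(fma4(y, z, x))         # same value as the 231 form above
```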

BTW, this is why I personally don't like the macro:
I was moving along, replacing the instructions one after the other, until I came to this line:
    vfmadd213pd ymm1, ymm5, COVAR(iq  ,1)
I naturally replaced it with
    fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1)
which gives the error "invalid combination of opcode and operand".
I could spend the time seeing why it is broken, but frankly don't
care. The point is, the macro is broken, and the lack of documentation
just bit back.
    fmaddpd ymm1, ymm5, ymm1, COVAR(iq,1)
works though (the two multiplicands swapped).
And the idea of just looking at the AMD docs does not help either:
both orderings are perfectly fine for FMA4.
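
A plausible explanation, as a minimal sketch rather than a reading of x86inc itself: the quoted example above maps `fmaddps xmm0, xmm0, xmm1, xmm2` to `vfmadd132ps xmm0, xmm2, xmm1`, i.e. when dst equals src1 the addend ends up as the second FMA3 operand. FMA3 only accepts a memory operand in the last position, so a memory addend in that slot cannot be encoded. The model below assumes the macro checks src1, then src2, then src3 against dst; the 213 and 231 branches and the check order are assumptions, `to_fma3`, `is_mem` and `encodable` are hypothetical helpers, and `[mem]` stands in for `COVAR(iq,1)`, which is assumed to expand to a memory reference.

```python
def to_fma3(op, dst, src1, src2, src3):
    """Model of rewriting an FMA4-like `op dst, src1, src2, src3` as FMA3.

    Only the dst == src1 case is taken from the example in this thread;
    the other branches and the order of the checks are assumptions.
    """
    base, suffix = op[:-2], op[-2:]  # "fmaddpd" -> ("fmadd", "pd")
    if dst == src1:
        form, operands = "132", (dst, src3, src2)  # dst = dst*src2 + src3
    elif dst == src2:
        form, operands = "213", (dst, src1, src3)  # dst = src1*dst + src3
    elif dst == src3:
        form, operands = "231", (dst, src1, src2)  # dst = src1*src2 + dst
    else:
        raise ValueError("no FMA3 form: dst must equal one of the sources")
    return f"v{base}{form}{suffix}", operands


def is_mem(operand):
    """Treat anything that is not an xmm/ymm register name as memory."""
    return not operand.startswith(("xmm", "ymm"))


def encodable(operands):
    """FMA3 allows a memory operand only as the last operand."""
    return not any(is_mem(o) for o in operands[:-1])


failing = to_fma3("fmaddpd", "ymm1", "ymm1", "ymm5", "[mem]")
working = to_fma3("fmaddpd", "ymm1", "ymm5", "ymm1", "[mem]")
print(failing, encodable(failing[1]))
print(working, encodable(working[1]))
```

Under these assumptions the first ordering becomes a 132 form with the memory operand in the middle, which cannot be assembled, while the second becomes the same 213 instruction as the original line.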

All said, patchv2 posted.

> With the compat macros in x86inc, as long as two of the four operands are the
> same register then it's going to output the relevant FMA3 instruction for you.
>>> Especially since it allows one to write both
>>> FMA4 and FMA3 functions without duplicating code.
>> Fine.
>>> _______________________________________________
>>> ffmpeg-devel mailing list
>>> ffmpeg-devel at ffmpeg.org
>>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
