[FFmpeg-devel] [PATCH] x86/dcadec: add ff_lfe_fir1_float_{sse3, avx}
James Almer
jamrial at gmail.com
Tue Feb 23 00:01:34 CET 2016
On 2/22/2016 7:44 PM, Christophe Gisquet wrote:
> Hi,
>
> 2016-02-22 22:43 GMT+01:00 James Almer <jamrial at gmail.com>:
>> +.loop:
>> +%if cpuflag(avx)
>> + cvtdq2ps m4, [lfeq]
>> + shufps m5, m4, m4, q0123
>> +%elif cpuflag(sse2)
>> + movu m4, [lfeq]
>> + cvtdq2ps m4, m4
>> + pshufd m5, m4, q0123
>> +%endif
>> +
>> +.inner_loop:
>> +%if ARCH_X86_64
>> + movaps m6, [coeffq+cnt1q*4 ]
>> + movaps m7, [coeffq+cnt1q*4+16]
>> + movaps m8, [coeffq+cnt1q*4+32]
>> + movaps m9, [coeffq+cnt1q*4+48]
>> + mulps m0, m5, m6
>> + mulps m1, m5, m7
>> + mulps m2, m5, m8
>> + mulps m3, m5, m9
>> +%else
>> + movaps m6, [coeffq+cnt1q*4 ]
>> + movaps m7, [coeffq+cnt1q*4+16]
>> + mulps m0, m5, m6
>> + mulps m1, m5, m7
>> + mulps m2, m5, [coeffq+cnt1q*4+32]
>> + mulps m3, m5, [coeffq+cnt1q*4+48]
>> +%endif
>
> Is OOE the reason why you don't move the common code out of those
> conditional blocks? Otherwise, it looks cleaner to me to do:
Not really. I just thought having x86_64 and x86_32 clearly separated
was easier to read.
> movaps m6, [coeffq+cnt1q*4 ]
> movaps m7, [coeffq+cnt1q*4+16]
> mulps m0, m3, m6
> mulps m1, m3, m7
> %if ARCH_X86_64
> movaps m8, [coeffq+cnt1q*4+32]
> movaps m9, [coeffq+cnt1q*4+48]
> mulps m2, m5, m8
> mulps m3, m5, m9
> %else
> mulps m2, m5, [coeffq+cnt1q*4+32]
> mulps m3, m5, [coeffq+cnt1q*4+48]
> %endif
> and let OOE do its job.
>
> Secondly, m5 is not reused afterwards, so maybe replace m5 by m3 for
> all code up to this, and load something into m5 instead?
m5 and m4 contain the lfe samples. I can't reuse them inside the inner
loop.
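
To make the register roles concrete, here is a minimal scalar sketch in C of
what the top of .loop computes. It is not part of the patch. It assumes
128-bit registers (four lanes) and that x86inc's q0123 expands to the
immediate 0x1B, which reverses the lane order. The function names are
hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Scalar model of CVTDQ2PS: convert four signed 32-bit ints to floats. */
static void cvtdq2ps_model(float dst[4], const int32_t src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (float)src[i];
}

/* Scalar model of SHUFPS dst, a, b, imm8:
 * the two low lanes are selected from a, the two high lanes from b. */
static void shufps_model(float dst[4], const float a[4], const float b[4],
                         unsigned imm)
{
    float r[4] = { a[imm & 3],        a[(imm >> 2) & 3],
                   b[(imm >> 4) & 3], b[(imm >> 6) & 3] };
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Hypothetical helper mirroring the top of .loop:
 * m4 = lfe samples as floats, m5 = the same samples in reversed order
 * (0x1B is the assumed value of q0123). */
static void load_lfe_model(float m4[4], float m5[4], const int32_t lfe[4])
{
    cvtdq2ps_model(m4, lfe);
    shufps_model(m5, m4, m4, 0x1B);
}
```

Under these assumptions, both m4 and m5 stay live for the whole inner loop,
so neither can be used as scratch there.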
>
>> + haddps m0, m1
>> + haddps m2, m3
>> + haddps m0, m2
>> + movaps [samplesq+cnt1q], m0
>
> I suppose you've already looked at most arrangements that would help
> doing fewer shuffles. And I don't see any obvious one either.
>
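
For reference, a minimal scalar sketch in C of the three-haddps reduction
quoted above. It is not part of the patch and the function names are
hypothetical. It models the two-operand SSE3 form, where the destination is
also the first source, and shows that m0 ends up holding the horizontal sum
of each of the four product vectors.

```c
/* Scalar model of SSE3 HADDPS dst, src:
 * dst = { d0+d1, d2+d3, s0+s1, s2+s3 }. */
static void haddps_model(float dst[4], const float src[4])
{
    float r[4] = { dst[0] + dst[1], dst[2] + dst[3],
                   src[0] + src[1], src[2] + src[3] };
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Hypothetical helper mirroring the end of .inner_loop.
 * On return, m0[i] is the sum of the four lanes of the original m<i>;
 * m2 is clobbered, as in the assembly. */
static void reduce4_model(float m0[4], const float m1[4],
                          float m2[4], const float m3[4])
{
    haddps_model(m0, m1);   /* pairwise sums of m0 and m1 */
    haddps_model(m2, m3);   /* pairwise sums of m2 and m3 */
    haddps_model(m0, m2);   /* full sums of m0, m1, m2, m3 */
}
```

Each haddps is itself a shuffle-heavy instruction, which is why the number of
horizontal operations matters here.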