[FFmpeg-devel] [PATCH] x86/dcadec: add ff_lfe_fir1_float_{sse3, avx}
Christophe Gisquet
christophe.gisquet at gmail.com
Mon Feb 22 23:44:05 CET 2016
Hi,
2016-02-22 22:43 GMT+01:00 James Almer <jamrial at gmail.com>:
> +.loop:
> +%if cpuflag(avx)
> + cvtdq2ps m4, [lfeq]
> + shufps m5, m4, m4, q0123
> +%elif cpuflag(sse2)
> + movu m4, [lfeq]
> + cvtdq2ps m4, m4
> + pshufd m5, m4, q0123
> +%endif
> +
> +.inner_loop:
> +%if ARCH_X86_64
> + movaps m6, [coeffq+cnt1q*4 ]
> + movaps m7, [coeffq+cnt1q*4+16]
> + movaps m8, [coeffq+cnt1q*4+32]
> + movaps m9, [coeffq+cnt1q*4+48]
> + mulps m0, m5, m6
> + mulps m1, m5, m7
> + mulps m2, m5, m8
> + mulps m3, m5, m9
> +%else
> + movaps m6, [coeffq+cnt1q*4 ]
> + movaps m7, [coeffq+cnt1q*4+16]
> + mulps m0, m5, m6
> + mulps m1, m5, m7
> + mulps m2, m5, [coeffq+cnt1q*4+32]
> + mulps m3, m5, [coeffq+cnt1q*4+48]
> +%endif
Is out-of-order execution (OOE) the reason you don't move the common
code out of those conditional blocks? If not, it looks cleaner to me to do:
movaps m6, [coeffq+cnt1q*4 ]
movaps m7, [coeffq+cnt1q*4+16]
mulps m0, m5, m6
mulps m1, m5, m7
%if ARCH_X86_64
movaps m8, [coeffq+cnt1q*4+32]
movaps m9, [coeffq+cnt1q*4+48]
mulps m2, m5, m8
mulps m3, m5, m9
%else
mulps m2, m5, [coeffq+cnt1q*4+32]
mulps m3, m5, [coeffq+cnt1q*4+48]
%endif
and let OOE do its job.
Secondly, m5 is not reused after this point, so maybe replace m5 with
m3 in all the code up to here, and load something else into m5 instead?
> + haddps m0, m1
> + haddps m2, m3
> + haddps m0, m2
> + movaps [samplesq+cnt1q], m0
I suppose you've already looked at most arrangements that would allow
fewer shuffles, and I don't see any obvious one either.
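
To spell out what the three haddps in the quoted hunk compute, here is a
minimal scalar sketch in C. The helper names (haddps_model, reduce4,
reduce4_example) are made up for illustration and are not part of FFmpeg;
the lane order is the one I understand from the SSE3 definition of haddps.

```c
#include <assert.h>

/* Scalar model of SSE3 "haddps dst, src" (hypothetical helper):
 * dst = { dst[0]+dst[1], dst[2]+dst[3], src[0]+src[1], src[2]+src[3] } */
static void haddps_model(float dst[4], const float src[4])
{
    float r[4] = { dst[0] + dst[1], dst[2] + dst[3],
                   src[0] + src[1], src[2] + src[3] };
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Mirrors the haddps sequence from the patch:
 *   haddps m0, m1 / haddps m2, m3 / haddps m0, m2
 * On return, m0[k] holds the horizontal sum of the k-th input vector.
 * m2 is clobbered, as in the assembly. */
static void reduce4(float m0[4], const float m1[4],
                    float m2[4], const float m3[4])
{
    haddps_model(m0, m1);
    haddps_model(m2, m3);
    haddps_model(m0, m2);
}

/* Small self-check with values that are exact in float. */
void reduce4_example(void)
{
    float m0[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float m1[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float m2[4] = { 100.0f, 200.0f, 300.0f, 400.0f };
    float m3[4] = { 0.5f, 0.5f, 0.5f, 0.5f };

    reduce4(m0, m1, m2, m3);

    assert(m0[0] == 10.0f);    /* sum of original m0 */
    assert(m0[1] == 100.0f);   /* sum of m1 */
    assert(m0[2] == 1000.0f);  /* sum of original m2 */
    assert(m0[3] == 2.0f);     /* sum of m3 */
}
```

Each output lane ends up as the horizontal sum of one of the four product
vectors, i.e. one dot product per group of four coefficients, which is why
two levels of haddps are needed before the single store.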
--
Christophe