[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc
Sat Apr 25 08:15:22 CEST 2009
On Fri, 24 Apr 2009 17:53:23 -0700
Jason Garrett-Glaser <darkshikari at gmail.com> wrote:
> >"lddqu -16(%3,%0), %%xmm4 \n\t" // xmm4 = smp [i-4 ..
> LDDQU does not help (it is equivalent to movdqu, except that it takes
> one more byte to encode) on any SSE4-supporting CPU.
Ok. The documentation I've seen seems to imply that lddqu might be
faster than movdqu, but doesn't say so definitively. I'll change it.
> >"cvtdq2pd -8(%3,%0), %%xmm5 \n\t" // xmm5 = smp [i-2, i-1]
> Is it really required to constantly convert in and out of floating
> point here? Mubench ( http://akuvian.org/src/mubench_results.txt )
> says that this operation is horrifically slow on Athlon 64, for
> example. Why not use integer math?
I realize it's slow -- I only have an Athlon 64 X2 here to test on. But
I either need signed 32x32 multiplication (which AFAICT SSE3 doesn't
offer) or to implement it myself on top of what is offered.
Conversion seemed easier, but I'll try to make integer math work next.
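
For reference, a minimal sketch of what "implement it myself" could look
like with SSE2 intrinsics. It builds a lane-wise 32x32 multiply out of the
unsigned pmuludq, relying on the fact that the low 32 bits of a product are
the same for signed and unsigned operands. It assumes a wrapped 32-bit
result is enough for the LPC sum; the function name is made up for
illustration and is not part of the patch.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: lane-wise 32x32 -> low 32-bit multiply on SSE2.
 * SSE4.1 provides this directly as pmulld. */
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* pmuludq multiplies lanes 0 and 2, giving two 64-bit products. */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shift lanes 1 and 3 down into lanes 0 and 2, then multiply those. */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));
    /* Gather the low 32 bits of each 64-bit product into lanes 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave back into the original lane order: p0, p1, p2, p3. */
    return _mm_unpacklo_epi32(even, odd);
}
```

If the sum can overflow 32 bits, this is not sufficient and a widening
multiply would be needed instead.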
FWIW, my friend with an Intel chip (not sure exact model) reports what
sounds like slower performance relative to the C code than I get on the
Athlon.
Thanks for that link. It looks handy.
> >"phaddd %%xmm0, %%xmm0 \n\t"
> PHADD is slow and should be avoided where possible. If you're looking
> to sum the values in a register, a chain of binary-search-style
> shift/add is better. Here's what x264 uses:
> %macro HADDD 2
>     movhlps %2, %1
>     paddd   %1, %2
>     pshuflw %2, %1, 0xE
>     paddd   %1, %2
> %endmacro
>
> %macro HADDW 2
>     pmaddwd %1, [pw_1 GLOBAL]
>     HADDD   %1, %2
> %endmacro
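
The same shift/add reduction can be sketched in C with SSE2 intrinsics,
which may be easier to compare against the phaddd version. This is only an
illustration of the HADDD idea quoted above: the function name is
hypothetical, and _mm_unpackhi_epi64 is used as a stand-in for movhlps to
avoid float casts.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: sum the four signed 32-bit lanes of v. */
static inline int32_t hadd_epi32_sse2(__m128i v)
{
    /* Stand-in for movhlps: copy the upper 64 bits over the lower 64. */
    __m128i t = _mm_unpackhi_epi64(v, v);
    v = _mm_add_epi32(v, t);          /* lane0 = v0+v2, lane1 = v1+v3 */
    t = _mm_shufflelo_epi16(v, 0xE);  /* pshuflw: bring lane1 into lane0 */
    v = _mm_add_epi32(v, t);          /* lane0 = v0+v1+v2+v3 */
    return _mm_cvtsi128_si32(v);      /* extract lane0 */
}
```

Only lane 0 of the final register is meaningful; the other lanes hold
partial sums, as in the macro.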
> > +// TODO: look into palignr?
> Yes, do this. Your code is going to be slow on Penryn, where
> cacheline-split loads are very expensive.
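
A rough sketch of the palignr approach, assuming the sample buffer is
16-byte aligned: two aligned loads are combined with palignr to produce the
window that starts one sample later, so no unaligned (and possibly
cacheline-split) load is issued. The function name is hypothetical, the
target attribute is GCC/Clang-specific, and palignr requires SSSE3.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */
#include <stdint.h>

/* Hypothetical helper: return samples [1..4] of a 16-byte-aligned buffer
 * holding at least 8 samples, using only aligned loads. */
static __attribute__((target("ssse3")))
__m128i load_offset1(const int32_t *aligned)
{
    __m128i lo = _mm_load_si128((const __m128i *)aligned);        /* s0..s3 */
    __m128i hi = _mm_load_si128((const __m128i *)(aligned + 4));  /* s4..s7 */
    /* Concatenate hi:lo and shift right by 4 bytes: s1, s2, s3, s4. */
    return _mm_alignr_epi8(hi, lo, 4);
}
```

The shift count must be a compile-time constant, so a loop over several
offsets would need one palignr per offset (4, 8, 12 bytes) rather than a
variable count.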
Will do. Thanks for the review. I'll try to rework the patch in the
next couple days.
> Dark Shikari