[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc

Loren Merritt lorenm
Sat Apr 25 05:03:30 CEST 2009

On Fri, 24 Apr 2009, Bobby Bingham wrote:

> Attached are patches to move flac_encode_residual_lpc to dsputils, and
> to add SSE3 and SSE4 implementations.  I wrote the SSE3 first, but
> since it doesn't have signed 32x32 multiplication AFAICT, I ended up
> using double precision floats for it, and the result is code that's
> slower than the C version.  Unless somebody has a suggestion of how to
> fix this, I'll drop the SSE3 version.
> I tried an SSE4 version because it does have signed 32x32->32
> multiplication, like the C version uses.  Unfortunately, I don't have an
> SSE4-capable processor to test it with, so I can't check its speed or
> even its correctness.  Benchmarks welcome.

Fails the regression test on my Penryn.

> +// TODO: look into palignr?

Yeah, do that. It should be possible to load each sample just once
(aligned) and do all other manipulation in registers.
No CPU with SSE4 gets any speedup from lddqu over movdqu, so you're
paying the full cost of unaligned loads.
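
A minimal sketch of the palignr idea, written with intrinsics rather than
the patch's inline asm. It assumes GCC or Clang on x86 and 32-bit samples
in a 16-byte-aligned buffer. The function name shifted_windows and the
buffer layout are hypothetical and not taken from the patch. Two aligned
loads cover smp[0..7], and each shifted window is then built in registers
with _mm_alignr_epi8 (palignr) instead of a further unaligned load.

```c
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (palignr) */

/* Hypothetical helper: produce smp[1..4], smp[2..5] and smp[3..6] from
 * two aligned loads. smp must be 16-byte aligned and hold 8 values. */
__attribute__((target("ssse3")))
static void shifted_windows(const int32_t *smp, int32_t out[3][4])
{
    __m128i lo = _mm_load_si128((const __m128i *)smp);        /* smp[0..3] */
    __m128i hi = _mm_load_si128((const __m128i *)(smp + 4));  /* smp[4..7] */

    /* palignr concatenates hi:lo and shifts right by a byte count,
     * keeping the low 16 bytes. 4 bytes == one 32-bit sample. */
    _mm_storeu_si128((__m128i *)out[0], _mm_alignr_epi8(hi, lo, 4));
    _mm_storeu_si128((__m128i *)out[1], _mm_alignr_epi8(hi, lo, 8));
    _mm_storeu_si128((__m128i *)out[2], _mm_alignr_epi8(hi, lo, 12));
}
#else
/* Portable fallback so the sketch still builds on other targets. */
static void shifted_windows(const int32_t *smp, int32_t out[3][4])
{
    for (int k = 0; k < 3; k++)
        memcpy(out[k], smp + k + 1, 4 * sizeof(int32_t));
}
#endif
```

In a real loop the hi register of one iteration would become the lo
register of the next, so each sample is loaded from memory only once.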

For 16bit samples, stereo decorrelation sometimes makes some of the 
channels 17bit, but other channels are still 16bit, so a 16bit-specific 
optimization would still help. And lpc coefs always fit in 15bit. So 
pmaddwd should be usable.
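
As a rough illustration of the pmaddwd case, the sketch below computes one
order-8 prediction sum with _mm_madd_epi16 and checks it against a scalar
loop. The names predict_scalar and predict_pmaddwd8, the fixed order of 8,
and the layout (coefs stored so that coefs[j] pairs with hist[j]) are
assumptions for the example, not the patch's code. It also assumes the
full sum fits in 32 bits.

```c
#include <stdint.h>

/* Scalar reference: 32-bit multiply-accumulate. */
static int32_t predict_scalar(const int16_t *hist, const int16_t *coefs,
                              int order)
{
    int32_t sum = 0;
    for (int j = 0; j < order; j++)
        sum += (int32_t)coefs[j] * hist[j];
    return sum;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>   /* SSE2: _mm_madd_epi16 (pmaddwd) */

__attribute__((target("sse2")))
static int32_t predict_pmaddwd8(const int16_t *hist, const int16_t *coefs)
{
    __m128i s = _mm_loadu_si128((const __m128i *)hist);
    __m128i c = _mm_loadu_si128((const __m128i *)coefs);

    /* Each 32-bit lane = s[2k]*c[2k] + s[2k+1]*c[2k+1]. With 16-bit
     * samples and 15-bit coefs each product is at most 2^29 in
     * magnitude, so a pair sum cannot overflow the lane. */
    __m128i p = _mm_madd_epi16(s, c);

    /* Horizontal add of the four lanes into lane 0. */
    p = _mm_add_epi32(p, _mm_srli_si128(p, 8));
    p = _mm_add_epi32(p, _mm_srli_si128(p, 4));
    return _mm_cvtsi128_si32(p);
}
#else
/* Portable fallback so the sketch still builds on other targets. */
static int32_t predict_pmaddwd8(const int16_t *hist, const int16_t *coefs)
{
    return predict_scalar(hist, coefs, 8);
}
#endif
```

The residual would then be the current sample minus this sum shifted right
by the quantization shift, as in the C version.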
For the 17bit cases, you could split them into two 16bit halves, though 
that would significantly increase the number of arithmetic ops, so I'm not 
sure if it's useful.
24bit flac of course needs pmulld.
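
For the 17-bit split mentioned above, here is a scalar sketch of the
arithmetic only. The particular split (low 8 bits plus the remaining high
part) and the name mul_split17 are my assumptions; other splits are
possible. Both parts fit in a signed 16-bit word, so in SIMD each partial
product could go through pmaddwd, at the cost of roughly twice as many
multiplies.

```c
#include <stdint.h>

/* Multiply a 17-bit signed sample s by a 15-bit signed coef c using only
 * products whose operands fit in signed 16-bit words. */
static int32_t mul_split17(int32_t s, int16_t c)
{
    int32_t lo = s & 0xFF;         /* low 8 bits, 0..255 */
    int32_t hi = (s - lo) / 256;   /* exact division; -256..255 for 17-bit s */

    /* s == hi * 256 + lo, so c * s == (c * hi) * 256 + c * lo. */
    return (int32_t)c * hi * 256 + (int32_t)c * lo;
}
```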

--Loren Merritt
