[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc

Mon May 4 06:21:19 CEST 2009

On Sun, May 3, 2009 at 8:39 PM, Bobby Bingham <uhmmmm at gmail.com> wrote:
> On Sat, 25 Apr 2009 03:03:30 +0000 (UTC)
> Loren Merritt <lorenm at u.washington.edu> wrote:
>
>> On Fri, 24 Apr 2009, Bobby Bingham wrote:
>>
>> > Attached are patches to move flac_encode_residual_lpc to dsputils,
>> > and to add SSE3 and SSE4 implementations. I wrote the SSE3 first,
>> > but since it doesn't have signed 32x32 multiplication AFAICT, I
>> > ended up using double precision floats for it, and the result is
>> > code that's slower than the C version. Unless somebody has a
>> > suggestion of how to fix this, I'll drop the SSE3 version.
>> >
>> > I tried an SSE4 version because it does have signed 32x32->32
>> > multiplication, like the C version uses. Unfortunately, I don't
>> > have an SSE4-capable processor to test it with, so I can't check
>> > its speed or even its correctness. Benchmarks welcome.
>>
>> fails regression test on my Penryn.
>>
>> > +// TODO: look into palignr?
>>
>> Yea, do that. It should be possible to load each sample just once
>> (aligned), and do all other manipulation in registers.
>> There are no cpus with both lddqu and sse4, so you're paying the full
>> cost of unaligned loads.
>
> I've changed the code to use palignr, and hopefully fixed it to work
> correctly now. I've also removed the SSE3 code from this patch as I
> haven't managed to get it any faster by using integer arithmetic yet.

>"movdqu  -16(%3,%0), %%xmm4         \n\t"   // xmm4 = smp  [i-4 .. i-1]
>"movdqu  -12(%3,%0), %%xmm6         \n\t"   // xmm6 = smp  [i-3 .. i  ]

Any reason you didn't use palignr here?
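
As a minimal sketch of the idea (C intrinsics rather than inline asm, and
not taken from the patch): if smp[i-4 .. i-1] and smp[i .. i+3] are already
in registers, the window smp[i-3 .. i] can be built with one palignr instead
of a second load. The helper name shifted_window is hypothetical, and the
sketch uses unaligned loads only so that it runs on arbitrary test arrays;
real code would presumably use aligned loads. It falls back to scalar code
when SSSE3 is not enabled at compile time.

```c
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#endif

/* Hypothetical helper: given prev = smp[i-4 .. i-1] and cur = smp[i .. i+3],
 * produce out = smp[i-3 .. i] without touching memory a second time. */
void shifted_window(const int32_t prev[4], const int32_t cur[4], int32_t out[4])
{
#if defined(__SSSE3__)
    __m128i p = _mm_loadu_si128((const __m128i *)prev);
    __m128i c = _mm_loadu_si128((const __m128i *)cur);
    /* palignr: concatenate c:p (c in the high half), shift right by
     * 4 bytes (one 32-bit sample), keep the low 16 bytes. */
    __m128i w = _mm_alignr_epi8(c, p, 4);
    _mm_storeu_si128((__m128i *)out, w);
#else
    /* Scalar fallback with the same result. */
    out[0] = prev[1];
    out[1] = prev[2];
    out[2] = prev[3];
    out[3] = cur[0];
#endif
}
```

Shifts of 8 and 12 bytes would give the other two windows in the same way.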

>"movdqu     %%xmm5, %2              \n\t"

Is there a good reason why this store has to be unaligned?

> "phaddd     %%xmm1, %%xmm0          \n\t"
> "phaddd     %%xmm3, %%xmm2          \n\t"
> "phaddd     %%xmm2, %%xmm0          \n\t"   // xmm0 = [p0, p1, p2, p3]

Did you not find a better way of doing this without PHADD, given how slow it is?
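
One possible alternative, sketched here with SSE2 intrinsics under the
assumption that the goal is four horizontal sums packed into one register:
transpose with punpck and add vertically. The function name hsum4 is
hypothetical. This uses more instructions than three phaddd, so whether it
is actually faster on a given CPU would have to be benchmarked.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[k] = sum of the four elements of the k-th input,
 * i.e. the same result as the three-phaddd sequence. */
void hsum4(const int32_t a[4], const int32_t b[4],
           const int32_t c[4], const int32_t d[4], int32_t out[4])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    __m128i vd = _mm_loadu_si128((const __m128i *)d);

    /* Interleave 32-bit lanes, then add: partial sums of even/odd pairs. */
    __m128i ab = _mm_add_epi32(_mm_unpacklo_epi32(va, vb),   /* a0 b0 a1 b1 */
                               _mm_unpackhi_epi32(va, vb));  /* a2 b2 a3 b3 */
    __m128i cd = _mm_add_epi32(_mm_unpacklo_epi32(vc, vd),   /* c0 d0 c1 d1 */
                               _mm_unpackhi_epi32(vc, vd));  /* c2 d2 c3 d3 */

    /* Interleave 64-bit halves, then add: full sums in order a, b, c, d. */
    __m128i r = _mm_add_epi32(_mm_unpacklo_epi64(ab, cd),
                              _mm_unpackhi_epi64(ab, cd));
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Scalar fallback with the same result. */
    out[0] = a[0] + a[1] + a[2] + a[3];
    out[1] = b[0] + b[1] + b[2] + b[3];
    out[2] = c[0] + c[1] + c[2] + c[3];
    out[3] = d[0] + d[1] + d[2] + d[3];
#endif
}
```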

>pmulld

pmulld is really, really slow (6 clocks on Nehalem!).  If you make
certain assumptions about the nature of the input data (say, restrict
your code to only 16-bit samples), you might be able to use a faster
instruction.
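
To illustrate one such option (an assumption about which instruction fits,
not something from the patch): if both samples and coefficients fit in 16
bits, pmaddwd multiplies eight 16-bit pairs and adds adjacent products into
four 32-bit sums, which covers the multiply and half of the horizontal add
at once. The sketch below uses SSE2 intrinsics; the function names are
hypothetical, and the count is assumed to be a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: sum of coefs[j] * smp[j] with a 32-bit accumulator. */
int32_t dot16_scalar(const int16_t *coefs, const int16_t *smp, size_t n)
{
    int32_t sum = 0;
    for (size_t j = 0; j < n; j++)
        sum += (int32_t)coefs[j] * smp[j];
    return sum;
}

/* Same dot product using pmaddwd (_mm_madd_epi16); n must be a multiple of 8. */
int32_t dot16_madd(const int16_t *coefs, const int16_t *smp, size_t n)
{
    assert(n % 8 == 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (size_t j = 0; j < n; j += 8) {
        __m128i c = _mm_loadu_si128((const __m128i *)(coefs + j));
        __m128i s = _mm_loadu_si128((const __m128i *)(smp + j));
        /* Each 32-bit lane gets c[2k]*s[2k] + c[2k+1]*s[2k+1]. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(c, s));
    }
    /* Reduce the four 32-bit lanes to a single sum. */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
#else
    return dot16_scalar(coefs, smp, n);
#endif
}
```

The 32-bit accumulator can still overflow for large inputs and high orders,
so the input restriction would need to be checked before taking this path.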

>"movdqa     %%xmm5, %%xmm9          \n\t"

Does this asm really need to be x86_64-only?  If so, how about an
x86_32 version?

Dark Shikari