[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc
Jason Garrett-Glaser
darkshikari
Mon May 4 06:21:19 CEST 2009
On Sun, May 3, 2009 at 8:39 PM, Bobby Bingham <uhmmmm at gmail.com> wrote:
> On Sat, 25 Apr 2009 03:03:30 +0000 (UTC)
> Loren Merritt <lorenm at u.washington.edu> wrote:
>
>> On Fri, 24 Apr 2009, Bobby Bingham wrote:
>>
>> > Attached are patches to move flac_encode_residual_lpc to dsputils,
>> > and to add SSE3 and SSE4 implementations. I wrote the SSE3 first,
>> > but since it doesn't have signed 32x32 multiplication AFAICT, I
>> > ended up using double precision floats for it, and the result is
>> > code that's slower than the C version. Unless somebody has a
>> > suggestion of how to fix this, I'll drop the SSE3 version.
>> >
>> > I tried an SSE4 version because it does have signed 32x32->32
>> > multiplication, like the C version uses. Unfortunately, I don't
>> > have an SSE4-capable processor to test it with, so I can't check
>> > its speed or even its correctness. Benchmarks welcome.
>>
>> fails regression test on my Penryn.
>>
>> > +// TODO: look into palignr?
>>
>> Yea, do that. It should be possible to load each sample just once
>> (aligned), and do all other manipulation in registers.
>> There are no cpus with both lddqu and sse4, so you're paying the full
>> cost of unaligned loads.
>
> I've changed the code to use palignr, and hopefully fixed it to work
> correctly now. I've also removed the SSE3 code from this patch as I
> haven't managed to get it any faster by using integer arithmetic yet.
> "movdqu -16(%3,%0), %%xmm4 \n\t" // xmm4 = smp [i-4 .. i-1]
> "movdqu -12(%3,%0), %%xmm6 \n\t" // xmm6 = smp [i-3 .. i ]
Any reason you didn't use palignr here?
> "movdqu %%xmm5, %2 \n\t"
Is there a good reason why this store has to be unaligned?
> "phaddd %%xmm1, %%xmm0 \n\t"
> "phaddd %%xmm3, %%xmm2 \n\t"
> "phaddd %%xmm2, %%xmm0 \n\t" // xmm0 = [p0, p1, p2, p3]
Did you not find a better way of doing this without PHADD, given how slow it is?
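For reference, one phaddd-free way to get the four sums is a small transpose with punpck* followed by plain paddd. Below is a minimal sketch using C intrinsics rather than inline asm, assuming the four partial-product vectors are already available as rows of four 32-bit values. The name hsum4x4 and the array-based interface are made up for illustration, and whether this actually beats three phaddd depends on the CPU, so it would need benchmarking.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[k] = sum of the four 32-bit lanes of row k,
 * i.e. the same [p0, p1, p2, p3] layout the three phaddd produce. */
static void hsum4x4(const int32_t a[4], const int32_t b[4],
                    const int32_t c[4], const int32_t d[4], int32_t out[4])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    __m128i vd = _mm_loadu_si128((const __m128i *)d);

    /* punpckldq/punpckhdq: interleave rows, then add the two halves */
    __m128i ab = _mm_add_epi32(_mm_unpacklo_epi32(va, vb),   /* a0 b0 a1 b1 */
                               _mm_unpackhi_epi32(va, vb));  /* a2 b2 a3 b3 */
    __m128i cd = _mm_add_epi32(_mm_unpacklo_epi32(vc, vd),   /* c0 d0 c1 d1 */
                               _mm_unpackhi_epi32(vc, vd));  /* c2 d2 c3 d3 */

    /* punpcklqdq/punpckhqdq: gather the even and odd partial sums */
    __m128i lo = _mm_unpacklo_epi64(ab, cd);
    __m128i hi = _mm_unpackhi_epi64(ab, cd);

    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(lo, hi));
#else
    /* Scalar fallback so the sketch also builds on non-SSE2 targets. */
    const int32_t *rows[4] = { a, b, c, d };
    for (int k = 0; k < 4; k++)
        out[k] = rows[k][0] + rows[k][1] + rows[k][2] + rows[k][3];
#endif
}
```

This costs six unpacks and three adds in place of three phaddd, so it only pays off where phaddd is expensive enough to cover the extra instructions.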
> pmulld
pmulld is really really slow (6 clocks on Nehalem!). If you make
certain assumptions about the nature of the input data (say, restrict
your code to only 16-bit samples), you might be able to use a faster
instruction.
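One candidate for such an instruction, offered here as an assumption rather than a recommendation, is pmaddwd: it multiplies signed 16-bit pairs and adds adjacent products into 32-bit lanes. The sketch below uses C intrinsics and assumes both the samples and the LPC coefficients fit in 16 bits and that the order is exactly 8. The names dot8_ref and dot8_madd are hypothetical, and the coefficients are assumed to be stored in the same order as the history samples they multiply.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: 8-tap dot product with 32-bit accumulation. */
static int32_t dot8_ref(const int16_t *coef, const int16_t *smp)
{
    int32_t sum = 0;
    for (int j = 0; j < 8; j++)
        sum += (int32_t)coef[j] * smp[j];
    return sum;
}

/* Same dot product using pmaddwd (_mm_madd_epi16) when SSE2 is available. */
static int32_t dot8_madd(const int16_t *coef, const int16_t *smp)
{
#if defined(__SSE2__)
    __m128i c = _mm_loadu_si128((const __m128i *)coef);
    __m128i s = _mm_loadu_si128((const __m128i *)smp);

    /* 4 x int32: c0*s0+c1*s1, c2*s2+c3*s3, c4*s4+c5*s5, c6*s6+c7*s7 */
    __m128i p = _mm_madd_epi16(c, s);

    /* Fold the upper half onto the lower half, then lane 1 onto lane 0. */
    p = _mm_add_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(1, 0, 3, 2)));
    p = _mm_add_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(p);
#else
    /* Scalar fallback so the sketch also builds on non-SSE2 targets. */
    return dot8_ref(coef, smp);
#endif
}
```

One pmaddwd covers eight taps of a single prediction, where pmulld covers four, and the residual is then the current sample minus the sum shifted right by the predictor shift, as in the C version.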
> "movdqa %%xmm5, %%xmm9 \n\t"
Does this asm really need to be x86_64-only? If so, how about an
x86_32 version?
Dark Shikari