[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc

Sat Apr 25 02:53:23 CEST 2009

>"lddqu   -16(%3,%0), %%xmm4         \n\t"   // xmm4 = smp  [i-4 .. i-1]

LDDQU buys you nothing on any SSE4-supporting CPU: there it is
equivalent to movdqu (except that it takes one more byte to encode),
so it does not help with cacheline-split loads.

>"cvtdq2pd -8(%3,%0), %%xmm5         \n\t"   // xmm5 = smp  [i-2, i-1]

Is it really required to constantly convert in and out of floating
point here?  Mubench ( http://akuvian.org/src/mubench_results.txt )
says that this operation is horrifically slow on Athlon 64, for
example.  Why not use integer math?
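
A minimal scalar sketch of the integer approach, for reference.  The
function name, argument order and the 64-bit accumulator are
assumptions made for illustration; they are not taken from the patch:

```c
#include <stdint.h>

/* Integer-only LPC residual: no int<->double conversion anywhere.
 * Hypothetical helper; names and signature are illustrative only. */
static void residual_lpc_int(int32_t *res, const int32_t *smp, int n,
                             int order, const int32_t *coefs, int shift)
{
    int i, j;

    /* The first `order` samples have no full history: copy them. */
    for (i = 0; i < order && i < n; i++)
        res[i] = smp[i];

    for (i = order; i < n; i++) {
        /* 64-bit accumulator so that coef * sample cannot overflow. */
        int64_t pred = 0;
        for (j = 0; j < order; j++)
            pred += (int64_t)coefs[j] * smp[i - j - 1];
        /* Assumes >> on a negative value is an arithmetic shift,
         * which holds for the compilers FFmpeg targets. */
        res[i] = smp[i] - (int32_t)(pred >> shift);
    }
}
```

With coefs = {4, -2} and shift = 1 the predictor is 2*smp[i-1] -
smp[i-2], so a linear ramp gives all-zero residuals after the first
two samples.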

>"phaddd     %%xmm0, %%xmm0          \n\t"

PHADD is slow and should be avoided where possible.  If you're looking
to sum the values in a register, a binary-search-style chain of
shuffle/add is better: each step folds the upper half of the remaining
elements onto the lower half.  Here's what x264 uses:

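%macro HADDD 2 ; %1 = dwords to sum (result in low dword), %2 = scratch
    movhlps %2, %1          ; %2 low qword = %1 high qword
    paddd   %1, %2          ; low qword of %1 = {d0+d2, d1+d3}
    pshuflw %2, %1, 0xE     ; move dword 1 of %1 into dword 0 of %2
    paddd   %1, %2          ; low dword of %1 = d0+d1+d2+d3, rest is junk
%endmacro

%macro HADDW 2 ; same arguments, but sums the eight words of %1
    pmaddwd %1, [pw_1 GLOBAL] ; multiply by 1, add adjacent words -> 4 dwords
    HADDD   %1, %2
%endmacro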
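
For anyone who prefers reading intrinsics, here is a rough C rendering
of the same shuffle/add chain using only SSE/SSE2.  The function name
hsum_epi32 is made up for this sketch; it is not an x264 or FFmpeg
symbol:

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE movhlps intrinsic */
#include <stdint.h>

/* Sum the four 32-bit lanes of v without PHADDD.
 * Hypothetical helper mirroring the HADDD macro step by step. */
static int32_t hsum_epi32(__m128i v)
{
    /* movhlps: copy the high qword of v into the low qword of hi. */
    __m128i hi = _mm_castps_si128(
        _mm_movehl_ps(_mm_castsi128_ps(v), _mm_castsi128_ps(v)));
    /* Low qword now holds {d0+d2, d1+d3}. */
    __m128i s = _mm_add_epi32(v, hi);
    /* pshuflw 0xE: bring dword 1 down to dword 0. */
    __m128i sw = _mm_shufflelo_epi16(s, 0xE);
    /* Low dword = d0+d1+d2+d3; the upper lanes are junk. */
    s = _mm_add_epi32(s, sw);
    return _mm_cvtsi128_si32(s);
}
```

Calling hsum_epi32(_mm_setr_epi32(1, 2, 3, 4)) yields 10.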

> +// TODO: look into palignr?

Yes, do this.  Your code is going to be slow on Penryn, where
cacheline-split loads are very expensive.
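
A hedged sketch of the idea: do two aligned loads and extract the
unaligned window from the register pair instead of issuing an
unaligned load.  The helper names and the fixed one-sample offset are
assumptions for illustration; the shift/or branch is just a fallback
so the sketch still builds without SSSE3:

```c
#include <emmintrin.h>
#ifdef __SSSE3__
#include <tmmintrin.h>  /* _mm_alignr_epi8 (PALIGNR) */
#endif
#include <stdint.h>

/* Return elements [1..4] of the 8-dword sequence lo:hi,
 * i.e. the window starting 4 bytes into lo.  Hypothetical helper. */
static __m128i window_off1(__m128i lo, __m128i hi)
{
#ifdef __SSSE3__
    /* PALIGNR: concatenate hi:lo, shift right 4 bytes, keep low 128. */
    return _mm_alignr_epi8(hi, lo, 4);
#else
    /* Pre-SSSE3 equivalent: byte shifts plus OR. */
    return _mm_or_si128(_mm_srli_si128(lo, 4), _mm_slli_si128(hi, 12));
#endif
}

/* smp must be 16-byte aligned and have at least 8 elements. */
static void load_window_off1(const int32_t *smp, int32_t out[4])
{
    __m128i lo = _mm_load_si128((const __m128i *)smp);        /* aligned */
    __m128i hi = _mm_load_si128((const __m128i *)(smp + 4));  /* aligned */
    _mm_storeu_si128((__m128i *)out, window_off1(lo, hi));
}
```

In a real loop hi would be carried over as the next iteration's lo, so
each 16 bytes of input is loaded once.  The shift count is an
immediate, so each possible misalignment needs its own code path.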

Dark Shikari