[FFmpeg-devel] [WIP] add sse4 flac lpc encoder

Tue Feb 4 06:48:45 CET 2014

On 02/02/14 10:18 PM, James Darnley wrote:
> A rather hacked together patch adding an sse4 version of the flac lpc
> encoder for 16-bit samples, flac_lpc_encode_c_16().  But it works correctly.
> 
> I have been using gprof to measure the time taken in functions.
> 
>> Each sample counts as 0.01 seconds.
>>   %   cumulative   self              self     total           
>>  time   seconds   seconds    calls  ms/call  ms/call  name    
> Original code:
>>  43.94     19.45    19.45                             flac_lpc_encode_c_16
> This patch:
>>  25.74     17.10     8.54                             ff_flac_enc_lpc_16_sse4
> 
> The fraction of total time is down from nearly half to just over a
> quarter.  The time reported by `time` is also about 12 seconds lower.
> 
> Original: 0m52.318s
> Patch:    0m40.198s
> 
> These tests were done with compression level 8 which does skew the time
> spent in these functions to be in my favour.
> 
> I already see that I can use 4 more xmm regs to unroll the loop more.

I tested just now, and the code is crashing for me.

> +INIT_XMM sse4
> +cglobal flac_enc_lpc_16, 3, 5, 4, 0, res, smp, coefs ; len, order, shift

You're calling the function with six arguments, but this cglobal line only
declares three, so only the first three get loaded into registers. You're also
reserving five general-purpose registers instead of six.

> +                                   ; r0   r1   r2      r3   r4     r5
> +
> +%define posj r3
> +%define negj r4
> +
> +movd m3, r5m ; shift
> +loop_len:
> +    pxor m0,  m0
> +    xor posj, posj
> +    xor negj, negj

You're losing the len and order values here, because you use their registers
(r3 and r4) as counters.

You could do
cglobal flac_enc_lpc_16, 6, 8, 4, res, smp, coefs, len, order, shift, pos, neg

at the top instead, and rename things accordingly. You'd then be using eight
general-purpose registers, though, which makes the code unsuitable for 32-bit
x86.

> +    loop_order:
> +        movd   m2, [coefsq+posj*4] ; c = coefs[j]
> +        SPLATD m2
> +        movu   m1, [smpq+negj*4-4] ; s = smp[i-j-1]
> +        pmulld m1,  m2
> +        paddd  m0,  m1             ; p += c * s
> +
> +        add posj, 1
> +        sub negj, 1
> +        cmp posj, r4m
> +    jne loop_order
> +
> +    psrad m0, m3                   ; p >>= shift
> +    movu  m1, [smpq]
> +    psubd m1, m0                   ; smp[i] - p
> +    movu  [resq], m1               ; res[i] = smp[i] - (p >> shift)
> +
> +    add resq, mmsize
> +    add smpq, mmsize
> +    sub DWORD r3m, mmsize/4
> +jg loop_len
> +RET
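
For reference, here is a rough scalar C sketch of what I read the loop above as
computing, going only by its comments. The name lpc_residual_ref and its
signature are made up for illustration (this is not the existing C function),
and I'm assuming smp points at the first sample to predict, with `order`
samples of valid history before it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference, not FFmpeg's actual C function.
 * Assumption: smp[-order] .. smp[-1] are valid history samples,
 * matching the asm's [smpq+negj*4-4] loads. */
static void lpc_residual_ref(int32_t *res, const int32_t *smp,
                             const int32_t *coefs,
                             int len, int order, int shift)
{
    assert(order >= 1 && shift >= 0 && shift < 32);

    for (int i = 0; i < len; i++) {
        int32_t p = 0;

        /* p += c * s, with c = coefs[j] and s = smp[i-j-1] */
        for (int j = 0; j < order; j++)
            p += coefs[j] * smp[i - j - 1];

        /* >> on a negative int32_t is implementation-defined in C;
         * mainstream compilers shift arithmetically, like psrad. */
        res[i] = smp[i] - (p >> shift);
    }
}
```

The asm does the same thing for four consecutive values of i at once, so len
and order both have to stay live across the inner loop, which is why they need
registers of their own.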

After changing what I mentioned above, the code worked for me, though the speed
gains in my tests weren't as good as what you reported (I used the default
compression level, however).

Regards