[FFmpeg-devel] [WIP] add sse4 flac lpc encoder

Mon Feb 3 05:55:15 CET 2014

On 02/02/14 10:18 PM, James Darnley wrote:
> A rather hacked together patch adding an sse4 version of the flac lpc
> encoder for 16-bit samples, flac_lpc_encode_c_16().  But it works correctly.
> 
> I have been using gprof to measure the time taken in functions.
> 
>> Each sample counts as 0.01 seconds.
>>   %   cumulative   self              self     total           
>>  time   seconds   seconds    calls  ms/call  ms/call  name    
> Original code:
>>  43.94     19.45    19.45                             flac_lpc_encode_c_16
> This patch:
>>  25.74     17.10     8.54                             ff_flac_enc_lpc_16_sse4
> 
> The fraction of total time is down from nearly half to just over a
> quarter.  The time reported by `time` is also less these ~12 seconds.
> 
> Original: 0m52.318s
> Patch:    0m40.198s
> 
> These tests were done with compression level 8 which does skew the time
> spent in these functions to be in my favour.
> 
> I already see that I can use 4 more xmm regs to unroll the loop more.

> diff --git a/libavcodec/flacdsp.c b/libavcodec/flacdsp.c
> index 02eba3e..8fae578 100644
> --- a/libavcodec/flacdsp.c
> +++ b/libavcodec/flacdsp.c
> @@ -26,7 +26,6 @@
>  #define SAMPLE_SIZE 16
>  #define PLANAR 0
>  #include "flacdsp_template.c"
> -#include "flacdsp_lpc_template.c"
>  
>  #undef  PLANAR
>  #define PLANAR 1
> @@ -43,6 +42,30 @@
>  #define PLANAR 1
>  #include "flacdsp_template.c"
>  
> +void ff_flac_enc_lpc_16_sse4(int32_t *, const int32_t *, const int32_t *, int, int, int);
> +
> +static void flac_lpc_encode_c_16(int32_t *res, const int32_t *smp, int len,
> +                                    int order, const int32_t *coefs, int shift)
> +{
> +    int i;
> +    for (i = 0; i < order; i++)
> +        res[i] = smp[i];
> +    /*for (i = order; i < len; i += 2) {
> +        int j;
> +        int s  = smp[i];
> +        int32_t p0 = 0, p1 = 0;
> +        for (j = 0; j < order; j++) {
> +            int c = coefs[j];
> +            p1   += (c*s);
> +            s     = smp[i-j-1];
> +            p0   += (c*s);
> +        }
> +        res[i  ] = smp[i  ] - (p0 >> shift);
> +        res[i+1] = smp[i+1] - (p1 >> shift);
> +    }*/
> +    ff_flac_enc_lpc_16_sse4(res+order, smp+order, coefs, len-order, order, shift);
> +}

I assume this is why you called this a hacked together WIP.
With this the SSE4 version of the function is masquerading itself as the c one. You're 
effectively forcing it for every user regardless of arch or CPU. All while removing the 
c version altogether.

Look at my lpc_32 dec patch for the proper way to introduce asm functions.
CPU capabilities need to be checked at runtime, and function pointers set accordingly.

This means that ff_flac_enc_lpc_16_sse4() needs to have the same parameters and expect 
the same arguments as flac_lpc_encode_c_16.
That "res[i] = smp[i]" loop and those additions and subtractions you did before calling 
the function will have to be done inside the actual asm function.

That aside, those are some nice gains. Puts my decoder to shame :P (Assuming of course 
the code i wrote is not optimal and there's a way to get more out of it, which is 
likely the case).

Regards.