[FFmpeg-devel] [PATCH] SSE-optimized vector_clipf()

Vitor Sessak vitor1001
Sat Aug 8 09:04:14 CEST 2009


Michael Niedermayer wrote:
> On Thu, Aug 06, 2009 at 02:55:30AM +0200, Vitor Sessak wrote:
>> Vitor Sessak wrote:
>>> $subj, 10% speedup for twinvq decoding (but should be useful also for AMR 
>>> and wmapro).
>> err, I mean, attached.
>>
>> -Vitor
> 
>>  dsputil.c         |   15 +++++++++++++++
>>  dsputil.h         |    3 ++-
>>  x86/dsputil_mmx.c |   34 ++++++++++++++++++++++++++++++++++
>>  3 files changed, 51 insertions(+), 1 deletion(-)
>> 8a95f5f2f3d267089056d6a571b2e6cc37d1569e  dsp_vector_clipf.diff
>> Index: libavcodec/dsputil.c
>> ===================================================================
>> --- libavcodec/dsputil.c	(revision 19598)
>> +++ libavcodec/dsputil.c	(working copy)
>> @@ -4093,6 +4093,20 @@
>>          dst[i] = src[i] * mul;
>>  }
>>  
>> +void vector_clipf_c(float *dst, float min, float max, int len) {
>> +    int i;
>> +    for (i=0; i < len; i+=8) {
>> +        dst[i    ] = av_clipf(dst[i    ], min, max);
>> +        dst[i + 1] = av_clipf(dst[i + 1], min, max);
>> +        dst[i + 2] = av_clipf(dst[i + 2], min, max);
>> +        dst[i + 3] = av_clipf(dst[i + 3], min, max);
>> +        dst[i + 4] = av_clipf(dst[i + 4], min, max);
>> +        dst[i + 5] = av_clipf(dst[i + 5], min, max);
>> +        dst[i + 6] = av_clipf(dst[i + 6], min, max);
>> +        dst[i + 7] = av_clipf(dst[i + 7], min, max);
>> +    }
>> +}
> 
> this one could be tried by using integer math instead of floats
> (assuming IEEE floats of course)

How could this possibly be faster? It would just clip the sign, then the 
exponent, then the mantissa. It seems like much more work for me, unless 
I'm missing something.

>>  static av_always_inline int float_to_int16_one(const float *src){
>>      int_fast32_t tmp = *(const int32_t*)src;
>>      if(tmp & 0xf0000){
>> @@ -4669,6 +4683,7 @@
>>      c->vector_fmul_add_add = ff_vector_fmul_add_add_c;
>>      c->vector_fmul_window = ff_vector_fmul_window_c;
>>      c->int32_to_float_fmul_scalar = int32_to_float_fmul_scalar_c;
>> +    c->vector_clipf = vector_clipf_c;
>>      c->float_to_int16 = ff_float_to_int16_c;
>>      c->float_to_int16_interleave = ff_float_to_int16_interleave_c;
>>      c->add_int16 = add_int16_c;
>> Index: libavcodec/dsputil.h
>> ===================================================================
>> --- libavcodec/dsputil.h	(revision 19598)
>> +++ libavcodec/dsputil.h	(working copy)
>> @@ -396,7 +396,8 @@
>>      void (*vector_fmul_window)(float *dst, const float *src0, const float *src1, const float *win, float add_bias, int len);
>>      /* assume len is a multiple of 8, and arrays are 16-byte aligned */
>>      void (*int32_to_float_fmul_scalar)(float *dst, const int *src, float mul, int len);
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> -
>> +    /* assume len is a multiple of 16, and dst is 16-byte aligned */
>> +    void (*vector_clipf)(float *dst, float min, float max, int len);
> 
> align requirements are generally writen like:
> 
> void (*get_pixels)(DCTELEM *block/*align 16*/, const uint8_t *pixels/*align 8*/, int line_size);

Changed, but in this case dsputils.h is inconsistent (see above)

>>      /* C version: convert floats from the range [384.0,386.0] to ints in [-32768,32767]
>>       * simd versions: convert floats from [-32768.0,32767.0] without rescaling and arrays are 16byte aligned */
>>      void (*float_to_int16)(int16_t *dst, const float *src, long len);
>> Index: libavcodec/x86/dsputil_mmx.c
>> ===================================================================
>> --- libavcodec/x86/dsputil_mmx.c	(revision 19598)
>> +++ libavcodec/x86/dsputil_mmx.c	(working copy)
>> @@ -2346,6 +2346,39 @@
>>      );
>>  }
>>  
>> +void vector_clipf_sse(float *dst, float min, float max, int len)
>> +{
>> +    x86_reg i = (len-16)*4;
>> +    __asm__ volatile(
>> +        "movss  %2, %%xmm4 \n"
>> +        "movss  %3, %%xmm5 \n"
>> +        "shufps $0, %%xmm4, %%xmm4 \n"
>> +        "shufps $0, %%xmm5, %%xmm5 \n"
>> +        "1: \n\t"
>> +        "movaps    (%1,%0), %%xmm0 \n\t" // 3/1 on intel
>> +        "movaps  16(%1,%0), %%xmm1 \n\t"
>> +        "movaps  32(%1,%0), %%xmm2 \n\t"
>> +        "movaps  48(%1,%0), %%xmm3 \n\t"
>> +        "maxps      %%xmm4, %%xmm0 \n\t"
>> +        "maxps      %%xmm4, %%xmm1 \n\t"
>> +        "maxps      %%xmm4, %%xmm2 \n\t"
>> +        "maxps      %%xmm4, %%xmm3 \n\t"
>> +        "minps      %%xmm5, %%xmm0 \n\t"
>> +        "minps      %%xmm5, %%xmm1 \n\t"
>> +        "minps      %%xmm5, %%xmm2 \n\t"
>> +        "minps      %%xmm5, %%xmm3 \n\t"
>> +        "movaps  %%xmm0,   (%1,%0) \n\t"
>> +        "movaps  %%xmm1, 16(%1,%0) \n\t"
>> +        "movaps  %%xmm2, 32(%1,%0) \n\t"
>> +        "movaps  %%xmm3, 48(%1,%0) \n\t"
>> +        "sub  $64, %0 \n\t"
> 
> did you benchmark the backward direction vs forward?

I've tried (both synthetic and twinvq dec) and found no measurable 
difference.

I've attached a new version that now accepts in != out. Also a patch for 
using it in qcelpdec.c.

-Vitor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dsp_vector_clipf2.diff
Type: text/x-diff
Size: 4136 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20090808/d86767ff/attachment.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dsp_qcelp.diff
Type: text/x-diff
Size: 1137 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20090808/d86767ff/attachment-0001.diff>



More information about the ffmpeg-devel mailing list