[FFmpeg-devel] [PATCH 3/3] Use DSPContext.vector_fmul() and DSPContext.vector_fmul_reverse() in floating-point version of apply_window(). 46% faster in function apply_window().

Tue Jan 4 17:31:11 CET 2011

On 01/01/2011 10:30 PM, Justin Ruggles wrote:

> On 01/01/2011 10:09 PM, Michael Niedermayer wrote:
> 
>> On Fri, Dec 31, 2010 at 03:11:40PM -0500, Justin Ruggles wrote:
>>> diff --git libavcodec/ac3enc_float.c libavcodec/ac3enc_float.c
>>> index 6a061d6..addc84f 100644
>>> --- libavcodec/ac3enc_float.c
>>> +++ libavcodec/ac3enc_float.c
>>> @@ -77,16 +77,13 @@ static void mdct512(AC3MDCTContext *mdct, float *out, float *in)
>>>  /**
>>>   * Apply KBD window to input samples prior to MDCT.
>>>   */
>>> -static void apply_window(float *output, const float *input,
>>> +static void apply_window(DSPContext *dsp, float *output, const float *input,
>>>                           const float *window, int n)
>>>  {
>>> -    int i;
>>>      int n2 = n >> 1;
>>> -
>>> -    for (i = 0; i < n2; i++) {
>>> -        output[i]     = input[i]     * window[i];
>>> -        output[n-i-1] = input[n-i-1] * window[i];
>>> -    }
>>> +    memcpy(output, input, n2 * sizeof(*input));
>>> +    dsp->vector_fmul(output, window, n2);
>>> +    dsp->vector_fmul_reverse(output+n2, input+n2, window, n2);
>>
>> The memcpy is ugly
> 
> 
> yeah, I know...  I'll see if I can implement a new version of
> vector_fmul that will handle different input from output and compare the
> speed.

Currently we have vector_fmul() for: C, neon, vfp, altivec, 3dnow, sse

I implemented vector_fmul_copy() for C, altivec, 3dnow, and sse. It
takes 2 src and 1 dst, so the memcpy is no longer needed. The Altivec
version of vector_fmul_copy() has not been tested, but I implemented it
in the hope that someone else will test and review it.  Here are some
benchmarks on my Athlon64. All numbers are in decicycles.

C (current SVN): 13366

memcpy(256) + vector_fmul(256) + vector_fmul_reverse(256)
    C: 18014
3DNow: 10193
  SSE:  8685

vector_fmul_copy(256) + vector_fmul_reverse(256)
    C: 16312
3DNow:  8682
  SSE:  7280

vector_fmul_copy(512)
    C: 16165
3DNow:  6043
  SSE:  6193
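
For reference, here is a minimal plain-C sketch of what the
vector_fmul_copy(256) + vector_fmul_reverse(256) combination computes.
It is not the code from the patch: the _c function names and
apply_window_sketch() are stand-ins made up for illustration, and the
reverse multiply is assumed to read its second source back to front.

```c
#include <stddef.h>

/* Stand-in for the 2-src/1-dst multiply: dst[i] = src0[i] * src1[i].
 * Hypothetical reference version, not the function from the patch. */
static void vector_fmul_copy_c(float *dst, const float *src0,
                               const float *src1, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src0[i] * src1[i];
}

/* Stand-in for the reverse multiply (assumed semantics):
 * dst[i] = src0[i] * src1[len - 1 - i]. */
static void vector_fmul_reverse_c(float *dst, const float *src0,
                                  const float *src1, int len)
{
    int i;
    src1 += len - 1;
    for (i = 0; i < len; i++)
        dst[i] = src0[i] * src1[-i];
}

/* Window n samples with a half-length window of n/2 coefficients.
 * The first half uses the window forwards, the second half backwards,
 * which matches output[n-i-1] = input[n-i-1] * window[i]. */
static void apply_window_sketch(float *output, const float *input,
                                const float *window, int n)
{
    int n2 = n >> 1;

    vector_fmul_copy_c(output, input, window, n2);
    vector_fmul_reverse_c(output + n2, input + n2, window, n2);
}
```

Since the destination is separate from both sources, nothing has to be
copied into output before the first multiply.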

Note that the 3DNow version of vector_fmul_copy(512) is faster than the
SSE version on my system (6043 vs. 6193) for some reason. I'm not sure
how to detect this case, or whether it holds for all CPUs, for all
Athlon64s, or only for mine.

I also tried to rewrite the current C version in SSE.  It was faster
than vector_fmul_copy(256) + vector_fmul_reverse(256) since it basically
merges the 2 loops, but it was slower than vector_fmul_copy(512).  I
left that out of the patch.  If anyone is interested I can send it.
Its benchmark:

vector_fmul_window2(512)
  SSE: 7021
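
To illustrate what merging the 2 loops means, here is a rough sketch
using SSE intrinsics. This is not the version I benchmarked: the name
apply_window_merged() is made up, it uses intrinsics with unaligned
loads rather than hand-written asm, and it assumes n is a multiple of 8.
A scalar fallback is included for builds without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical merged loop: each iteration loads 4 window coefficients
 * once and uses them for both ends of the block.
 * Assumes n is a multiple of 8 and output does not overlap input. */
static void apply_window_merged(float *output, const float *input,
                                const float *window, int n)
{
    int n2 = n >> 1;
    int i;

#if defined(__SSE__)
    for (i = 0; i < n2; i += 4) {
        __m128 w  = _mm_loadu_ps(window + i);
        __m128 lo = _mm_loadu_ps(input + i);
        __m128 hi = _mm_loadu_ps(input + n - i - 4);
        /* Reverse the 4 coefficients for the mirrored half. */
        __m128 wr = _mm_shuffle_ps(w, w, _MM_SHUFFLE(0, 1, 2, 3));

        _mm_storeu_ps(output + i,         _mm_mul_ps(lo, w));
        _mm_storeu_ps(output + n - i - 4, _mm_mul_ps(hi, wr));
    }
#else
    /* Scalar fallback, same as the current C loop. */
    for (i = 0; i < n2; i++) {
        output[i]         = input[i]         * window[i];
        output[n - i - 1] = input[n - i - 1] * window[i];
    }
#endif
}
```

The window is read only once per iteration, which is the saving over 2
separate passes, at the cost of a shuffle per 4 coefficients.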

Thanks,
Justin