[FFmpeg-devel] [PATCH] iirfilter: Rework FILTER_O2() to improve performance.

Mon Jan 31 23:03:06 CET 2011

Justin Ruggles <justin.ruggles at gmail.com> writes:

> On 01/31/2011 04:19 PM, M?ns Rullg?rd wrote:
>
>> Justin Ruggles <justin.ruggles at gmail.com> writes:
>> 
>>> Store coefficient pointers locally.
>>> Store state locally, then save at end.
>>> Split up arithmetic.
>>> ---
>>>  libavcodec/iirfilter.c |   27 +++++++++++++++++++--------
>>>  1 files changed, 19 insertions(+), 8 deletions(-)
>>>
>>> I was able to gcc to do what I wanted, but it's not pretty.
>>>
>>> Also, if a simplified version of ff_iir_filter_flt() is added that
>>> doesn't read/save the state and always has sstep/dstep == 1, it can
>>> gain about another 3% speedup.  I'll save that for later though
>>> since it would be unnecessarily complicating something that's not
>>> even used yet.
>>>
>>> Athlon64
>>> current: 217731
>>> patched: 195939
>>> simple:  189951
>>>
>>> Atom
>>> current: 569708
>>> patched: 542916
>>> simple:  524855
>>>
>>> Cheers,
>>> Justin
>>>
>>> diff --git a/libavcodec/iirfilter.c b/libavcodec/iirfilter.c
>>> index bc63c39..8398e22 100644
>>> --- a/libavcodec/iirfilter.c
>>> +++ b/libavcodec/iirfilter.c
>>> @@ -257,19 +257,30 @@ av_cold struct FFIIRFilterState* ff_iir_filter_init_state(int order)
>>>  }
>>>  
>>>  #define FILTER_O2(type, fmt) {                                          \
>>> -    int i;                                                              \
>>>      const type *src0 = src;                                             \
>>>      type       *dst0 = dst;                                             \
>>> -    for (i = 0; i < size; i++) {                                        \
>>> -        float in = *src0   * c->gain  +                                 \
>>> -                   s->x[0] * c->cy[0] +                                 \
>>> -                   s->x[1] * c->cy[1];                                  \
>>> -        CONV_##fmt(*dst0, s->x[0] + in + s->x[1] * c->cx[1])            \
>>> -        s->x[0] = s->x[1];                                              \
>>> -        s->x[1] = in;                                                   \
>>> +    const float *g = &c->gain;                                          \
>> 
>> Why is this a pointer?
>> 
>>> +    const int   *x = c->cx;                                             \
>>> +    const float *y = c->cy;                                             \
>>> +    float s0 = s->x[0];                                                 \
>>> +    float s1 = s->x[1];                                                 \
>>> +    while (size--) {                                                    \
>>> +        float in  = *src0 * *g;                                         \
>>> +        float tmp = s0 * y[0];                                          \
>>> +        in += tmp;                                                      \
>>> +        tmp = s1 * y[1];                                                \
>>> +        in += tmp;                                                      \
>>> +        tmp = s1 * x[1];                                                \
>>> +        s0 += tmp;                                                      \
>>> +        s0 += in;                                                       \
>> 
>> Does using the 'tmp' variable really make it faster?  I have seen such
>> things with gcc before, but I have to ask.
>> 
>>> +        CONV_##fmt(*dst0, s0)                                           \
>>> +        s0 = s1;                                                        \
>>> +        s1 = in;                                                        \
>>>          src0 += sstep;                                                  \
>>>          dst0 += dstep;                                                  \
>>>      }                                                                   \
>>> +    s->x[0] = s0;                                                       \
>>> +    s->x[1] = s1;                                                       \
>>>  }
>>>  
>>>  void ff_iir_filter(const struct FFIIRFilterCoeffs *c,
>> 
>
> Everything that seems weird about the function is only because the
> obvious alternative is slower on my Athlon64.
>
> I basically started by disassembling the gcc output and converting it to
> yasm.  Then I tweaked and tweaked and tried every possible combination I
> could think of.  Some really random stuff was significantly slower.
> Once I got it to the fastest I could get it, I tried to get gcc to match
> the asm version as closely as possible.  Merging any of those tmp values
> makes it slower.  Changing any of the local pointers to values makes it
> slower.

I have a bad feeling about this.  Crazy effects like you describe here
tend to be very volatile, varying wildly between architectures and
compiler versions.  If there's a big improvement to be had, it is
better to write a proper asm function and leave the C code with
something sensible-looking.  Loading some values into local variables
can be expected to help due to potential aliasing being avoided.
Anything beyond that is entirely random, and might end up hurting one
CPU as much as it helps another.

-- 
M?ns Rullg?rd
mans at mansr.com