[FFmpeg-devel] Fwd: Fixpoint FFT optimization, with MDCT and IMDCT wrappers for audio optimization

Thu Aug 23 05:08:20 CEST 2007

On Aug 22, 2007, at 10:48 PM, Michael Niedermayer wrote:

> Hi
>
> On Wed, Aug 22, 2007 at 10:30:49PM -0400, Marc Hoffman wrote:
>>
>> On Aug 22, 2007, at 10:03 PM, Michael Niedermayer wrote:
>>
>>> Hi
>>>
>>> On Wed, Aug 22, 2007 at 08:31:12PM -0400, Marc Hoffman wrote:
>>>> On 8/21/07, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>>> Hi
>>>>>
>>>>> On Tue, Aug 21, 2007 at 05:02:18PM -0400, Marc Hoffman wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/19/07, Justin Ruggles <justinruggles at bellsouth.net> wrote:
>>>>>>>
>>>>>>>                      256    512   1024   2048    4096
>>>>>>> -------------------------------------------------------
>>>>>>> (1) fft32           46.1   101.0  205.0  494.4  1169.3
>>>>>>> (2) fftr2            7.6    16.0   34.1  101.3   258.5
>>>>>>> (3) fft ffmpeg       7.4    16.4   33.8   82.4   187.7
>>>>>>> (4) fftr4/2          6.7    13.6   28.2   74.0   169.8
>>>>>>>
>>>>>>> (1) fixedpoint 32bit simple RAD2 fft
>>>>>>> (2) simple rad2 first stage rad4
>>>>>>> (3) current lavc RAD2, first 2 stages, and middle pulled
>>>>>>> (4) rad4/2 with last stage rad2 if needed
>>>>>>
>>>>>> I'm adding a split radix to the list.  Its a lot harder to  
>>>>>> optimize
>>>>>> because of the extra pointer manipulation and control
>>>>>> structures.  I
>>>>>> did manage to unroll the W0 multipliers out but getting a clean
>>>>>> iteration space has not happened yet.
>>>>>
>>>>> could you also add the split radix fft from
>>>>> liba52-cvs/a52dec/liba52/imdct.c
>>>>> as comparission
>>>>>
>>>>> also djbfft and fftw2 would be interresting
>>>>> IIRC they all use split radix ffts
>>>>>
>>>>> and looking at them might help you with optimizing your code
>>>>
>>>> Your right if we do those techniques we would have to implement a
>>>> different optimized FFT for every possible data size we need.  
>>>> Unless
>>>> I'm missing something please advise.
>>>
>>> each of these implementations needs exactly 4 lines of code (not
>>> counting {}
>>> in liba52)
>>>
>>> now all the ffts are n^2 based so if we assume that the largest we
>>> want
>>> to support is 1mb then we would need 80 lines of code max
>>>
>>> and if you want to support all sizes up to 2^64 thats still just
>>> 256 lines
>>> and if you use a macro well its just 64+4 lines then
>>>
>>> that said iam not saying this aproach should be taken but if you
>>> cant find
>>> a faster and simpler solution then it might be a good choice
>>
>> Michael,
>>
>> This is a waste of our time, my judgement is to keep this structure
>> regular and simple.  If we make a mess with macros its just difficult
>> for us to debug and maintain.  I suggest we use the RAD4/RAD2 fft we
>> just developed its a good improvement over the existing code and much
>> more regular.  Our bigger issue is to hook the fft through
>> DSPContext,
>
> yes please dont waste our time, especially not mine, just import the
> liba52 split radix fft under HAVE_GPL
> any alternative you suggest will be benchmarked by me (we will see  
> if your
> rad4 code is faster then any random properly written split radix fft)
>
>
>> and not struggle with the high level FFT we use IMHO.
>
> you dont want to choose the fastest algorthm? fine, grab git ffmpeg  
> and
> commit it there then publish it somewhere, in main ffmpeg i will only
> accept the best algorithms, if you can proof that rad4 is faster, fine
> ill accept it but the benchmarks have to be against well written split
> radix ffts not some code you dont show us and then start talking about
> some mysterious additional pointers your split radix fft would need
>
>
>>
>> The thing I have a much better appreciation for thanks to you, is how
>> much extra overhead a fixedpoint FFT has over a floating point one on
>> a floating point machine.   All the normalization code is just
>> expensive, do you have a good idea of how we might move in the
>> direction of fixedpoint but still maintain floating point performance
>> on floating point machines?
>
> well first part you reduce the number of multiplications
> split radix would do that
> but i guess you can make x operations faster by low level  
> optimizations
> than half of the same operations to which the same low level  
> optimizations
> could be applied
>

int xditfft (FFTContext *c, FFTComplex *X)
{
     int i,j,k,m,n,is,id,n2,n4;
     int i0,i1,i2,i3;
     float r1,s1;
     int s, M;
     FFTComplex *W = c->exptabr4;
     FFTComplex x2,x3,t0,t1;
     FFTComplex w0,w3;

     n = 1<<c->nbits;
     m = c->nbits;

     is = 0;
     id = 4;

     while (is < n) {
         for (i0 = is; i0 < n; i0 = i0 + id) {
             i1 = i0 + 1;
             r1 = X[i1].re;
             s1 = X[i1].im;

             X[i1].re = X[i0].re - r1;
             X[i1].im = X[i0].im - s1;
             X[i0].re = X[i0].re + r1;
             X[i0].im = X[i0].im + s1;
         }
         is = 2*id - 2;
         id = 4 * id;
     }

     n2 = 2;

     s = n>>1;
     for (k=1; k < m ; k++) {
         s  = s  >> 1;
         n2 = n2 << 1;
         n4 = n2 >> 2;

         is = 0;
         id = 2 * n2;

         while ( is < n ) {
             for (i0 = is; i0 < n-1; i0 = i0 + id) {
                 i1 = i0 + n4;
                 i2 = i1 + n4;
                 i3 = i2 + n4;

                 t0.re = X[i2].re + X[i3].re;
                 t0.im = X[i2].im + X[i3].im;

                 t1.re = X[i2].re - X[i3].re;
                 t1.im = X[i2].im - X[i3].im;

                 X[i2].re = X[i0].re - t0.re;
                 X[i0].re = X[i0].re + t0.re;
                 X[i3].re = X[i1].re - t1.im;
                 X[i1].re = X[i1].re + t1.im;

                 X[i2].im = X[i0].im - t0.im;
                 X[i0].im = X[i0].im + t0.im;
                 X[i3].im = X[i1].im + t1.re;
                 X[i1].im = X[i1].im - t1.re;

             }
             is = 2 * id - n2 ;
             id = 4 * id;
         }

         for (j = 1; j< n4 ; j++) {
             w0 = W[3*s*j];
             w3 = W[3*s*j+2];

             is = j;
             id = 2 * n2;

             while ( is < n ) {

                 for (i0 = is; i0 < n-1; i0 = i0 + id) {
                     i1 = i0 + n4;
                     i2 = i1 + n4;
                     i3 = i2 + n4;

                     x2.re  = X[i2].re*w0.re;
                     x2.re -= X[i2].im*w0.im;
                     x2.im  = X[i2].re*w0.im;
                     x2.im += X[i2].im*w0.re;

                     x3.re  = X[i3].re*w3.re;
                     x3.re -= X[i3].im*w3.im;
                     x3.im  = X[i3].re*w3.im;
                     x3.im += X[i3].im*w3.re;

                     t0.re = x2.re + x3.re;
                     t0.im = x2.im + x3.im;

                     t1.re = x2.re - x3.re;
                     t1.im = x2.im - x3.im;

                     X[i2].re = X[i0].re - t0.re;
                     X[i0].re = X[i0].re + t0.re;
                     X[i3].re = X[i1].re - t1.im;
                     X[i1].re = X[i1].re + t1.im;

                     X[i2].im = X[i0].im - t0.im;
                     X[i0].im = X[i0].im + t0.im;
                     X[i3].im = X[i1].im + t1.re;
                     X[i1].im = X[i1].im - t1.re;

                 }
                 is = 2 * id - n2 + j ;
                 id = 4 * id;
             }
         }
     }
}