[FFmpeg-devel] Fwd: Fixpoint FFT optimization, with MDCT and IMDCT wrappers for audio optimization
Marc Hoffman
mmh
Thu Aug 23 05:08:20 CEST 2007
On Aug 22, 2007, at 10:48 PM, Michael Niedermayer wrote:
> Hi
>
> On Wed, Aug 22, 2007 at 10:30:49PM -0400, Marc Hoffman wrote:
>>
>> On Aug 22, 2007, at 10:03 PM, Michael Niedermayer wrote:
>>
>>> Hi
>>>
>>> On Wed, Aug 22, 2007 at 08:31:12PM -0400, Marc Hoffman wrote:
>>>> On 8/21/07, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>>> Hi
>>>>>
>>>>> On Tue, Aug 21, 2007 at 05:02:18PM -0400, Marc Hoffman wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/19/07, Justin Ruggles <justinruggles at bellsouth.net> wrote:
>>>>>>>
>>>>>>> 256 512 1024 2048 4096
>>>>>>> -------------------------------------------------------
>>>>>>> (1) fft32 46.1 101.0 205.0 494.4 1169.3
>>>>>>> (2) fftr2 7.6 16.0 34.1 101.3 258.5
>>>>>>> (3) fft ffmpeg 7.4 16.4 33.8 82.4 187.7
>>>>>>> (4) fftr4/2 6.7 13.6 28.2 74.0 169.8
>>>>>>>
>>>>>>> (1) fixedpoint 32bit simple RAD2 fft
>>>>>>> (2) simple rad2 first stage rad4
>>>>>>> (3) current lavc RAD2, first 2 stages, and middle pulled
>>>>>>> (4) rad4/2 with last stage rad2 if needed
>>>>>>
>>>>>> I'm adding a split radix to the list. Its a lot harder to
>>>>>> optimize
>>>>>> because of the extra pointer manipulation and control
>>>>>> structures. I
>>>>>> did manage to unroll the W0 multipliers out but getting a clean
>>>>>> iteration space has not happened yet.
>>>>>
>>>>> could you also add the split radix fft from
>>>>> liba52-cvs/a52dec/liba52/imdct.c
>>>>> as comparission
>>>>>
>>>>> also djbfft and fftw2 would be interresting
>>>>> IIRC they all use split radix ffts
>>>>>
>>>>> and looking at them might help you with optimizing your code
>>>>
>>>> Your right if we do those techniques we would have to implement a
>>>> different optimized FFT for every possible data size we need.
>>>> Unless
>>>> I'm missing something please advise.
>>>
>>> each of these implementations needs exactly 4 lines of code (not
>>> counting {}
>>> in liba52)
>>>
>>> now all the ffts are n^2 based so if we assume that the largest we
>>> want
>>> to support is 1mb then we would need 80 lines of code max
>>>
>>> and if you want to support all sizes up to 2^64 thats still just
>>> 256 lines
>>> and if you use a macro well its just 64+4 lines then
>>>
>>> that said iam not saying this aproach should be taken but if you
>>> cant find
>>> a faster and simpler solution then it might be a good choice
>>
>> Michael,
>>
>> This is a waste of our time, my judgement is to keep this structure
>> regular and simple. If we make a mess with macros its just difficult
>> for us to debug and maintain. I suggest we use the RAD4/RAD2 fft we
>> just developed its a good improvement over the existing code and much
>> more regular. Our bigger issue is to hook the fft through
>> DSPContext,
>
> yes please dont waste our time, especially not mine, just import the
> liba52 split radix fft under HAVE_GPL
> any alternative you suggest will be benchmarked by me (we will see
> if your
> rad4 code is faster then any random properly written split radix fft)
>
>
>> and not struggle with the high level FFT we use IMHO.
>
> you dont want to choose the fastest algorthm? fine, grab git ffmpeg
> and
> commit it there then publish it somewhere, in main ffmpeg i will only
> accept the best algorithms, if you can proof that rad4 is faster, fine
> ill accept it but the benchmarks have to be against well written split
> radix ffts not some code you dont show us and then start talking about
> some mysterious additional pointers your split radix fft would need
>
>
>>
>> The thing I have a much better appreciation for thanks to you, is how
>> much extra overhead a fixedpoint FFT has over a floating point one on
>> a floating point machine. All the normalization code is just
>> expensive, do you have a good idea of how we might move in the
>> direction of fixedpoint but still maintain floating point performance
>> on floating point machines?
>
> well first part you reduce the number of multiplications
> split radix would do that
> but i guess you can make x operations faster by low level
> optimizations
> than half of the same operations to which the same low level
> optimizations
> could be applied
>
int xditfft (FFTContext *c, FFTComplex *X)
{
int i,j,k,m,n,is,id,n2,n4;
int i0,i1,i2,i3;
float r1,s1;
int s, M;
FFTComplex *W = c->exptabr4;
FFTComplex x2,x3,t0,t1;
FFTComplex w0,w3;
n = 1<<c->nbits;
m = c->nbits;
is = 0;
id = 4;
while (is < n) {
for (i0 = is; i0 < n; i0 = i0 + id) {
i1 = i0 + 1;
r1 = X[i1].re;
s1 = X[i1].im;
X[i1].re = X[i0].re - r1;
X[i1].im = X[i0].im - s1;
X[i0].re = X[i0].re + r1;
X[i0].im = X[i0].im + s1;
}
is = 2*id - 2;
id = 4 * id;
}
n2 = 2;
s = n>>1;
for (k=1; k < m ; k++) {
s = s >> 1;
n2 = n2 << 1;
n4 = n2 >> 2;
is = 0;
id = 2 * n2;
while ( is < n ) {
for (i0 = is; i0 < n-1; i0 = i0 + id) {
i1 = i0 + n4;
i2 = i1 + n4;
i3 = i2 + n4;
t0.re = X[i2].re + X[i3].re;
t0.im = X[i2].im + X[i3].im;
t1.re = X[i2].re - X[i3].re;
t1.im = X[i2].im - X[i3].im;
X[i2].re = X[i0].re - t0.re;
X[i0].re = X[i0].re + t0.re;
X[i3].re = X[i1].re - t1.im;
X[i1].re = X[i1].re + t1.im;
X[i2].im = X[i0].im - t0.im;
X[i0].im = X[i0].im + t0.im;
X[i3].im = X[i1].im + t1.re;
X[i1].im = X[i1].im - t1.re;
}
is = 2 * id - n2 ;
id = 4 * id;
}
for (j = 1; j< n4 ; j++) {
w0 = W[3*s*j];
w3 = W[3*s*j+2];
is = j;
id = 2 * n2;
while ( is < n ) {
for (i0 = is; i0 < n-1; i0 = i0 + id) {
i1 = i0 + n4;
i2 = i1 + n4;
i3 = i2 + n4;
x2.re = X[i2].re*w0.re;
x2.re -= X[i2].im*w0.im;
x2.im = X[i2].re*w0.im;
x2.im += X[i2].im*w0.re;
x3.re = X[i3].re*w3.re;
x3.re -= X[i3].im*w3.im;
x3.im = X[i3].re*w3.im;
x3.im += X[i3].im*w3.re;
t0.re = x2.re + x3.re;
t0.im = x2.im + x3.im;
t1.re = x2.re - x3.re;
t1.im = x2.im - x3.im;
X[i2].re = X[i0].re - t0.re;
X[i0].re = X[i0].re + t0.re;
X[i3].re = X[i1].re - t1.im;
X[i1].re = X[i1].re + t1.im;
X[i2].im = X[i0].im - t0.im;
X[i0].im = X[i0].im + t0.im;
X[i3].im = X[i1].im + t1.re;
X[i1].im = X[i1].im - t1.re;
}
is = 2 * id - n2 + j ;
id = 4 * id;
}
}
}
}
More information about the ffmpeg-devel
mailing list