[FFmpeg-devel] Fwd: Fixpoint FFT optimization, with MDCT and IMDCT wrappers for audio optimization

Michael Niedermayer michaelni
Thu Aug 23 04:48:11 CEST 2007


Hi

On Wed, Aug 22, 2007 at 10:30:49PM -0400, Marc Hoffman wrote:
> 
> On Aug 22, 2007, at 10:03 PM, Michael Niedermayer wrote:
> 
> > Hi
> >
> > On Wed, Aug 22, 2007 at 08:31:12PM -0400, Marc Hoffman wrote:
> >> On 8/21/07, Michael Niedermayer <michaelni at gmx.at> wrote:
> >>> Hi
> >>>
> >>> On Tue, Aug 21, 2007 at 05:02:18PM -0400, Marc Hoffman wrote:
> >>>> Hi,
> >>>>
> >>>> On 8/19/07, Justin Ruggles <justinruggles at bellsouth.net> wrote:
> >>>>>
> >>>>>                      256    512   1024   2048    4096
> >>>>> -------------------------------------------------------
> >>>>> (1) fft32           46.1   101.0  205.0  494.4  1169.3
> >>>>> (2) fftr2            7.6    16.0   34.1  101.3   258.5
> >>>>> (3) fft ffmpeg       7.4    16.4   33.8   82.4   187.7
> >>>>> (4) fftr4/2          6.7    13.6   28.2   74.0   169.8
> >>>>>
> >>>>> (1) fixedpoint 32bit simple RAD2 fft
> >>>>> (2) simple rad2 first stage rad4
> >>>>> (3) current lavc RAD2, first 2 stages, and middle pulled
> >>>>> (4) rad4/2 with last stage rad2 if needed
> >>>>
> >>>> I'm adding a split radix to the list.  Its a lot harder to optimize
> >>>> because of the extra pointer manipulation and control  
> >>>> structures.  I
> >>>> did manage to unroll the W0 multipliers out but getting a clean
> >>>> iteration space has not happened yet.
> >>>
> >>> could you also add the split radix fft from
> >>> liba52-cvs/a52dec/liba52/imdct.c
> >>> as comparission
> >>>
> >>> also djbfft and fftw2 would be interresting
> >>> IIRC they all use split radix ffts
> >>>
> >>> and looking at them might help you with optimizing your code
> >>
> >> Your right if we do those techniques we would have to implement a
> >> different optimized FFT for every possible data size we need. Unless
> >> I'm missing something please advise.
> >
> > each of these implementations needs exactly 4 lines of code (not  
> > counting {}
> > in liba52)
> >
> > now all the ffts are n^2 based so if we assume that the largest we  
> > want
> > to support is 1mb then we would need 80 lines of code max
> >
> > and if you want to support all sizes up to 2^64 thats still just  
> > 256 lines
> > and if you use a macro well its just 64+4 lines then
> >
> > that said iam not saying this aproach should be taken but if you  
> > cant find
> > a faster and simpler solution then it might be a good choice
> 
> Michael,
> 
> This is a waste of our time, my judgement is to keep this structure  
> regular and simple.  If we make a mess with macros its just difficult  
> for us to debug and maintain.  I suggest we use the RAD4/RAD2 fft we  
> just developed its a good improvement over the existing code and much  
> more regular.  Our bigger issue is to hook the fft through  
> DSPContext, 

yes please dont waste our time, especially not mine, just import the
liba52 split radix fft under HAVE_GPL
any alternative you suggest will be benchmarked by me (we will see if your
rad4 code is faster then any random properly written split radix fft)


> and not struggle with the high level FFT we use IMHO.

you dont want to choose the fastest algorthm? fine, grab git ffmpeg and
commit it there then publish it somewhere, in main ffmpeg i will only
accept the best algorithms, if you can proof that rad4 is faster, fine
ill accept it but the benchmarks have to be against well written split
radix ffts not some code you dont show us and then start talking about
some mysterious additional pointers your split radix fft would need


> 
> The thing I have a much better appreciation for thanks to you, is how  
> much extra overhead a fixedpoint FFT has over a floating point one on  
> a floating point machine.   All the normalization code is just  
> expensive, do you have a good idea of how we might move in the  
> direction of fixedpoint but still maintain floating point performance  
> on floating point machines?

well first part you reduce the number of multiplications
split radix would do that
but i guess you can make x operations faster by low level optimizations
than half of the same operations to which the same low level optimizations
could be applied

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I hate to see young programmers poisoned by the kind of thinking
Ulrich Drepper puts forward since it is simply too narrow -- Roman Shaposhnik
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070823/19cda001/attachment.pgp>



More information about the ffmpeg-devel mailing list