[FFmpeg-devel] IDCT permutation (was: pre discussion around Blackfin dct_quantize_bfin routine)

Marc Hoffman mmhoffm
Thu Jun 14 14:12:47 CEST 2007


Hi,

On 6/14/07, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> On Wednesday 13 June 2007 21:14, Michael Niedermayer wrote:
> > Hi
> >
> > On Wed, Jun 13, 2007 at 10:57:26AM +0300, Siarhei Siamashka wrote:
> > [...]
> >
> > > > also, decoding involves a mandatory permutation,
> > > > so no matter what idct_permutation is set to, it will be the same speed,
> > > > and wisely setting the idct permutation can simplify the idct and thus
> > > > speed it up; this is a high-level optimization and won't make the code
> > > > slower no matter how expensive the permutation is, as there aren't more
> > > > permutations done
> > >
> > > Looking at the ffmpeg code, this does not seem to be entirely true...
> > >
> > > > the extra cost is just on the encoder side, where it's just a single
> > > > if() in the no-permutation case ...
> > >
> > > Please check the patch which is attached. It was generated by a ruby
> > > script which is also attached:
> >
> > [...]
> >
> > > Before patch:
> > >
> > > $ ./mplayer.orig -nosound -quiet -benchmark -vo null -loop 3 \
> > >     /media/mmc1/Video/MissionImpossible3_Trailer4.divx | grep BENCHMARKs
> > > BENCHMARKs: VC:  89.976s VO:   0.034s A:   0.000s Sys:   1.089s =   91.098s
> > > BENCHMARKs: VC:  93.419s VO:   0.033s A:   0.000s Sys:   1.069s =   94.521s
> > > BENCHMARKs: VC:  93.307s VO:   0.032s A:   0.000s Sys:   1.078s =   94.418s
> > >
> > > After patch:
> > >
> > > ~ $ ./mplayer.patched -nosound -quiet -benchmark -vo null -loop 3 \
> > >     /media/mmc1/Video/MissionImpossible3_Trailer4.divx | grep BENCHMARKs
> > > BENCHMARKs: VC:  87.998s VO:   0.036s A:   0.000s Sys:   1.086s =   89.120s
> > > BENCHMARKs: VC:  91.074s VO:   0.035s A:   0.000s Sys:   1.069s =   92.177s
> > > BENCHMARKs: VC:  91.377s VO:   0.036s A:   0.000s Sys:   1.069s =   92.482s
> >
> > I do not believe that the changes in the patch (95% of them in very rarely
> > executed init code) directly caused this difference;
> > the only part which might have caused it is the ac prediction,
>
> This all makes it very interesting. The difference on x86 is barely
> noticeable (tested on an Athlon XP). I wonder what causes such an
> effect on ARM? Maybe the data cache (only 16K) is getting thrashed
> heavily during video decoding, causing cache misses on the permutation
> table lookups. Or perhaps removing the table lookup makes the code
> simpler and the optimizer is suddenly able to allocate registers
> better all over the function, resulting in better performance...
> This all can, and probably needs to be, verified. At least simulations
> with callgrind can provide some insight into cache usage and statistics
> about the overall number of permutation table lookups done and the places
> where they occur most (that permutation lookup macro can be replaced
> with a function for testing purposes).
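The macro-to-function idea mentioned above can be sketched as follows; the macro and all names here are hypothetical illustrations, not FFmpeg's actual code:

```c
#include <stdint.h>

/* Hypothetical stand-in for a permutation lookup macro such as
 * #define PERM_LOOKUP(tab, i) ((tab)[(i)])
 * Turning the macro into a real function lets a profiler like callgrind
 * attribute cache misses to the lookup itself and count the calls. */

static uint8_t idct_permutation[64];   /* filled in at init time */
static unsigned long perm_lookups;     /* global call counter for statistics */

static int perm_lookup(int i)
{
    perm_lookups++;                    /* count every table lookup */
    return idct_permutation[i];
}
```

Running the decoder under callgrind with this in place concentrates the lookup cost in one symbol, and `perm_lookups` gives the overall number of lookups performed.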
>
> On a related note, optimizing the IDCT on ARM seems to have a much
> smaller effect than on x86. Even removing the IDCT completely (just as
> a test) does not affect performance much. It looks like a lot of time is
> spent in ac prediction and other decoder parts. Investigating what
> happens there may provide some interesting information about what can
> be improved.
>
> > if this
> > really is that critical I am sure there are more sensible ways to optimize
> > it; keep in mind we are permuting a lot of zeros around and we know where
> > the last non-zero element is
>
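The last-non-zero observation can be sketched like this; the interface and names are assumptions for illustration, not FFmpeg's actual code:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: permute a 64-coefficient block, touching only coefficients up
 * to the last non-zero one (its index is known from entropy decoding);
 * everything after it is already zero, so there is no need to move it. */
static void permute_block(int16_t *block, const uint8_t *perm, int last_index)
{
    int16_t tmp[64] = { 0 };           /* zeros beyond last_index stay zero */
    for (int i = 0; i <= last_index; i++)
        tmp[perm[i]] = block[i];       /* scatter into permuted positions */
    memcpy(block, tmp, sizeof(tmp));
}
```

For the typical sparse block, this shortens the loop from 64 iterations to just the few non-zero coefficients.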
> > also your patch breaks half of the IDCTs if the permutation is ignored
>
> It might make sense to sacrifice these IDCTs in some configurations to
> gain some more performance in practice (if the improvement is worth it).
> But I agree that a more generic solution would be better.
>
> Performing a table lookup only to get the same value back for non-permuting
> IDCTs causes some overhead (whether it is big enough to worry about is
> another matter). To get the other IDCTs working, it should be possible to do
> the permutation just before calling the IDCT (yes, that would be less
> efficient and slower). This way the code would favour non-permuting IDCTs
> and cause a slowdown for all the others. Anyway, supporting all this is hardly
> interesting for mainstream architectures such as x86. I just wonder how
> Blackfin or other simple-pipeline processors behave in this respect; that's
> why I posted this patch for testing in the Blackfin discussion thread.
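The alternative described in the paragraph above -- storing coefficients in natural order and permuting only when the chosen IDCT requires it -- might look roughly like this (all names are hypothetical, not FFmpeg's actual interface):

```c
#include <stdint.h>
#include <string.h>

typedef void (*idct_fn)(int16_t *block);

/* Placeholder IDCT for illustration; a real one transforms the block. */
static void idct_stub(int16_t *block) { (void)block; }

/* Sketch: coefficients are kept unpermuted during decoding, and the
 * permutation is applied only just before calling an IDCT that expects
 * permuted input. */
static void idct_wrapper(int16_t *block, const uint8_t *perm,
                         int needs_perm, idct_fn idct)
{
    if (needs_perm) {                  /* extra pass only for permuting IDCTs */
        int16_t tmp[64];
        for (int i = 0; i < 64; i++)
            tmp[perm[i]] = block[i];
        memcpy(block, tmp, sizeof(tmp));
    }
    idct(block);                       /* no-permutation IDCTs pay nothing */
}
```

The `if (needs_perm)` branch is the "single if()" trade-off discussed earlier: non-permuting IDCTs skip it entirely, while permuting IDCTs pay for one extra pass over the block.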

Are the ARM DSP optimizations complete from your point of view, and is
this now where you're planning to optimize the code to get the biggest
bang for the buck?  What is the memory hierarchy configuration of the
ARM you are using?

Thanks
Marc



