[FFmpeg-devel] [RFC] ac3dec: use dsputil.clear_block

Thu Jan 14 07:56:38 CET 2010

On Wed, Jan 13, 2010 at 08:54:37PM -0500, Justin Ruggles wrote:
> Michael Niedermayer wrote:
> > On Wed, Jan 13, 2010 at 06:32:27PM -0500, Justin Ruggles wrote:
> >> Reimar Döffinger wrote:
> >>
> >>> On Wed, Jan 13, 2010 at 11:42:27PM +0100, Michael Niedermayer wrote:
> >>>> On Wed, Jan 13, 2010 at 09:46:17PM +0100, Reimar Döffinger wrote:
> >>>>> Hello,
> >>>>> this gives an overall speedup of about 1.1 % on Intel Atom with my sample.
> >>>>> Testing with other CPUs and samples is very welcome; I suspect a slowdown
> >>>>> may be possible, besides it being a bit ugly.
> >>>> what happens with these coeffs afterwards?
> >>>> is it
> >>>> s->dsp.int32_to_float_fmul_scalar(s->transform_coeffs[ch], s->fixed_coeffs[ch], gain, 256);
> >>>> ?
> >>>> if so maybe that could be changed to not touch the supposed to be zero
> >>>> coeffs?
> >>> I had that idea as well, but I have to admit I do not know, nor whether I
> >>> will have the time to understand the code well enough to find out :-)
> >> Yes the only time they're used is when they're converted to float.  If
> >> we don't zero the integer coeffs, we will need to zero the float coeffs
> >> because the IMDCT uses all 256.
> > 
> > are the float coeffs ever written to besides in int32_to_float_fmul_scalar?
> > because if not you could keep track of the last non-zero coeff and just
> > zero up to that each time
> 
> The bandwidth can change from block to block, so there could be leftover
> higher frequency coeffs from the last block.
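
To make the point about leftover coeffs concrete, here is a minimal sketch of
the "remember how far we wrote last time" idea. It is not the actual ac3dec
code: convert_and_clear, end_freq and prev_end_freq are hypothetical names,
and it assumes the float buffer starts out zeroed and is written nowhere else.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_COEFS 256

/* Hypothetical helper: convert the first end_freq integer coeffs to float,
 * then clear only the float coeffs left over from a previous block that had
 * a larger bandwidth. Returns the value to pass as prev_end_freq next time. */
static int convert_and_clear(float *dst, const int32_t *src, float gain,
                             int end_freq, int prev_end_freq)
{
    int i;

    assert(end_freq >= 0 && end_freq <= NUM_COEFS);
    assert(prev_end_freq >= 0 && prev_end_freq <= NUM_COEFS);

    for (i = 0; i < end_freq; i++)
        dst[i] = src[i] * gain;

    /* Only the range [end_freq, prev_end_freq) can hold stale values. */
    if (prev_end_freq > end_freq)
        memset(dst + end_freq, 0,
               (prev_end_freq - end_freq) * sizeof(*dst));

    return end_freq;
}
```

Whether this beats clearing the integer coeffs up front would still have to be
measured; the sketch only shows that a changing bandwidth can be handled by
clearing the shrinking part.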

Note that these are all very minor optimizations. The really big chunk, with
35% of decoding time, is ac3_decode_transform_coeffs_ch, which I think is far
too much for what it does.
I guess it should be possible to improve it by splitting it into parts, since a
major issue seems to be register starvation. From first tests, I already expect
a 4% speedup from just making m->b* etc. local stack variables instead of
using them directly, and more if it's possible to make the compiler keep them
in a register. That might be enough to justify doing multiple passes over
baps on x86, though it might hurt on architectures with sufficient registers.
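
As a rough illustration of the local-variable idea (a sketch only, not the
real decoder: GroupState, its fields and fill_groups are made-up stand-ins
for the m->b* state), the pattern is to copy the state into locals before the
loop and write it back once afterwards:

```c
#include <assert.h>

/* Hypothetical stand-in for per-group decoder state. */
typedef struct {
    int left;   /* values still to be emitted from the current group */
    int value;  /* value shared by the current group */
} GroupState;

/* Emit n values, pulling a new code every 3 outputs. The state is copied
 * into locals so the compiler does not have to assume that stores to out[]
 * may alias m->left or m->value and reload them on every iteration. */
static void fill_groups(GroupState *m, int *out, const int *codes, int n)
{
    int left  = m->left;
    int value = m->value;
    int i;

    assert(m && out && codes && n >= 0);

    for (i = 0; i < n; i++) {
        if (!left) {
            value = *codes++;   /* start a new group of 3 */
            left  = 3;
        }
        out[i] = value;
        left--;
    }

    /* Write the state back once, after the loop. */
    m->left  = left;
    m->value = value;
}
```

Whether the locals actually end up in registers depends on the compiler and
the target, so any gain needs to be benchmarked per architecture.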