[FFmpeg-devel] [RFC] AAC Encoder

Wed Aug 13 08:16:48 CEST 2008

On Tue, Aug 12, 2008 at 07:48:59PM +0200, Michael Niedermayer wrote:
> On Tue, Aug 12, 2008 at 08:09:36PM +0300, Kostya wrote:
> > On Tue, Aug 12, 2008 at 02:14:20PM +0200, Michael Niedermayer wrote:
[...]
> > > We have a problem here, because this isn't optimal.
> > > It seems we agree that each bit counts the same no matter what psy says.
> > > Maybe an example will best show the problem:
> > > let's assume we have a coeff of 11.5, the psy model decides that a change
> > > to 10 would be OK for the given audio quality/bitrate and thus outputs 10.
> > > Let us assume that storing a coefficient of 10 and one of 11 both take
> > > 7 bits; then the decision to store 10 clearly was bad. OTOH it could have
> > > been that storing 11 requires twice as many bits, in which case the
> > > decision would have been good. One simply cannot quantize values optimally
> > > without considering the number of bits they need. This is even more true
> > > for vector quantization based codecs than it is for scalar quantization.
> > > It may very well be that psy thinks that both {-1,1} and {-2,0} are an
> > > equally good representation of the exact {-1.5,0.5}, but it's not until
> > > the encoding that it becomes known which of the two needs fewer bits.
> > > 
> > > I'd say the psy model should return an array of perceptual weights W[i]
> > > and the bitstream encoder should choose the (global) minimum of
> > > bits[i] + distortion(W[i], coeff[i]-stored[i])
> > > where distortion is an appropriate function whose output matches how
> > > audible a change is. This may be a simple W[i]*(coeff[i]-stored[i])^2,
> > > but I am no psychoacoustic expert, so there may be better choices.
> > > 
> > > And of course the suggested system above needs to be compared to what you
> > > have currently so that we can be sure it really does sound better.
> > 
> > I understand what you mean but I suspect that is of complexity O("shaving piglets").
> > 
> > I followed 3GPP TS 26.403, which relies on perceptual entropy (which more
> > or less corresponds to the number of bits needed for coding), since that
> > is easier. Anyway, it would be easy to implement a psy model that will
> > consider real coding cost vs. distortion.
> 
> If you do not want to implement this then I will have to investigate whether
> it is doable or not, and why. Could you provide me with a more elaborate
> explanation of where the problem is?

Current scheme (just to clarify things a bit):
1. encoder calls psy model functions to preprocess data
2. then encoder calls psy model to determine frame and window type
3. based on psy model suggestions, encoder performs windowing and MDCT
4. encoder feeds coefficients to psy model
5. psy model by some magic determines scalefactors and uses them to convert
   coefficients into integer form
6. encoder encodes obtained scalefactors and integer coefficients
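
To make step 5 a little less magical, here is a minimal sketch of how a
scalefactor turns a coefficient into an integer. It assumes the usual AAC
power-law quantiser (3/4 exponent, a step of 2^(scalefactor/4) and a rounding
offset of 0.4054) and ignores any scalefactor offset. The function names are
hypothetical and this is not the code from the patch.

```python
def quantize(coeffs, scalefactor):
    """Map MDCT coefficients to integers with an AAC-style power law.

    Assumed form: q = sign(c) * int((|c| * 2^(-sf/4))^(3/4) + 0.4054).
    """
    gain = 2.0 ** (-scalefactor / 4.0)
    out = []
    for c in coeffs:
        q = int((abs(c) * gain) ** 0.75 + 0.4054)
        out.append(-q if c < 0 else q)
    return out


def dequantize(quants, scalefactor):
    """Inverse mapping the decoder would apply: sign(q) * |q|^(4/3) * 2^(sf/4)."""
    gain = 2.0 ** (scalefactor / 4.0)
    return [(-1.0 if q < 0 else 1.0) * abs(q) ** (4.0 / 3.0) * gain
            for q in quants]


print(quantize([0.0, 1.0, -1.0, 16.0], 0))
print(quantize([16.0], 4))
```

A larger scalefactor shrinks the gain, so the same coefficient maps to a
smaller integer and is reconstructed more coarsely.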

There are 11 codebooks for AAC, each designed to code either pairs or quads
of values, with the sign either coded separately or incorporated into the
value, and each has a maximum value limit.
While it's feasible to find the best encoding (e.g. take a raw coeff, quantize
it and round up or down, then see which vector takes fewer bits), I feel
it would be too slow.
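
A minimal sketch of that search, combined with the bits + W[i]*error^2 cost
quoted above, could look as follows. The bit-cost table is made up for
illustration and is not a real AAC codebook; `best_vector` and
`toy_bit_cost` are hypothetical names.

```python
from itertools import product
from math import ceil, floor


def best_vector(coeffs, weights, bit_cost):
    """Try rounding every scaled coefficient down and up, and return the
    integer vector that minimises bits + weighted squared error."""
    choices = [sorted({floor(c), ceil(c)}) for c in coeffs]
    best, best_cost = None, float("inf")
    for cand in product(*choices):
        distortion = sum(w * (c - q) ** 2
                         for w, c, q in zip(weights, coeffs, cand))
        cost = bit_cost(cand) + distortion
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost


# Hypothetical bit costs for the four candidate pairs (NOT a real codebook).
TOY_BITS = {(-2, 0): 4, (-2, 1): 7, (-1, 0): 5, (-1, 1): 6}


def toy_bit_cost(vec):
    return TOY_BITS[vec]


best, cost = best_vector([-1.5, 0.5], [1.0, 1.0], toy_bit_cost)
print(best, cost)
```

All four roundings of {-1.5,0.5} have the same distortion here, so the bit
cost alone decides. The search tries 2^n roundings per n-value vector (4 for
pairs, 16 for quads), and that has to be repeated for every candidate
codebook, which is where the speed concern comes from.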
