[FFmpeg-devel] [RFC] AAC Encoder

Fri Aug 15 09:11:03 CEST 2008

On Thu, Aug 14, 2008 at 11:42:44PM +0200, Michael Niedermayer wrote:
> > 
> > enum AACPsyModelMode{
> >     PSY_MODE_CBR,              ///< follow bitrate as closely as possible
> >     PSY_MODE_ABR,              ///< try to achieve bitrate but actual bitrate may differ significantly
> >     PSY_MODE_QUALITY,          ///< try to achieve set quality instead of bitrate
> > };
> > 
> > #define PSY_MODEL_MODE_MASK  0x0000000F ///< bit fields for storing mode (CBR, ABR, VBR)
> 
> please use bitrate tolerance/bitrate/max/min bitrate/buffer size/...
> from AVCodecContext for selecting the mode

I will, but I will keep those for internal state. 
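
A minimal sketch of how the internal mode could be derived from the rate-control settings, so the enum stays purely internal. RateParams is a hypothetical stand-in that mirrors a few AVCodecContext fields, and the decision policy in pick_mode() is an assumption for illustration, not something taken from the patch:

```c
#include <assert.h>

/* Hypothetical stand-in for the rate-control fields of AVCodecContext;
 * only the fields this sketch reads are mirrored here. */
typedef struct RateParams {
    int bit_rate;     /* target bitrate in bits/s */
    int rc_max_rate;  /* maximum bitrate, 0 if unset */
    int rc_min_rate;  /* minimum bitrate, 0 if unset */
    int use_qscale;   /* nonzero if a fixed quality was requested */
} RateParams;

enum AACPsyModelMode {
    PSY_MODE_CBR,     /* follow bitrate as closely as possible */
    PSY_MODE_ABR,     /* aim for bitrate, actual bitrate may differ */
    PSY_MODE_QUALITY, /* aim for a set quality instead of a bitrate */
};

/* Assumed policy: a quality request wins; max == min == target means CBR;
 * everything else is treated as ABR. */
static enum AACPsyModelMode pick_mode(const RateParams *p)
{
    assert(p);
    if (p->use_qscale)
        return PSY_MODE_QUALITY;
    if (p->rc_max_rate > 0 &&
        p->rc_max_rate == p->rc_min_rate &&
        p->rc_max_rate == p->bit_rate)
        return PSY_MODE_CBR;
    return PSY_MODE_ABR;
}
```

The result would then be stored in the low bits of the model flags, as PSY_MODEL_MODE_MASK already provides for.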

> > #define PSY_MODEL_NO_PULSE   0x00000010 ///< disable pulse searching
> > #define PSY_MODEL_NO_SWITCH  0x00000020 ///< disable window switching
> > #define PSY_MODEL_NO_ST_ATT  0x00000040 ///< disable stereo attenuation
> > #define PSY_MODEL_NO_LOWPASS 0x00000080 ///< disable low-pass filtering
> 
> How does the user pass these to the codec?
> I suspect in AVCodecContext, if so above would be redundant and unneeded
> as AVCodecContext is available to the psy model

Huh? I haven't seen flags for such things in avcodec.h.
Even if the model takes its flags from the codec context, it needs to know their meaning.

> also i think that the choice of how to encode a coefficient, that is as a
> pulse or not is not a psychoacoustic question but one of entropy coding.
> "which way needs fewer bits has better RD"

Yes, I think it may be merged into determining the codebook sequence with the Viterbi algorithm
(i.e. as a weight for a codebook coded with pulses).
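
To illustrate the idea, here is a rough sketch of a Viterbi-style search over per-band coding candidates, where one candidate could stand for "codebook plus pulses". The names N_CB, MAX_BANDS and viterbi_codebooks(), the flat switch_cost, and the bit counts fed to it are all hypothetical; the real section overhead in AAC depends on section lengths, which this sketch ignores:

```c
#include <assert.h>
#include <limits.h>

#define N_CB      3   /* hypothetical number of coding candidates per band */
#define MAX_BANDS 64  /* hypothetical upper bound on bands per window */

/* bits[b][c]  - bits needed to code band b with candidate c
 * switch_cost - assumed flat penalty when adjacent bands use different candidates
 * choice[b]   - receives the chosen candidate for each band
 * Returns the minimal total bit count. */
static int viterbi_codebooks(int n_bands, const int bits[][N_CB],
                             int switch_cost, int *choice)
{
    int cost[MAX_BANDS][N_CB];
    int prev[MAX_BANDS][N_CB];
    int b, c, p, best, total;

    assert(n_bands > 0 && n_bands <= MAX_BANDS);

    for (c = 0; c < N_CB; c++) {
        cost[0][c] = bits[0][c];
        prev[0][c] = -1;
    }
    /* forward pass: cheapest way to reach each (band, candidate) state */
    for (b = 1; b < n_bands; b++) {
        for (c = 0; c < N_CB; c++) {
            int best_cost = INT_MAX, best_prev = 0;
            for (p = 0; p < N_CB; p++) {
                int t = cost[b - 1][p] + (p != c ? switch_cost : 0);
                if (t < best_cost) {
                    best_cost = t;
                    best_prev = p;
                }
            }
            cost[b][c] = best_cost + bits[b][c];
            prev[b][c] = best_prev;
        }
    }
    /* pick the cheapest final state and trace the path back */
    best = 0;
    for (c = 1; c < N_CB; c++)
        if (cost[n_bands - 1][c] < cost[n_bands - 1][best])
            best = c;
    total = cost[n_bands - 1][best];
    for (b = n_bands - 1; b >= 0; b--) {
        choice[b] = best;
        best = prev[b][best];
    }
    return total;
}
```

With this shape, pulse coding needs no separate psy-model flag: it is just one more column in bits[][] and gets selected only when it is cheaper overall.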

> > 
> > #define PSY_MODEL_NO_PREPROC (PSY_MODEL_NO_ST_ATT | PSY_MODEL_NO_LOWPASS)
> > 
> > #define PSY_MODEL_MODE(a)  ((a) & PSY_MODEL_MODE_MASK)
> > 
> > /**
> >  * context used by psychoacoustic model
> >  */
> > typedef struct AACPsyContext {
> >     AVCodecContext *avctx;            ///< encoder context
> > 
> >     int flags;                        ///< model flags
> 
> >     const uint8_t *bands1024;         ///< scalefactor band sizes for long (1024 samples) frame
> >     int num_bands1024;                ///< number of scalefactor bands for long frame
> >     const uint8_t *bands128;          ///< scalefactor band sizes for short (128 samples) frame
> >     int num_bands128;                 ///< number of scalefactor bands for short frame
> 
> This is a little AAC specific but then it's called AACPsyContext
> so I am not sure. Is the code supposed to be a generic psychoacoustic model
> or AAC specific?

AAC-specific. I think it's possible to make it more generic, but it will require
some radical changes, especially to the window switching code and scalefactors.

> [...]
> > /**
> >  * Convert coefficients to integers.
> >  * @return sum of coefficients
> >  * @see 3GPP TS26.403 5.6.2 "Scalefactor determination"
> >  */
> > static inline int convert_coeffs(float *in, int *out, int size, int scale_idx)
> 
> quantize_coeffs
> and scale_idx should be replaced by a quantization factor.
> 
> 
> > {
> >     int i, sign, sum = 0;
> >     for(i = 0; i < size; i++){
> >         sign = in[i] > 0.0;
> >         out[i] = (int)(pow(FFABS(in[i]) * ff_aac_pow2sf_tab[200 - scale_idx + SCALE_ONE_POS - SCALE_DIV_512], 0.75) + 0.4054);
> 
> fabs()
> 
> 
> >         out[i] = av_clip(out[i], 0, 8191);
> >         sum += out[i];
> >         if(sign) out[i] = -out[i];
> >     }
> >     return sum;
> > }
> 
> 
> 
> > 
> > static inline float unquant(int q, int scale_idx){
> >     return (FFABS(q) * cbrt(q*1.0)) * ff_aac_pow2sf_tab[200 + scale_idx - SCALE_ONE_POS];
> > }
> 
> also please replace scale_idx by a factor, repeatedly doing these lookups is
> likely inefficient, also it is inflexible in relation to non-AAC codecs
> 
> 
> > static inline float calc_distortion(float *c, int size, int scale_idx)
> > {
> >     int i;
> >     int q;
> >     float coef, unquant, sum = 0.0f;
> >     for(i = 0; i < size; i++){
> >         coef = FFABS(c[i]);
> >         q = (int)(pow(FFABS(coef) * ff_aac_pow2sf_tab[200 - scale_idx + SCALE_ONE_POS - SCALE_DIV_512], 0.75) + 0.4054);
> >         q = av_clip(q, 0, 8191);
> >         unquant = (q * cbrt(q)) * ff_aac_pow2sf_tab[200 + scale_idx - SCALE_ONE_POS + SCALE_DIV_512];
> >         sum += (coef - unquant) * (coef - unquant);
> >     }
> >     return sum;
> > }
> 
> I think this and previous functions have some common code that can be
> factorized out
> 
> 
> [...]
> > static void psy_null8_process(AACPsyContext *apc, int tag, int type, ChannelElement *cpe)
> > {
> >     int start;
> >     int w, ch, g, i;
> >     int chans = type == ID_CPE ? 2 : 1;
> > 
> >     //detect M/S
> >     if(chans > 1 && cpe->common_window){
> >         start = 0;
> >         for(w = 0; w < cpe->ch[0].ics.num_windows; w++){
> >             for(g = 0; g < cpe->ch[0].ics.num_swb; g++){
> >                 float diff = 0.0f;
> > 
> >                 for(i = 0; i < cpe->ch[0].ics.swb_sizes[g]; i++)
> >                     diff += fabs(cpe->ch[0].coeffs[start+i] - cpe->ch[1].coeffs[start+i]);
> >                 cpe->ms.mask[w][g] = diff == 0.0;
> >             }
> >         }
> >     }
> 
> the mid side bits should also be detected ideally by encoding both ways
> and choosing by rate distortion
> 
> above really looks a little lame, one should at least calculate either
> bits or distortion and choose based on that if both are not ...

This is just a sample model to exercise encoder capabilities.
I will include my working model next time.
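
For reference, a rough sketch of a per-band M/S decision that compares costs instead of testing for exact equality. The cost here is only a crude proxy (sum of magnitudes, with mid and side scaled by 1/sqrt(2) so both representations carry the same energy); it is an assumption of this sketch and not a real bit count or distortion measure. All function names are hypothetical:

```c
#include <assert.h>

static float absf(float x)
{
    return x < 0.0f ? -x : x;
}

/* Returns 1 if the M/S representation of this band looks cheaper than L/R
 * under the magnitude-sum proxy, 0 otherwise. */
static int ms_is_cheaper(const float *l, const float *r, int size)
{
    const float k = 0.70710678f; /* 1/sqrt(2), energy-preserving scaling */
    float lr_cost = 0.0f, ms_cost = 0.0f;
    int i;

    assert(size >= 0);
    for (i = 0; i < size; i++) {
        float m = (l[i] + r[i]) * k;
        float s = (l[i] - r[i]) * k;
        lr_cost += absf(l[i]) + absf(r[i]);
        ms_cost += absf(m) + absf(s);
    }
    return ms_cost < lr_cost;
}

/* Fill mask[g] for every band; start advances by the size of each band. */
static void detect_ms(const float *l, const float *r,
                      const unsigned char *band_sizes, int n_bands,
                      unsigned char *mask)
{
    int g, start = 0;
    for (g = 0; g < n_bands; g++) {
        mask[g] = (unsigned char)ms_is_cheaper(l + start, r + start, band_sizes[g]);
        start  += band_sizes[g];
    }
}
```

A real model would replace the proxy with the bit count or distortion obtained by actually coding the band both ways.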

> [...]
> -- 
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> it is not once nor twice but times without number that the same ideas make
> their appearance in the world. -- Aristotle