[Ffmpeg-devel] Fixed point arithmetic RealAudio G2 (cook) decoder, v2

Michael Niedermayer michaelni
Fri Mar 23 23:52:21 CET 2007


Hi

On Wed, Mar 21, 2007 at 09:59:13PM +0100, Ian Braithwaite wrote:
> Hi,
> 
> 
> Here's the second iteration of my fixed point COOK decoder.
> 
> Many thanks to Michael and Benjamin for the code reviews.
> I've updated the patch accordingly, here are some comments to the comments.
[...]
> Michael Niedermayer writes:
> >> diff -upN -x 'ff*' coef/cook_fixp_mdct.h fixpoint/cook_fixp_mdct.h
> >> --- coef/cook_fixp_mdct.h        1970-01-01 01:00:00.000000000 +0100
> >> +++ fixpoint/cook_fixp_mdct.h        2007-03-08 11:10:13.000000000 +0100
> >
> > adding a generic fixpoint mdct to lavc MUST be a separate patch&mail&commit
> > (this part also doesn't fall under Benjamin's maintainership but mine)
> 
> and Benjamin Larsson writes:
> > The fixedpoint math
> > operations should be moved to some generic fixedpoint.h file for the
> > possible use in other codecs. The fixedpoint mdct should be moved to
> > mdct.h.
> 
> I don't see it as a generic fixpoint MDCT, at least not yet!
> I made some choices when implementing the decoder, for example to use
> 32 bit signed integers for variables and 16 bit unsigneds for coefficients.
> I borrowed the MDCT from Tremor and adapted it for this.
> I also removed some coefficient interpolation used for very large transforms,
> that COOK doesn't need.
> Turned out to be OK here, but I don't know how suitable it would be for
> anything else.
> 
> If you still want the fixpoint math + MDCT submitted as generics, I
> would really appreciate some tips or pointers as to how it should be done.

use the math ops from mathops.h, and if anything you wrote is faster, replace
the specific one in mathops.h. also don't hesitate to suggest additions to
mathops.h
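For reference, the portable C fallbacks in mathops.h are essentially of this shape (a sketch of the style, not the exact lavc code):

```c
#include <stdint.h>
#include <assert.h>

/* sketch of a mathops.h-style helper: high 32 bits of a 32x32->64 product;
   per-arch asm versions can override this under an appropriate #ifdef */
static inline int MULH(int a, int b)
{
    return ((int64_t)a * b) >> 32;
}
```
a codec then uses the generic name and automatically picks up whatever asm override exists for the target arch.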


[...]
> Michael Niedermayer writes:
> >> +/**
> >> + * Fixed point multiply by fraction.
> >> + *
> >> + * @param a                     fix point value
> >> + * @param b                     fix point fraction, 0 <= b < 1
> >> + */
> >> +static inline FIXP fixp_mult_su(FIXP a, FIXPU b)
> >> +{
> >> +    int32_t hb = (a >> 16) * b;
> >> +    uint32_t lb = (a & 0xffff) * b;
> >> +
> >> +    return hb + (lb >> 16) + ((lb & 0x8000) >> 15);
> >
> > return ((int64_t)a * b + 0x8000) >> 16;
> >
> > (it's the compiler's job to optimize this, and if it fails such code
> > should be in ASM under an appropriate #ifdef; the above may be optimal
> > for ARM but certainly not for every CPU)
> >
> > also it's probably worth trying to do this without correct rounding ...
> 
> I've adopted your rounding method, it's much better.
> 
> I don't understand why you say you would rewrite something in C, that
> does what you want, into assembler.
> One of C's great strengths (in my opinion) is that often it lets you
> express low level details _without_ resorting to assembler.

it's not about expressing details but about speed, hand written asm
simply tends to be faster than what gcc generates. you can surely
tweak the c code until gcc generates equally optimal code, but
as soon as you upgrade to the next version of gcc you might have
to redo this. OTOH with asm gcc can't mess up, it has no choice but to
output the code it should

the mess gcc generates out of some of the multiply constructs below is a good
example; the next version of gcc could as well behave completely differently


> 
> This code isn't aimed specifically at ARM, but it was written with
> a 32 bit CPU in mind, I'll admit. That's why it uses two 16x16->32bit
> multiplications, instead of a 64bit one.
> 
> 
> Since you mentioned the optimizer, and suggested an alternative,
> I spent the time to investigate a bit, do some timings and look at
> the assembler produced on x86 and ARM.
> 
> Timings: decode a 273s sample with ffmpeg,
> 
> On x86  -  64bit: 3.8s,  modified64bit: 3.3s,  32bit:  2.3s.
> On ARM  -  64bit: 63.4s,  32bit: 34.0s.
> 
> Looking at the assembler listing,
> On x86:
>   The 64bit version uses a single 64x64->64 multiply.
>   With a little help (voodoo!) gcc can be encouraged to see that
>     the 64bit version only needs a 32x32->64 multiply.
>   The original uses two 16x16->32 multiplies.
> 
> On ARM:
> >   The 64bit version calls __muldi3(), a 64x64->64 multiply (?).
>   The original uses two 32x32->32 multiplies.
> 
> 
> For the moment I've left all three versions in the code, in case anyone
> else wants to look at this.
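For anyone comparing the variants: the 64-bit rounding version and the split 16-bit version compute the same result, since a*b = hb*2^16 + lb and the low-half rounding carry is exactly the +0x8000. A sketch demonstrating the equivalence (helper names are mine, not from the patch; the low product is cast to unsigned to avoid signed overflow):

```c
#include <stdint.h>
#include <assert.h>

/* 64-bit rounding multiply, as suggested in the review */
static inline int32_t mult_fixp_64(int32_t a, uint16_t b)
{
    return ((int64_t)a * b + 0x8000) >> 16;
}

/* the patch's split version: two 16x16->32 multiplies plus rounding */
static inline int32_t mult_fixp_32(int32_t a, uint16_t b)
{
    int32_t  hb = (a >> 16) * b;              /* high part, keeps the sign */
    uint32_t lb = (uint32_t)(a & 0xffff) * b; /* low part, unsigned */

    return hb + (lb >> 16) + ((lb & 0x8000) >> 15);
}
```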


[...]

> Michael Niedermayer writes:
> >> +/**
> >> + * Final conversion from floating point values to
> >> + * signed, 16 bit sound samples. Round and clip.
> >> + *
> >> + * @param q                 pointer to the COOKContext
> >> + * @param out               pointer to the output buffer
> >> + * @param chan              0: left or single channel, 1: right channel
> >> + */
> >> +static inline void output_math(COOKContext *q, int16_t *out, int chan)
> >> +{
> >> +    int j;
> >> +
> >> +    for (j = 0; j < q->samples_per_channel; j++) {
> >> +        out[chan + q->nb_channels * j] =
> >> +            av_clip(fixp_pow2(q->mono_mdct_output[j], -11), -32768, 32767);
> >
> > for(j = 0; j < q->samples_per_channel; j++) {
> >     int v= q->mono_mdct_output[j] >> 11;
> >     if(v+32768U > 65535U) v= (v>>31) ^ 0x7FFF;
> >     *out= v;
> >     out += q->nb_channels;
> > }
> 
> Done.
> (Dreadfully unreadable code though - maybe it should be wrapped up
>  in something like av_clip_s16()?)

no objection, a separate patch to add av_clip_s16() is welcome
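A sketch of what such a helper could look like, wrapping the branchless clip from the review (the name av_clip_s16 is Ian's suggestion, not an existing lavc function):

```c
#include <stdint.h>
#include <assert.h>

/* branchless clip to [-32768, 32767]: if v is out of range, v >> 31 is
   all-ones for negative v and 0 otherwise, so the xor yields the bound */
static inline int16_t av_clip_s16(int v)
{
    if ((unsigned)(v + 32768) > 65535U)
        v = (v >> 31) ^ 0x7FFF;
    return v;
}
```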


[...]

> +    /* Variables for fixed/float arithmetic routines */
> +    realvars_t          math;

comment isn't doxygen compatible
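for reference, doxygen only picks up member comments that start with /** (or trail the member as /**< ... */); a sketch, with a hypothetical stand-in for realvars_t:

```c
#include <assert.h>

typedef struct { int pad; } realvars_t;   /* hypothetical stand-in type */

typedef struct {
    /** Variables for fixed/float arithmetic routines */
    realvars_t math;
} ctx_sketch;
```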


[...]
> +/**
> + * Additional variables in COOKContext
> + * for fixed point routines
> + */
> +typedef struct {
> +    /* generatable tables */
> +    /**
> +     * Sine/cosine table.
> +     *  x_i = 2^16 sin(i 2pi/8192), 2^16 cos(i 2pi/8192); i=0..1024
> +     */
> +    FIXPU sincos_lookup[2050];

shouldn't this table be static instead of being duplicated in the context?


[...]
> +
> +#define STIN static inline

ugly


> +
> +typedef int32_t ogg_int32_t;

ogg_* is completely unacceptable unless this stays extremely similar to
the code from tremor, so that future versions could easily be dropped in;
but that would also make it impossible for us to optimize the code, so
considering
A. ugly code
   easy drop in
   no optimizations or big changes possible
B. clean code
   drop in for future tremor imdct hard
   optimizations and big changes possible

now i don't believe that there will be many improvements coming from xiph in
the future, so B seems like the more logical choice


> +
> +#define DATA_TYPE ogg_int32_t
> +#define REG_TYPE  register ogg_int32_t
> +#define LOOKUP_T const uint16_t
> +
> +static inline ogg_int32_t MULT32(ogg_int32_t x, ogg_int32_t y) {
> +  return fixp_mult_pow2(x, y, -1);
> +}

such wrappers are ugly, even more so with wrong indentation


> +
> +static inline ogg_int32_t MULT31(ogg_int32_t x, ogg_int32_t y) {
> +  return fixp_mult(x, y);
> +}
> +
> +/*
> + * This should be used as a memory barrier, forcing all cached values in
> + * registers to be written back to memory.  Might or might not be beneficial
> + * depending on the architecture and compiler.
> + */
> +#define MB()

useless nop


> +
> +/*
> + * The XPROD functions are meant to optimize the cross products found all
> + * over the place in mdct.c by forcing memory operation ordering to avoid
> + * unnecessary register reloads as soon as memory is being written to.
> + * However this is only beneficial on CPUs with a sane number of general
> + * purpose registers which exclude the Intel x86.  On Intel, better let the
> + * compiler actually reload registers directly from original memory by using
> + * macros.
> + */
> +
> +#ifdef __i386__
> +
> +#define XPROD32(_a, _b, _t, _v, _x, _y)		\
> +  { *(_x)=MULT32(_a,_t)+MULT32(_b,_v);		\
> +    *(_y)=MULT32(_b,_t)-MULT32(_a,_v); }
> +#define XPROD31(_a, _b, _t, _v, _x, _y)		\
> +  { *(_x)=MULT31(_a,_t)+MULT31(_b,_v);		\
> +    *(_y)=MULT31(_b,_t)-MULT31(_a,_v); }
> +#define XNPROD31(_a, _b, _t, _v, _x, _y)	\
> +  { *(_x)=MULT31(_a,_t)-MULT31(_b,_v);		\
> +    *(_y)=MULT31(_b,_t)+MULT31(_a,_v); }
> +
> +#else

tabs are forbidden in svn
__i386__ is wrong, there's ARCH_X86 (_32/_64)



> +
> +static inline void XPROD32(ogg_int32_t  a, ogg_int32_t  b,
> +			   ogg_int32_t  t, ogg_int32_t  v,
> +			   ogg_int32_t *x, ogg_int32_t *y)
> +{
> +  *x = MULT32(a, t) + MULT32(b, v);
> +  *y = MULT32(b, t) - MULT32(a, v);
> +}
> +
> +static inline void XPROD31(ogg_int32_t  a, ogg_int32_t  b,
> +			   ogg_int32_t  t, ogg_int32_t  v,
> +			   ogg_int32_t *x, ogg_int32_t *y)
> +{
> +  *x = MULT31(a, t) + MULT31(b, v);
> +  *y = MULT31(b, t) - MULT31(a, v);
> +}
> +
> +static inline void XNPROD31(ogg_int32_t  a, ogg_int32_t  b,
> +			    ogg_int32_t  t, ogg_int32_t  v,
> +			    ogg_int32_t *x, ogg_int32_t *y)
> +{
> +  *x = MULT31(a, t) - MULT31(b, v);
> +  *y = MULT31(b, t) + MULT31(a, v);
> +}
> +
> +#endif
> +
> +
> +/* 8 point butterfly (in place) */
> +STIN void mdct_butterfly_8(DATA_TYPE *x){
> +
> +  REG_TYPE r0   = x[4] + x[0];
> +  REG_TYPE r1   = x[4] - x[0];
> +  REG_TYPE r2   = x[5] + x[1];
> +  REG_TYPE r3   = x[5] - x[1];
> +  REG_TYPE r4   = x[6] + x[2];
> +  REG_TYPE r5   = x[6] - x[2];
> +  REG_TYPE r6   = x[7] + x[3];
> +  REG_TYPE r7   = x[7] - x[3];
> +
> +           x[0] = r5   + r3;
> +           x[1] = r7   - r1;
> +           x[2] = r5   - r3;
> +           x[3] = r7   + r1;
> +           x[4] = r4   - r0;
> +           x[5] = r6   - r2;
> +           x[6] = r4   + r0;
> +           x[7] = r6   + r2;
> +           MB();
> +}

a+b / a-b aka butterfly can be put into its own function/macro

the remainder of the mdct code can also likely be simplified a lot
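a sketch of such a helper (the macro name is mine):

```c
#include <assert.h>

/* hypothetical butterfly helper: writes a+b and a-b into sum/diff */
#define BUTTERFLY(sum, diff, a, b) do { \
    (sum)  = (a) + (b);                 \
    (diff) = (a) - (b);                 \
} while (0)
```
mdct_butterfly_8() above would then open with BUTTERFLY(r0, r1, x[4], x[0]); and so on, halving the line count.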


[...]
> +static inline int init_cook_math(COOKContext *q)
> +{
> +    FIXPU *const sincos_lookup = q->math.sincos_lookup;
> +    FIXP s = 0, c = 0x80000000; /* 0.0, -1.0 */
> +    uint16_t a = 0xc910;        /* 2^14 pi */
> +    int i = 0;
> +
> +    sincos_lookup[i++] = 0x0000;
> +    sincos_lookup[i++] = 0xffff;
> +
> +    while (i < 2050) {
> +        FIXP s2 = s + fixp_mult_pow2(c - fixp_mult_pow2(s, a, -11), a, -10);
> +        FIXP c2 = c - fixp_mult_pow2(s + fixp_mult_pow2(c, a, -11), a, -10);
> +
> +        s = s2;
> +        c = c2;

the c2 variable isn't needed
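the point being: c's update only reads the old s, so once the new sine value is parked in s2, c can be updated in place. A sketch of the pattern with plain doubles standing in for the FIXP ops (the constants are arbitrary, not the cook ones):

```c
#include <assert.h>

/* two-temporary rotation step, as in the patch */
static void step2(double *s, double *c, double k)
{
    double s2 = *s + k * (*c - 0.5 * k * *s);
    double c2 = *c - k * (*s + 0.5 * k * *c);
    *s = s2;
    *c = c2;
}

/* same step with the c2 temporary removed: the c update only depends on
   the old s, which is still live because the new s sits in s2 */
static void step1(double *s, double *c, double k)
{
    double s2 = *s + k * (*c - 0.5 * k * *s);
    *c = *c - k * (*s + 0.5 * k * *c);
    *s = s2;
}
```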


[...]
> +static void scalar_dequant_math(COOKContext *q, int index, int quant_index,
> +                                int* subband_coef_index,
> +                                int* subband_coef_sign, FIXP *mlt_p)
> +{
> +    /* Num. half bits to right shift */
> +    const int s = 33 - quant_index + av_log2(q->samples_per_channel);
> +    FIXP f1;
> +    int i;
> +
> +    if (s >= 64) {
> +        memset(mlt_p, 0, sizeof(FIXP) * SUBBAND_SIZE);
> +        return;
> +    }
> +
> +    for(i=0 ; i<SUBBAND_SIZE ; i++) {
> +        if (subband_coef_index[i]) {
> +            f1 = quant_centroid_tab[index][subband_coef_index[i]][s&1];
> +            if (subband_coef_sign[i]) f1 = -f1;
> +        } else {
> +            /* noise coding if subband_coef_index[i] == 0 */
> +            f1 = dither_tab[index][s&1];
> +            if (av_random(&q->random_state) < 0x80000000) f1 = -f1;
> +        }
> +        mlt_p[i] = fixp_shr(f1, s/2);
> +    }

wouldn't it be faster to do
 tab = &quant_centroid_tab[index][0][s&1];
outside the loop

and then use tab[subband_coef_index[i]][0]

(this and all other optimizations can of course be done as separate
patches; in many cases there is similar code for floats too, which might
benefit ...)
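the transform, demonstrated on a dummy table (names and sizes are illustrative, not the cook tables):

```c
#include <assert.h>

/* dummy stand-in for quant_centroid_tab, laid out [index][centroid][s&1] */
static const int dummy_tab[2][4][2] = {
    { {10, 11}, {20, 21}, {30, 31}, {40, 41} },
    { {50, 51}, {60, 61}, {70, 71}, {80, 81} },
};

/* hoisting the invariant index and s&1 out of the loop leaves a single
   stride-2 lookup per iteration; relies on the rows being contiguous */
static void check_hoist(int index, int s)
{
    const int *tab = &dummy_tab[index][0][s & 1];
    int i;

    for (i = 0; i < 4; i++)
        assert(tab[2 * i] == dummy_tab[index][i][s & 1]);
}
```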


> +}
> +
> +
> +/**
> + * the actual requantization of the timedomain samples
> + *
> + * @param q                 pointer to the COOKContext
> + * @param buffer            pointer to the timedomain buffer
> + * @param gain_index        index for the block multiplier
> + * @param gain_index_next   index for the next block multiplier
> + */
> +static inline void interpolate_math(COOKContext *q, FIXP* buffer,
> +                                    int gain_index, int gain_index_next)
> +{
> +    int gain_size_factor = q->samples_per_channel/8;
> +    int i;
> +
> +    if(gain_index == gain_index_next){              //static gain
> +        for(i = 0; i < gain_size_factor; i++) {
> +            buffer[i] = fixp_pow2(buffer[i], gain_index);
> +        }

is this speed critical? if yes,
if(gain_index < FOOBAR)
    for(i = 0; i < gain_size_factor; i++)
        ...
else
    for(...)
        ...

would be faster; if it's not speed critical just forget my comment
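i.e. hoist the shift direction out of the loop; a sketch assuming fixp_pow2(x, g) is x << g for g >= 0 and x >> -g otherwise (so FOOBAR above would be 0):

```c
#include <stdint.h>
#include <assert.h>

/* apply a constant power-of-two gain to a block, testing the shift
   direction once instead of inside fixp_pow2() for every sample */
static void apply_gain(int32_t *buffer, int n, int gain_index)
{
    int i;

    if (gain_index >= 0)
        for (i = 0; i < n; i++)
            buffer[i] <<= gain_index;
    else
        for (i = 0; i < n; i++)
            buffer[i] >>= -gain_index;
}
```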


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Let us carefully observe those good qualities wherein our enemies excel us
and endeavor to excel them, by avoiding what is faulty, and imitating what
is excellent in them. -- Plutarch


