[Ffmpeg-devel] Fixed point arithmetic RealAudio G2 (cook) decoder, v2

Wed Mar 21 21:59:13 CET 2007

Hi,

Here's the second iteration of my fixed point COOK decoder.

Many thanks to Michael and Benjamin for the code reviews.
I've updated the patch accordingly, here are some comments to the comments.

Michael Niedermayer writes:
>> +static const FIXPU sincos_lookup[2050] = {
>> +    /* x_i = 2^16 sin(i 2pi/8192), 2^16 cos(i 2pi/8192); i=0..1024 */
>
>i think such large tables should be generated instead of hardcoded

Done.

Michael Niedermayer writes:
>> diff -upN -x 'ff*' coef/cook_fixp_mdct.h fixpoint/cook_fixp_mdct.h
>> --- coef/cook_fixp_mdct.h        1970-01-01 01:00:00.000000000 +0100
>> +++ fixpoint/cook_fixp_mdct.h        2007-03-08 11:10:13.000000000 +0100
>
> adding a generic fixpoint mdct to lavc MUST be a seperate patch&mail&commit
> (this part also doesnt fall under benjamins maintainership but mine)

and Benjamin Larsson writes:
> The fixedpoint math
> operattions should be moved to some generic fixedpoint.h file for the
> possible use in other codecs. The fixedpoint mdct should be moved to
> mdct.h.

I don't see it as a generic fixpoint MDCT, at least not yet!
I made some choices when implementing the decoder, for example to use
32 bit signed integers for variables and 16 bit unsigneds for coefficients.
I borrowed the MDCT from Tremor and adapted it for this.
I also removed some coefficient interpolation used for very large transforms,
that COOK doesn't need.
Turned out to be OK here, but I don't know how suitable it would be for
anything else.

If you still want the fixpoint math + MDCT submitted as generics, I
would really appreciate some tips or pointers as to how it should be done.

Michael Niedermayer writes:
>> +/**
>> + * Fixed point multiply by power of two.
>> + *
>> + * @param x                     fix point value
>> + * @param i                     integer power-of-two, -31..+31
>> + */
>> +static inline FIXP fixp_pow2(FIXP x, int i)
>> +{
>> +  if (i < 0)
>> +    return (x >> -i) + ((x >> (-i-1)) & 1);
>> +  else
>> +    return x << i;              /* no check for overflow */
>> +}
>
> for left shift id simply use << not fixp_pow2()
>
> for right shift with rounding id use (x + (1<<(i-1)))>>i
> the advantage here is that if i is known this simplifies to (x + C)>>C2
> or is there any case where the direction of the shift is not known?

OK - I've re-ordered as you suggest to aid simplification when i is known,
and added a shift-right function for when the direction is known to be
negative.
There are also uses where the shift direction is not known in advance.

Michael Niedermayer writes:
>> +/**
>> + * Fixed point multiply by fraction.
>> + *
>> + * @param a                     fix point value
>> + * @param b                     fix point fraction, 0 <= b < 1
>> + */
>> +static inline FIXP fixp_mult_su(FIXP a, FIXPU b)
>> +{
>> +    int32_t hb = (a >> 16) * b;
>> +    uint32_t lb = (a & 0xffff) * b;
>> +
>> +    return hb + (lb >> 16) + ((lb & 0x8000) >> 15);
>
> return ((int64_t)a * b + 0x8000) >> 16;
>
> (its the compiers job to optimize this and if it fails such code
> should be in ASM under appropriate #ifdef, the above may be optimal
> for ARM but certainly not for every CPU)
>
> also its probably worth trying to do this without correct rounding ...

I've adopted your rounding method, it's much better.

I don't understand why you say you would rewrite something in C, that
does what you want, into assembler.
One of C's great strengths (in my opinion) is that often it lets you
express low level details _without_ resorting to assembler.

This code isn't aimed specifically at ARM, but it was written with
a 32 bit CPU in mind, I'll admit. That's why it uses two 16x16->32bit
multiplications, instead of a 64bit one.

Since you mentioned the optimizer, and suggested an alternative,
I spent the time to investigate a bit, do some timings and look at
the assembler produced on x86 and ARM.

Timings: decode a 273s sample with ffmpeg,

On x86  -  64bit: 3.8s,  modified64bit: 3.3s,  32bit:  2.3s.
On ARM  -  64bit: 63.4s,  32bit: 34.0s.

Looking at the assembler listing,
On x86:
  The 64bit version uses a single 64x64->64 multiply.
  With a little help (voodoo!) gcc can be encouraged to see that
    the 64bit version only needs a 32x32->64 multiply.
  The original uses two 16x16->32 multiplies.

On ARM:
  The 64bit version calls _muldi3(), a 64x64->64 multiply (?).
  The original uses two 32x32->32 multiplies.

For the moment I've left all three versions in the code, in case anyone
else wants to look at this.

Michael Niedermayer writes:
>> +/**
>> + * The real requantization of the mltcoefs
>> + *
>> + * @param q                     pointer to the COOKContext
>> + * @param index                 index
>> + * @param quant_index           quantisation index for this band
>> + * @param subband_coef_index    array of indexes to quant_centroid_tab
>> + * @param subband_coef_sign     use random noise instead of predetermined value
>> + * @param mlt_ptr               pointer to the mlt coefficients
>> + */
>> +static void scalar_dequant_math(COOKContext *q, int index,
>> +                                int quant_index, int* subband_coef_index,
>> +                                int* subband_coef_sign, REAL_T *mlt_p)
>> +{
>> +    /* Num. half bits to right shift */
>> +    const int s = 33 - quant_index + av_log2(q->samples_per_channel);
>> +    const FIXP *table = quant_tables[s & 1][index];
>> +    FIXP f;
>> +    int i;
>> +
>> +    for(i=0 ; i<SUBBAND_SIZE ; i++) {
>> +        f = table[subband_coef_index[i]];
>> +        /* noise coding if subband_coef_index[i] == 0 */
>> +        if (((subband_coef_index[i] == 0) && cook_random(q)) ||
>> +            ((subband_coef_index[i] != 0) && subband_coef_sign[i]))
>> +            f = -f;
>> +        mlt_p[i] = (s >= 64) ? 0 : fixp_pow2(f, -(s/2));
>> +    }
>
> if(s>=64){
>     memset(mlt_p, 0, ...)
>     return;
> }
> assuming this case is possible at all

Done. (And yes - it does happen in most samples.)

Michael Niedermayer writes:
>> +/**
>> + * the actual requantization of the timedomain samples
>> + *
>> + * @param q                 pointer to the COOKContext
>> + * @param buffer            pointer to the timedomain buffer
>> + * @param gain_index        index for the block multiplier
>> + * @param gain_index_next   index for the next block multiplier
>> + */
>> +static inline void
>> +interpolate_math(COOKContext *q, FIXP* buffer,
>> +                 int gain_index, int gain_index_next)
>> +{
>> +    int i;
>> +    int gain_size_factor = q->samples_per_channel / 8;
>> +
>> +    if(gain_index == gain_index_next){              //static gain
>> +        for(i = 0; i < gain_size_factor; i++) {
>> +            buffer[i] = fixp_pow2(buffer[i], gain_index);
>> +        }
>> +    } else {                                        //smooth gain
>> +        int step = (gain_index_next - gain_index)
>> +                   << (7 - av_log2(gain_size_factor));
>> +        int x = 0;
>> +
>> +        for(i = 0; i < gain_size_factor; i++) {
>> +            buffer[i] = fixp_mult_su(buffer[i], pow128_tab[x]);
>> +            buffer[i] = fixp_pow2(buffer[i], gain_index+1);
>> +
>> +            x += step;
>> +            gain_index += (x + 128) / 128 - 1;
>> +            x = (x + 128) % 128;
>> +        }
>
> gain_index += x>>7;
> x&= 127

Done.

>also fixp_mult_su() already does a shift so the fixp_pow2 can maybe
>be merged into that

Done.

Michael Niedermayer writes:
>> +/**
>> + * Final converion from floating point values to
>> + * signed, 16 bit sound samples. Round and clip.
>> + *
>> + * @param q                 pointer to the COOKContext
>> + * @param out               pointer to the output buffer
>> + * @param chan              0: left or single channel, 1: right channel
>> + */
>> +static inline void output_math(COOKContext *q, int16_t *out, int chan)
>> +{
>> +    int j;
>> +
>> +    for (j = 0; j < q->samples_per_channel; j++) {
>> +        out[chan + q->nb_channels * j] > +
> av_clip(fixp_pow2(q->mono_mdct_output[j], -11), -32768, 32767);
>
> for(j = 0; j < q->samples_per_channel; j++) {
>     int v= q->mono_mdct_output[j] >> 11;
>     if(v+32768U > 65535U) v= (v>>31) ^ 0x7FFF;
>     *out= v;
>     out += q->nb_channels;
> }

Done.
(Dreadfully unreadable code though - maybe it should be wrapped up
 in something like av_clip_s16()?)

Michael Niedermayer writes:
>> +/**
>> + * Final converion from floating point values to
>> + * signed, 16 bit sound samples. Round and clip.
>> + *
>> + * @param q                 pointer to the COOKContext
>> + * @param out               pointer to the output buffer
>> + * @param chan              0: left or single channel, 1: right channel
>> + */
>> +static inline void output_math(COOKContext *q, int16_t *out, int chan)
>> +{
>> +    int j;
>> +
>> +    for (j = 0; j < q->samples_per_channel; j++) {
>> +        out[chan + q->nb_channels * j] =
>> +          av_clip(lrintf(q->mono_mdct_output[j]), -32768, 32767);
>> +    }
>> +}
>
> DSPContext.float_to_int16()

Quite possibly -- I'd rather keep unrelated improvements out of this
fixed point patch, though. Also, I'm not familiar enough with the DSP
functions to just do this in passing.
I've put a comment in the code so the idea doesn't get lost.

Benjamin Larsson writes:
> Imo the float and fixedpoint code and tables
> can share the same file with the help of defines.

Yes, I've done this for the tables.
Both "cookdata_fixpoint.h" and "cookdata_float.h" now set up some
defines and then include "cookdata.h".
Originally I was thinking the fixed point version should be completely
float-free. With this, there's some float conversion at compile time,
but the source is much simpler.

I'm not so convinced about doing the same for the code.

Michael Niedermayer writes:
> this should have been diffed against cook.c, it cannot be reviewed otherwise
> as the code is from cook.c
>
> [not reviewed code]

Yes, I can see that wasn't very clever.

In this patch, then, there's

"cook.c"           diff
"cookdata.h"       diff
"cook_float.h"     diffed against "cook.c"
"cookdata_float.h" diffed against "cook.c"
                   (only about 8 lines in common though)
"cook_fixed.h"     new file
"cook_fixp_mdct.h" new file
"cookdata_fixed.h" new file

 cook.c              |  254 ++---------
 cook_fixp_mdct.h    |  551 +++++++++++++++++++++++++
 cook_fixpoint.h     |  301 +++++++++++++
 cook_float.h        | 1065 +++----------------------------------------------
 cookdata.h          |  125 +++--
 cookdata_fixpoint.h |   82 +++
 cookdata_float.h    | 1126 ----------------------------------------------------
 7 files changed, 1151 insertions(+), 2353 deletions(-)

I considered diff'ing "cook_fixed.h" against "cook_float.h" and
"cookdata_fixed.h" against "cookdata_float.h", since they have
similar structures and some shared comments.
Not really shared code, though, and the diffs are pretty unreadable.

Regards,
Ian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fixp2.patch
Type: text/x-patch
Size: 129549 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070321/bb86a11a/attachment.bin>