[FFmpeg-devel] [PATCH] WMA Voice decoder

Thu Jan 21 00:31:27 CET 2010

"Ronald S. Bultje" <rsbultje at gmail.com> writes:

> Hi,
>
> 2010/1/20 M?ns Rullg?rd <mans at mansr.com>:
>> "Ronald S. Bultje" <rsbultje at gmail.com> writes:
>>> +static const struct frame_type_desc {
>>> + ? ?short ? n_blocks, ? ? ///< amount of blocks per frame
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< (each block contains 160/#n_blocks samples)
>>> + ? ? ? ? ? ?acb_type, ? ? ///< Adaptive codebook type in frame/block:
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< - 0: fixed codebook with per-block/frame gain,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< - 1: adaptive codebook with per-frame pitch,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< - 2: adaptive codebook with per-block pitch
>>> + ? ? ? ? ? ?fcb_type, ? ? ///< Fixed codebook type in frame/block:
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< - ? 0: hardcoded codebook, per-frame gain,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< - ? 1: hardcoded codebook, per-block gain,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< - ? 2: pitch-adaptive window (AW) pulse signal,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< - 4-6: innovation (fixed) codebook pulses
>>> + ? ? ? ? ? ?dbl_pulses, ? ///< how many pulse vectors have pulse pairs
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< (rather than just one single pulse)
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< only if #fcb_type >= 4 && <= 6
>>> + ? ? ? ? ? ?frame_size; ? ///< the amount of bits that make up the block
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ?///< data (per frame)
>>> +} frame_descs[17] = {
>>> + ? ?{ 1, 0, 0, 0, ? 0 },
>>> + ? ?{ 2, 0, 1, 0, ?28 },
>>> + ? ?{ 2, 1, 2, 0, ?46 },
>>> + ? ?{ 2, 1, 4, 2, ?80 },
>>> + ? ?{ 2, 1, 4, 5, 104 },
>>> + ? ?{ 4, 1, 5, 0, 108 },
>>> + ? ?{ 4, 1, 5, 2, 132 },
>>> + ? ?{ 4, 1, 5, 5, 168 },
>>> + ? ?{ 2, 2, 4, 0, ?64 },
>>> + ? ?{ 2, 2, 4, 2, ?80 },
>>> + ? ?{ 2, 2, 4, 5, 104 },
>>> + ? ?{ 4, 2, 5, 0, 108 },
>>> + ? ?{ 4, 2, 5, 2, 132 },
>>> + ? ?{ 4, 2, 5, 5, 168 },
>>> + ? ?{ 8, 2, 6, 0, 176 },
>>> + ? ?{ 8, 2, 6, 2, 208 },
>>> + ? ?{ 8, 2, 6, 5, 256 }
>>> +};
>>
>> I suggest splitting the struct declaration from the frame_descs[]
>> definition. ?What you have there looks a little odd to me, though it
>> will of course work correctly.
>
> So the reason I did it this way is because other than this one use,
> the structure isn't really very meaningful, and this way I can
> doxument both struct and variable at the same time.

If it makes life simpler, fine with me.  I just thought it looked odd.

>>> +static int
>>> +get_vbm_bits(GetBitContext *gb)
>>> +{
>>> + ? ?int n, res;
>>> +
>>> + ? ?for (n = 0; ; n++) {
>>> + ? ? ? ?res = get_bits(gb, 2);
>>> + ? ? ? ?if (res < 3 || n == 6 /** don't increase n to 7 */)
>>> + ? ? ? ? ? ?break;
>>> + ? ?}
>>> +
>>> + ? ?return 3 * n + res;
>>> +}
>>
>> Is this called a lot? ?If yes, it can be optimised.
>
> Once per frame (160 samples), so not really.

Damn, bit-hacking is fun. ;-)

>>> +static void
>>> +dequant_lsps(double *lsps, int num,
>>> + ? ? ? ? ? ? const uint16_t *values, const uint16_t *sizes, int n_stages,
>>> + ? ? ? ? ? ? const uint8_t *table, const double *mul_q, const double *base_q)
>>> +{
>>> + ? ?int n, m;
>>> +
>>> + ? ?for (n = 0; n < num; n++) lsps[n] = 0.0;
>>> + ? ?for (n = 0; n < n_stages; table += sizes[n++] * num) {
>>> + ? ? ? ?for (m = 0; m < num; m++)
>>> + ? ? ? ? ? ?lsps[m] += base_q[n] + mul_q[n] * table[m + values[n] * num];
>>> + ? ?}
>>> +}
>>
>> Does this really need double precision? ?Certainly the inputs could be
>> single-precision even if the output needs more. ?The input values it
>> is called with look like single should be enough.
>
> This is the "some functions take double" thing. I haven't tested these
> small arrays with double vs. float because they are small. My
> hypothesis would be that it makes no difference, and I in fact think
> we should convert all these functions (including lsp2lpc) to float.

I think we should do that sooner rather than later then.

>> Another thing worth trying is storing the tables as the top 16 bits of
>> floats directly (the low 16 bits are all zeros for values up to 256).
>> While this doubles the table size, it avoids the int to float
>> conversion step.
>
> You mean "table" as variable right? It's HUGE, I mean HUGE, so I
> prefer to keep as uint8_t. It probably takes about 90% of
> wmavoice_data.h. I agre this is suboptimal, but don't really think
> there's an easy way around.

Maybe using the 8-bit values to index a table of 0..255 as float (or
half-float) would be faster.  But we can leave that for later.

>> This might use a bit of simd too.
>>
>> Is there any pattern in those tables that could be exploited?
>
> I've gotten rid of about as much as I could see already (the data
> tables in the binary are quite a bit bigger), but if anyone else sees
> more that I can get rid of, I'd love to hear it.

Shame.  Huge tables seem to be a hallmark of bad codec designs...

>>> + ? ? ? ? ? ? ? ?for (n = 0; n < size; n++)
>>> + ? ? ? ? ? ? ? ? ? ?excitation[n] = excitation[n - block_pitch];
>>
>> What are possible values for block_pitch?
>
> A typical value would be 40. I think you're trying to see if I could
> use memmove(), and no I tried but that fails because of the
> "non-destructiveness" of it (the destructiveness, or rather
> continuation, of the signal is intentional). As you'd expect, memcpy()
> has the same issue, which is why I kept the manual loop in place.
>
>>> + ? ?/** convert interpolated LSPs to LPCs */
>>> + ? ?fac = (block_idx + 0.5) / frame_desc->n_blocks;
>>> + ? ?for (n = 0; n < s->lsps; n++) ///< LSF -> LSP
>>> + ? ? ? ?i_lsps[n] = cos(prev_lsps[n] + fac * (lsps[n] - prev_lsps[n]));
>>
>> Calling trig functions is generally bad. ?Perhaps it can't be avoided
>> here though.
>
> You can't really prevent that, I'm affraid, or else things like
> ensuring min-distance wouldn't work as intended.

Too bad.  We should change it all to floats and use cosf() though.

>>> + ? ? ? ? ? ? ? ?pitch[n] = (fac ? ? ? ? ? ? ? ? * s->cur_pitch_val +
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?(n_blocks_x2 - fac) * s->last_pitch_val +
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?frame_descs[bd_idx].n_blocks) / n_blocks_x2;
>>
>> These could use MUL16/MAC16.
>
> Stupid question: why? (I seem to recall Michael claiming at some point
> that he didn't want functions / macros that would basically do
> something like this, I think he literally gave multiplication as an
> example, and lo-and-behold, there's MUL16! - so why is this
> preferrably over a generic multiplication?)

The operands here are restricted to 16 bits, but gcc has no way of
knowing that.  16x16 multiplications are often twice as fast as
32x32.  Using int16_t is not appropriate either, since then gcc would
be constantly masking the value to ensure it stays within 16 bits even
though we know it will never overflow.  Until compilers acquire some
means of providing such hints (and they make use of them), we're stuck
with macros like these.

>>> + ? ? ? ?synth_block(ctx, gb, n, block_nsamples, bl_pitch_sh2,
>>> + ? ? ? ? ? ? ? ? ? ?lsps, prev_lsps, &frame_descs[bd_idx],
>>> + ? ? ? ? ? ? ? ? ? ?&excitation[n * block_nsamples],
>>> + ? ? ? ? ? ? ? ? ? ?&synth[n * block_nsamples]);
>>> + ? ?}
>>
>> Does gcc split this loop in two, one for acb_type==2 and one for !=2?
>> If not, consider doing it manually. ?That if() block is big that it
>> probably makes a difference.
>
> gcc inlines all of these into a massive synth_superframe(). Since
> unlooping it would essentially result in duplicating the contents of
> synth_block into this massive inlined func (which doesn't occur), I
> think I'll conclude that it doesn't.
>
> But if it really inlines it, like ALL of it, is duplicating it such
> a good idea?

I suspect *not* inlining it and duplicating the loop will be faster.
Well have to benchmark both and compare.  That can wait, of course.

>>> +static void copy_bits(PutBitContext *pb,
>>> + ? ? ? ? ? ? ? ? ? ? ?const uint8_t *data, int size,
>>> + ? ? ? ? ? ? ? ? ? ? ?GetBitContext *gb, int nbits)
>>> +{
>>> + ? ?int rmn_bytes, rmn_bits;
>>> +
>>> + ? ?rmn_bits = rmn_bytes = get_bits_left(gb);
>>> + ? ?if (rmn_bits < nbits)
>>> + ? ? ? ?return;
>>> + ? ?rmn_bits &= 7; rmn_bytes >>= 3;
>>> + ? ?if ((rmn_bits = FFMIN(rmn_bits, nbits)) > 0)
>>> + ? ? ? ?put_bits(pb, rmn_bits, get_bits(gb, rmn_bits));
>>> + ? ?ff_copy_bits(pb, data + size - rmn_bytes,
>>> + ? ? ? ? ? ? ? ? FFMIN(nbits - rmn_bits, rmn_bytes << 3));
>>> +}
>>
>> This should probably go somewhere more generic.
>
> Yes, in wma.c, because wmapro & wma1/2 also use it. I was hoping
> nobody would notice and I could do that after the initial commit, but
> I guess I'm doomed. Damn it. Pretty please?

I see nothing wma-specific about that.  I think it should go right
next to ff_copy_bits with a suitable name.

>>> +/**
>>> + * @file libavcodec/wmavoice_data.h
>>> + * @brief Windows Media Voice (WMAVoice) tables
>>> + * @author Ronald S. Bultje <rsbultje at gmail.com>
>>> + */
>>
>> These tables are huge. ?Is there any structure in them that could be
>> used to reduce the size?
>
> Vitor and I are seeing if we can use generic sinc window tables for
> the ipol tables at the end, but other than that, 90% of this is dq
> tables and I'm affraid they're about as small as possible already...

OK, just checking, with the AAC codebooks in fresh memory...

-- 
M?ns Rullg?rd
mans at mansr.com