[FFmpeg-devel] [PATCH] Split reading and decoding of blocks in ALS

Michael Niedermayer michaelni
Wed Nov 25 17:02:15 CET 2009


On Wed, Nov 25, 2009 at 01:26:57PM +0100, Thilo Borgmann wrote:
> Michael Niedermayer wrote:
> > On Mon, Nov 23, 2009 at 06:24:00PM -0500, Justin Ruggles wrote:
> >> Thilo Borgmann wrote:
> >>
> >>> Michael Niedermayer wrote:
> >>>> On Mon, Nov 23, 2009 at 11:28:35AM +0100, Thilo Borgmann wrote:
> >>>>> Michael Niedermayer wrote:
> >>>>>> On Sat, Nov 14, 2009 at 02:15:30PM +0100, Thilo Borgmann wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> in order to support multi-channel correlation, reading and decoding of a
> >>>>>>> block have to be separated. This patch introduces the ALSBlockData struct
> >>>>>>> for that purpose.
> >>>>>> [...]
> >>>>>>> +/** Decodes the block data for a constant block
> >>>>>>> + */
> >>>>>>> +static void decode_const_block_data(ALSDecContext *ctx, ALSBlockData *bd)
> >>>>>>> +{
> >>>>>>> +    int smp;
> >>>>>>> +
> >>>>>>>      // write raw samples into buffer
> >>>>>>> -    for (k = 0; k < block_length; k++)
> >>>>>>> -        raw_samples[k] = const_val;
> >>>>>>> +    for (smp = 0; smp < bd->block_length; smp++)
> >>>>>>> +        bd->raw_samples[smp] = bd->const_val;
> >>>>>>>  }
> >>>>>> memset32(dst, val, len){
> >>>>>> }
> >>>>>> would possibly be faster due to not doing bd-> in the loop
> >>>>> Ok will do...
> >>>>>
> >>>>>
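(For reference, a minimal sketch of such a memset32 helper -- the name, the
int32_t types and the exact placement are just illustrative:)

    #include <stdint.h>

    static inline void memset32(int32_t *dst, int32_t val, int len)
    {
        // plain locals only, no struct dereference inside the hot loop
        while (len--)
            *dst++ = val;
    }

    // decode_const_block_data() would then shrink to a single call:
    // memset32(bd->raw_samples, bd->const_val, bd->block_length);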
> >>>>>> [...]
> >>>>>>> @@ -553,115 +579,132 @@
> >>>>>>>      }
> >>>>>>>  
> >>>>>>>      if (get_bits1(gb))
> >>>>>>> -        *shift_lsbs = get_bits(gb, 4) + 1;
> >>>>>>> +        bd->shift_lsbs = get_bits(gb, 4) + 1;
> >>>>>>>  
> >>>>>>> -    store_prev_samples = (*js_blocks && raw_other) || *shift_lsbs;
> >>>>>>> +    bd->store_prev_samples = (bd->js_blocks && bd->raw_other) || bd->shift_lsbs;
> >>>>>>>  
> >>>>>>> -
> >>>>>>>      if (!sconf->rlslms) {
> >>>>>>>          if (sconf->adapt_order) {
> >>>>>>> -            int opt_order_length = av_ceil_log2(av_clip((block_length >> 3) - 1,
> >>>>>>> +            int opt_order_length = av_ceil_log2(av_clip((bd->block_length >> 3) - 1,
> >>>>>>>                                                  2, sconf->max_order + 1));
> >>>>>>> -            opt_order            = get_bits(gb, opt_order_length);
> >>>>>>> +            bd->opt_order        = get_bits(gb, opt_order_length);
> >>>>>>>          } else {
> >>>>>>> -            opt_order = sconf->max_order;
> >>>>>>> +            bd->opt_order = sconf->max_order;
> >>>>>>>          }
> >>>>>>>  
> >>>>>>> -        if (opt_order) {
> >>>>>>> +        if (bd->opt_order) {
> >>>>>>>              int add_base;
> >>>>>>>  
> >>>>>>>              if (sconf->coef_table == 3) {
> >>>>>>>                  add_base = 0x7F;
> >>>>>>>  
> >>>>>>>                  // read coefficient 0
> >>>>>>> -                quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>> +                bd->quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>>  
> >>>>>>>                  // read coefficient 1
> >>>>>>> -                if (opt_order > 1)
> >>>>>>> -                    quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>> +                if (bd->opt_order > 1)
> >>>>>>> +                    bd->quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>>  
> >>>>>>>                  // read coefficients 2 to opt_order
> >>>>>>> -                for (k = 2; k < opt_order; k++)
> >>>>>>> -                    quant_cof[k] = get_bits(gb, 7);
> >>>>>>> +                for (k = 2; k < bd->opt_order; k++)
> >>>>>>> +                    bd->quant_cof[k] = get_bits(gb, 7);
> >>>>>>>              } else {
> >>>>>>>                  int k_max;
> >>>>>>>                  add_base = 1;
> >>>>>>>  
> >>>>>>>                  // read coefficient 0 to 19
> >>>>>>> -                k_max = FFMIN(opt_order, 20);
> >>>>>>> +                k_max = FFMIN(bd->opt_order, 20);
> >>>>>>>                  for (k = 0; k < k_max; k++) {
> >>>>>>>                      int rice_param = parcor_rice_table[sconf->coef_table][k][1];
> >>>>>>>                      int offset     = parcor_rice_table[sconf->coef_table][k][0];
> >>>>>>> -                    quant_cof[k] = decode_rice(gb, rice_param) + offset;
> >>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, rice_param) + offset;
> >>>>>>>                  }
> >>>>>>>  
> >>>>>>>                  // read coefficients 20 to 126
> >>>>>>> -                k_max = FFMIN(opt_order, 127);
> >>>>>>> +                k_max = FFMIN(bd->opt_order, 127);
> >>>>>>>                  for (; k < k_max; k++)
> >>>>>>> -                    quant_cof[k] = decode_rice(gb, 2) + (k & 1);
> >>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, 2) + (k & 1);
> >>>>>>>  
> >>>>>>>                  // read coefficients 127 to opt_order
> >>>>>>> -                for (; k < opt_order; k++)
> >>>>>>> -                    quant_cof[k] = decode_rice(gb, 1);
> >>>>>>> +                for (; k < bd->opt_order; k++)
> >>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, 1);
> >>>>>>>  
> >>>>>>> -                quant_cof[0] = 32 * parcor_scaled_values[quant_cof[0] + 64];
> >>>>>>> +                bd->quant_cof[0] = 32 * parcor_scaled_values[bd->quant_cof[0] + 64];
> >>>>>>>  
> >>>>>>> -                if (opt_order > 1)
> >>>>>>> -                    quant_cof[1] = -32 * parcor_scaled_values[quant_cof[1] + 64];
> >>>>>>> +                if (bd->opt_order > 1)
> >>>>>>> +                    bd->quant_cof[1] = -32 * parcor_scaled_values[bd->quant_cof[1] + 64];
> >>>>>>>              }
> >>>>>>>  
> >>>>>>> -            for (k = 2; k < opt_order; k++)
> >>>>>>> -                quant_cof[k] = (quant_cof[k] << 14) + (add_base << 13);
> >>>>>>> +            for (k = 2; k < bd->opt_order; k++)
> >>>>>>> +                bd->quant_cof[k] = (bd->quant_cof[k] << 14) + (add_base << 13);
> >>>>>> it might make sense to have commonly used variables like opt_order on the
> >>>>>> stack instead of just accessible through a pointer
> >>>>> Yes.
> >>>>>
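(A hedged sketch of what keeping such values on the stack could look like for
the coefficient loop above -- purely illustrative, with k and add_base as in
the surrounding code and the types taken from the patch context:)

    // copy frequently accessed fields into locals once, outside the loops
    int      opt_order = bd->opt_order;
    int32_t *quant_cof = bd->quant_cof;
    int      k;

    for (k = 2; k < opt_order; k++)
        quant_cof[k] = (quant_cof[k] << 14) + (add_base << 13);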
> >>>>>> besides does this patch lead to any slowdown?
> >>>>> The benchmark results showed a little slowdown, but one that is hardly
> >>>>> noticeable on the command line...
> >>>>> We had this issue on the soc list. It ended up with me providing some
> >>>>> assembler file grep results for checking the inlining done by the compiler.
> >>>> was the slowdown fixed or not?
> >>>> "not" is bad
> >>> It was not. You blamed the many bd-> dereferences as a reason for the
> >>> slowdown and requested a check for wrong inlining before considering
> >>> other possibilities to make this faster. I'm not able to tell much about
> >>> the inlining from the assembler files myself, so I provided the grep
> >>> result you asked for.
> >>>
> >>> See:
> >>> https://lists.mplayerhq.hu/pipermail/ffmpeg-soc/2009-October/008364.html
> >>>
> >>> I'm ready to fix that slowdown issue; all I need is a little advice
> >>> about how to... the obvious approach of using local variables to store
> >>> bd->something, removing many dereferences that way, did not yield a
> >>> significant reduction of the slowdown.
> >> grepping for "call" and removing duplicates gives:
> >>
> >> combined:
> >>         call    _parse_bs_info
> >>         call    _decode_end
> >>         call    _read_block_data
> >>         call    _zero_remaining
> >>
> >> separate:
> >>         call    _parse_bs_info
> >>         call    _decode_end
> >>         call    _decode_var_block_data
> >>         call    _zero_remaining
> >>         call    _read_var_block_data
> >>         call    _decode_blocks_ind
> >>
> >> It seems that decode_blocks_ind is inlined in the combined version but
> >> not in the separate version.
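(For reference, a hedged sketch of how such a listing can be regenerated; the
exact invocation depends on the build setup, and $FFMPEG_CFLAGS here is just a
placeholder for the usual flags and include paths:)

    # produce an assembler listing of the compiled file, then list call targets
    gcc $FFMPEG_CFLAGS -S -o alsdec.s libavcodec/alsdec.c
    grep 'call' alsdec.s | sort -u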
> 
> Thanks for that!
> 
> > 
> > so, does adding always_inline help?
> > (note, expect that adding this to one function will make gcc stop
> >  inlining other unrelated functions so you might need many always_inline
> >  to force gcc not to be silly)
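(A hedged sketch of what that could look like, using the constant-block
function from the patch as the example; av_always_inline is FFmpeg's portable
wrapper around gcc's always_inline attribute:)

    // force gcc to inline this helper into its callers
    static av_always_inline void decode_const_block_data(ALSDecContext *ctx,
                                                          ALSBlockData *bd)
    {
        int smp;

        // write raw samples into buffer
        for (smp = 0; smp < bd->block_length; smp++)
            bd->raw_samples[smp] = bd->const_val;
    }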
> 
> I've redone all the benchmarking and attached the results in a text
> file. The benchmark measures the time spent in the call to read_frame_data()
> in decode_frame(). I've listed the functions marked av_always_inline, the
> benchmark results and the grep result for calls in the assembler code.
> All done using gcc 4.0.
> 
> Unfortunately, using av_always_inline for many functions (so that the
> same calls remain within alsdec.c) reveals many other functions that are
> no longer inlined, and therefore does not seem to reduce the slowdown. And
> as you said, just av_always_inline'ing decode_blocks_ind causes many other
> functions to no longer be inlined...
> 
> Anything else I can test?

There are also various command-line switches for gcc that control inlining.
You could report this to the gcc developers; I am not sure what that would
lead to, but I guess it might be worth a try.
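(For example -- the exact switches and values are just guesses, check the gcc
manual for the version in use; -Winline at least reports functions declared
inline that gcc refused to inline:)

    gcc -O3 -Winline -finline-limit=1200 --param max-inline-insns-single=1200 -c alsdec.c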

What is certain, though, is that a code split / factorization that reduces
speed by 10% is rejected.

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

While the State exists there can be no freedom; when there is freedom there
will be no State. -- Vladimir Lenin