[FFmpeg-devel] [PATCH] Split reading and decoding of blocks in ALS

Michael Niedermayer michaelni
Thu Nov 26 18:00:51 CET 2009


On Thu, Nov 26, 2009 at 10:40:56AM +0100, Thilo Borgmann wrote:
> Michael Niedermayer schrieb:
> > On Wed, Nov 25, 2009 at 01:26:57PM +0100, Thilo Borgmann wrote:
> >> Michael Niedermayer schrieb:
> >>> On Mon, Nov 23, 2009 at 06:24:00PM -0500, Justin Ruggles wrote:
> >>>> Thilo Borgmann wrote:
> >>>>
> >>>>> Michael Niedermayer schrieb:
> >>>>>> On Mon, Nov 23, 2009 at 11:28:35AM +0100, Thilo Borgmann wrote:
> >>>>>>> Michael Niedermayer schrieb:
> >>>>>>>> On Sat, Nov 14, 2009 at 02:15:30PM +0100, Thilo Borgmann wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> in order to support multi-channel correlation, reading and decoding of a
> >>>>>>>>> block has to be separated. This patch introduces ALSBockData struct for
> >>>>>>>>> that purpose.
> >>>>>>>> [...]
> >>>>>>>>> +/** Decodes the block data for a constant block
> >>>>>>>>> + */
> >>>>>>>>> +static void decode_const_block_data(ALSDecContext *ctx, ALSBlockData *bd)
> >>>>>>>>> +{
> >>>>>>>>> +    int smp;
> >>>>>>>>> +
> >>>>>>>>>      // write raw samples into buffer
> >>>>>>>>> -    for (k = 0; k < block_length; k++)
> >>>>>>>>> -        raw_samples[k] = const_val;
> >>>>>>>>> +    for (smp = 0; smp < bd->block_length; smp++)
> >>>>>>>>> +        bd->raw_samples[smp] = bd->const_val;
> >>>>>>>>>  }
> >>>>>>>> memset32(dst, val, len){
> >>>>>>>> }
> >>>>>>>> would possibly be faster due to not doing bd-> in the loop
> >>>>>>> Ok will do...
> >>>>>>>
> >>>>>>>
> >>>>>>>> [...]
> >>>>>>>>> @@ -553,115 +579,132 @@
> >>>>>>>>>      }
> >>>>>>>>>  
> >>>>>>>>>      if (get_bits1(gb))
> >>>>>>>>> -        *shift_lsbs = get_bits(gb, 4) + 1;
> >>>>>>>>> +        bd->shift_lsbs = get_bits(gb, 4) + 1;
> >>>>>>>>>  
> >>>>>>>>> -    store_prev_samples = (*js_blocks && raw_other) || *shift_lsbs;
> >>>>>>>>> +    bd->store_prev_samples = (bd->js_blocks && bd->raw_other) || bd->shift_lsbs;
> >>>>>>>>>  
> >>>>>>>>> -
> >>>>>>>>>      if (!sconf->rlslms) {
> >>>>>>>>>          if (sconf->adapt_order) {
> >>>>>>>>> -            int opt_order_length = av_ceil_log2(av_clip((block_length >> 3) - 1,
> >>>>>>>>> +            int opt_order_length = av_ceil_log2(av_clip((bd->block_length >> 3) - 1,
> >>>>>>>>>                                                  2, sconf->max_order + 1));
> >>>>>>>>> -            opt_order            = get_bits(gb, opt_order_length);
> >>>>>>>>> +            bd->opt_order        = get_bits(gb, opt_order_length);
> >>>>>>>>>          } else {
> >>>>>>>>> -            opt_order = sconf->max_order;
> >>>>>>>>> +            bd->opt_order = sconf->max_order;
> >>>>>>>>>          }
> >>>>>>>>>  
> >>>>>>>>> -        if (opt_order) {
> >>>>>>>>> +        if (bd->opt_order) {
> >>>>>>>>>              int add_base;
> >>>>>>>>>  
> >>>>>>>>>              if (sconf->coef_table == 3) {
> >>>>>>>>>                  add_base = 0x7F;
> >>>>>>>>>  
> >>>>>>>>>                  // read coefficient 0
> >>>>>>>>> -                quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>>>> +                bd->quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>>>>  
> >>>>>>>>>                  // read coefficient 1
> >>>>>>>>> -                if (opt_order > 1)
> >>>>>>>>> -                    quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>>>> +                if (bd->opt_order > 1)
> >>>>>>>>> +                    bd->quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
> >>>>>>>>>  
> >>>>>>>>>                  // read coefficients 2 to opt_order
> >>>>>>>>> -                for (k = 2; k < opt_order; k++)
> >>>>>>>>> -                    quant_cof[k] = get_bits(gb, 7);
> >>>>>>>>> +                for (k = 2; k < bd->opt_order; k++)
> >>>>>>>>> +                    bd->quant_cof[k] = get_bits(gb, 7);
> >>>>>>>>>              } else {
> >>>>>>>>>                  int k_max;
> >>>>>>>>>                  add_base = 1;
> >>>>>>>>>  
> >>>>>>>>>                  // read coefficient 0 to 19
> >>>>>>>>> -                k_max = FFMIN(opt_order, 20);
> >>>>>>>>> +                k_max = FFMIN(bd->opt_order, 20);
> >>>>>>>>>                  for (k = 0; k < k_max; k++) {
> >>>>>>>>>                      int rice_param = parcor_rice_table[sconf->coef_table][k][1];
> >>>>>>>>>                      int offset     = parcor_rice_table[sconf->coef_table][k][0];
> >>>>>>>>> -                    quant_cof[k] = decode_rice(gb, rice_param) + offset;
> >>>>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, rice_param) + offset;
> >>>>>>>>>                  }
> >>>>>>>>>  
> >>>>>>>>>                  // read coefficients 20 to 126
> >>>>>>>>> -                k_max = FFMIN(opt_order, 127);
> >>>>>>>>> +                k_max = FFMIN(bd->opt_order, 127);
> >>>>>>>>>                  for (; k < k_max; k++)
> >>>>>>>>> -                    quant_cof[k] = decode_rice(gb, 2) + (k & 1);
> >>>>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, 2) + (k & 1);
> >>>>>>>>>  
> >>>>>>>>>                  // read coefficients 127 to opt_order
> >>>>>>>>> -                for (; k < opt_order; k++)
> >>>>>>>>> -                    quant_cof[k] = decode_rice(gb, 1);
> >>>>>>>>> +                for (; k < bd->opt_order; k++)
> >>>>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, 1);
> >>>>>>>>>  
> >>>>>>>>> -                quant_cof[0] = 32 * parcor_scaled_values[quant_cof[0] + 64];
> >>>>>>>>> +                bd->quant_cof[0] = 32 * parcor_scaled_values[bd->quant_cof[0] + 64];
> >>>>>>>>>  
> >>>>>>>>> -                if (opt_order > 1)
> >>>>>>>>> -                    quant_cof[1] = -32 * parcor_scaled_values[quant_cof[1] + 64];
> >>>>>>>>> +                if (bd->opt_order > 1)
> >>>>>>>>> +                    bd->quant_cof[1] = -32 * parcor_scaled_values[bd->quant_cof[1] + 64];
> >>>>>>>>>              }
> >>>>>>>>>  
> >>>>>>>>> -            for (k = 2; k < opt_order; k++)
> >>>>>>>>> -                quant_cof[k] = (quant_cof[k] << 14) + (add_base << 13);
> >>>>>>>>> +            for (k = 2; k < bd->opt_order; k++)
> >>>>>>>>> +                bd->quant_cof[k] = (bd->quant_cof[k] << 14) + (add_base << 13);
> >>>>>>>> it might make sense to have commonly used variables like opt_order on the
> >>>>>>>> stack intead of just accessable through a pointer
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>>> besides does this patch lead to any slowdown?
> >>>>>>> The benchmark results showed a little slowdown but at least hardly
> >>>>>>> noticable on the command line...
> >>>>>>> We had this issue at the soc list. Ended up with me providing some
> >>>>>>> assembler file grep results for checking inlining done by the compiler.
> >>>>>> was the slowdown fixed or not?
> >>>>>> "not" is bad
> >>>>> It was not. You blamed the many bd-> to be a reason for the slowdown and
> >>>>> requested a check for wrong inlining before considering other
> >>>>> possibilities to make this faster. I'm not able to see anything about
> >>>>> the inlining stuff in assembler files so I provided a grep result you
> >>>>> asked for.
> >>>>>
> >>>>> See:
> >>>>> https://lists.mplayerhq.hu/pipermail/ffmpeg-soc/2009-October/008364.html
> >>>>>
> >>>>> I'm ready to fix that slowdown issue, all I need is a little advise
> >>>>> about how to... the obvious way to use local variables to store
> >>>>> bd->something and removing many dereferences that way did not yield a
> >>>>> significant reduction of the slowdown.
> >>>> grepping for "call" and removing duplicates gives:
> >>>>
> >>>> combined:
> >>>>         call    _parse_bs_info
> >>>>         call    _decode_end
> >>>>         call    _read_block_data
> >>>>         call    _zero_remaining
> >>>>
> >>>> separate:
> >>>>         call    _parse_bs_info
> >>>>         call    _decode_end
> >>>>         call    _decode_var_block_data
> >>>>         call    _zero_remaining
> >>>>         call    _read_var_block_data
> >>>>         call    _decode_blocks_ind
> >>>>
> >>>> It seems that decode_blocks_ind is inlined in the combined version but
> >>>> not in the separate version.
> >> Thanks for that!
> >>
> >>> so, does adding always_inline help?
> >>> (note, expect that adding this to one function will make gcc stop
> >>>  inlining other unrelated functions so you might need many always_inline
> >>>  to force gcc not to be silly)
> >> I've redone all the benchmarking and attached the results in a text
> >> file. The benchmarked stops time for the call to read_frame_data() in
> >> decode_frame(). I've printed the functions with av_always_inline, the
> >> benchmark results and the grep result for calls in the assembler code.
> >> All done using gcc 4.0.
> >>
> >> Unfortunately, using av_always_inline for many functions (so that the
> >> same calls within alsdec.c remain) reveals many other functions not
> >> inlined anymore and therefore seems not to reduce the slowdown. And as
> >> you said, just av_always_inline'ing decode_blocks_ind lets many other
> >> functione not being inlined any more...
> >>
> >> Anything else I can test?
> > 
> > There are varous command line switches for gcc that control inlining
> > also
> > You could report this to the gcc devels, not sure what that would lead
> > to but i guess it might be worth a try.
> > 
> > What is certain though is that code split / factorization that reduces
> > speed by 10% is rejected
> 
> Compilation of aldec.c is done using -O3 which enables the common
> inlining optimizations. One other is untouched by this, -finline-limit.
> I could not figure out what the default value for this is, but I tested
> the seperate version (pure patch) without using av_always_inline
> anywhere. I set it by chance to 4096 and it ran as fast as the combined
> version.
> 
> 8028640 dezicycles in sep, 1 runs, 0 skips
> 9019255 dezicycles in sep, 2 runs, 0 skips
> 12334895 dezicycles in sep, 4 runs, 0 skips
> 15384680 dezicycles in sep, 8 runs, 0 skips
> 15415286 dezicycles in sep, 16 runs, 0 skips
> 15224882 dezicycles in sep, 32 runs, 0 skips
> 15370329 dezicycles in sep, 64 runs, 0 skips
> 
>    2 	call	_decode_blocks_ind
>    4 	call	_decode_end
>   36 	call	_decode_rice
>   10 	call	_get_bits_long
>   11 	call	_parse_bs_info
>    2 	call	_zero_remaining
> 
> The assembler output points out, that decode_blocks_ind() is still not
> inlines without av_always_inline, but {read,decode}_var_block_data() are
> (compare with qouted output above).
> 
> Using -finline_limit with the combined version does not yield any speed
> gain.
> 

> So, is using -finline_limit an option? If so, someone with more
> experience with the gcc might suggest a better values than 4096 for it.

its an option if you can limit it to the file that needs it, global changes
would need alot of testing (nothing wrong with global change if it is
globally faster and people do the testing of many codecs many all cpus and
compilers but thats a big task ...)

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No great genius has ever existed without some touch of madness. -- Aristotle
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20091126/538b98db/attachment.pgp>



More information about the ffmpeg-devel mailing list