[FFmpeg-devel] [PATCH] Optimization of AMR NB and WB decoders for MIPS

Mon May 28 17:45:21 CEST 2012

>>>>> If you do such unrolling, please say in the documentation of the
>>>>> function (in acelp_filter.h) that n should be a multiple of eight or
>>>>> any other thing that it requires. Also, please note that this code is
>>>>> not only used for AMR, but also for other vocoders. Can you check if
>>>>> the other uses respect the multiple-of-eight constraint?
>>>> We checked if unrolling can be done (regarding multiple-of-eight constraint) in all
>>>> the places where this function is called. The same goes for all the function where
>>>> we done loop unrolling.
>>>> Comment will be put in documentation of the function. Should I specify that the
>>>> comment is just for MIPS optimizations?
>>>>
>>>>> Hmm, am I missing something or you are doing all this to save a single
>>>>> pre-calculation of (1 + lsp[lp_order - 1])?
>>>> You are correct. We decided to optimize this function (in respect to our profiling
>>>> results)and this was the only meaningful thing to do with our instruction set. We got
>>>> some speed up on this function and decided to leave it in the code taking in account
>>>> document optimization.txt (section: When is an optimization justified? ).
>>>
>>> Can you attach the ASM compiler output of the C version and yours? I
>>> don't really understand why your is faster. Maybe the compiler is to
>>> dump to see that lsp and qa does not alias...
>> In attached files lsp_c.S and lsp_mips.S are ASM compiler outputs for
>> C version and our version respectively.
>> I didn't explained main reason for this optimization correctly. As you can
>> see in ASM outputs, compiler doesn't use FP MADD instructions although it can
>> benefit from them in this case. That is the main reason for this optimization.
>
>How can does exactly ff_acelp_lspd2lpc() benefit from FP MADD? By saving 
>a single add in every call of the function? Unless madds are faster than 
>plain muls, it doesn't looks like a benchmarkable gain. I'd say it has 
>more to do with allowing the compiler to inline the call to the mips 
>version of ff_lsp2polyf()...
>
>Can you do a test? Copy the unmodified C version of ff_acelp_lspd2lpc to 
>mips/amrwb_lsp2lpc.h and see if it is any slower than the madd version.
>
>-Vitor
You are correct. The gain is insignificant and I'll remove this code from new patch.

-Nedeljko
________________________________________
From: Vitor Sessak [vitor1001 at gmail.com]
Sent: Friday, May 25, 2012 22:46
To: Babic, Nedeljko
Cc: FFmpeg development discussions and patches; Lukac, Zeljko
Subject: Re: [FFmpeg-devel] [PATCH] Optimization of AMR NB and WB decoders for MIPS

On 05/24/2012 06:30 PM, Babic, Nedeljko wrote:
>>> Hi Vitor,
>>>
>>>> av_always_inline in a .c file? I don't think it can work...
>>> you are right. I'll change this.
>>>
>>>> Here and in several other places: in general, it is a good idea to do
>>>> your loops also in ASM. That way, you can be sure that the compiler
>>>> cannot slow-down a otherwise good code.
>>> Good point. However, compilers are generally doing good job optimizing this
>>> kind of code for MIPS. Anyhow, loops are left in C only in places where
>>> compilers can’t do a lot of damage, or where it would be too costly to do the
>>> loops in ASM (for example because of constraint on number of named registers
>>> that can be used in inline assembly)
>>
>> Would you mind compiling these files with -S and send the ASM output?
>> One useless instruction in a inner loop can be rather costly.
> In attached file acelp_filters_mips.S you can see what compiler generates for loop
> in question.
> Compiler adds only counter increment in the loop: "addiu      $11,$11,1".
> Taking in consideration that architecture for which this was optimized has dual
> out-of-order instruction issue and that FPU has separate in-order dual-issue pipeline
> decoupled from integer pipeline this doesn't bring cost in performance.
> On the other hand I have to admit that for older architectures it would be better to
> write entire loop in ASM and counter incrementing would be eliminated.
> So my question is do you want me to rewrite loops in ASM for cases like this?

Indeed, it looks like the compiler do a mostly good job. So I'm fine
with leaving the loops in C.

>>>> If you do such unrolling, please say in the documentation of the
>>>> function (in acelp_filter.h) that n should be a multiple of eight or
>>>> any other thing that it requires. Also, please note that this code is
>>>> not only used for AMR, but also for other vocoders. Can you check if
>>>> the other uses respect the multiple-of-eight constraint?
>>> We checked if unrolling can be done (regarding multiple-of-eight constraint) in all
>>> the places where this function is called. The same goes for all the function where
>>> we done loop unrolling.
>>> Comment will be put in documentation of the function. Should I specify that the
>>> comment is just for MIPS optimizations?
>>>
>>>> Hmm, am I missing something or you are doing all this to save a single
>>>> pre-calculation of (1 + lsp[lp_order - 1])?
>>> You are correct. We decided to optimize this function (in respect to our profiling
>>> results)and this was the only meaningful thing to do with our instruction set. We got
>>> some speed up on this function and decided to leave it in the code taking in account
>>> document optimization.txt (section: When is an optimization justified? ).
>>
>> Can you attach the ASM compiler output of the C version and yours? I
>> don't really understand why your is faster. Maybe the compiler is to
>> dump to see that lsp and qa does not alias...
> In attached files lsp_c.S and lsp_mips.S are ASM compiler outputs for
> C version and our version respectively.
> I didn't explained main reason for this optimization correctly. As you can
> see in ASM outputs, compiler doesn't use FP MADD instructions although it can
> benefit from them in this case. That is the main reason for this optimization.

How can does exactly ff_acelp_lspd2lpc() benefit from FP MADD? By saving
a single add in every call of the function? Unless madds are faster than
plain muls, it doesn't looks like a benchmarkable gain. I'd say it has
more to do with allowing the compiler to inline the call to the mips
version of ff_lsp2polyf()...

Can you do a test? Copy the unmodified C version of ff_acelp_lspd2lpc to
mips/amrwb_lsp2lpc.h and see if it is any slower than the madd version.

-Vitor