[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Vitor Sessak vitor1001
Sat Jun 19 21:25:51 CEST 2010

On 06/19/2010 08:47 PM, Michael Niedermayer wrote:
> On Fri, Jun 11, 2010 at 11:34:21PM +0200, Vitor Sessak wrote:
>> On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
>>> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>>>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>>>> Moving discussion to -devel...
>>>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM, michael<subversion at mplayerhq.hu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Author: michael
>>>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>>>> New Revision: 23095
>>>>>>>>>>>> Log:
>>>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>> Thanks
>>>>>>>>>> :)
>>>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct and
>>>>>>>>>> or
>>>>>>>>>> simd optimize it?
>>>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly hack
>>>>>>>>> to test
>>>>>>>>> it attached.
>>>>>>>> if dct32() is faster then it should be used by our generic dct code.
>>>>>>>> at least for the plain C case
>>>>>>> I've given a try at a SSE dct32(). It is much faster than current C
>>>>>>> code. The only problem is that current code in mpegaudiodec.c expect
>>>>>>> two
>>>>>>> arguments, one input (which is destructed) and one output. ITOH,
>>>>>>> ff_dct_calc() does everything in-place, what does not glue well with
>>>>>>> the
>>>>>>> current mpegaudiodec.c code. Can you (or anyone else that knows
>>>>>>> mpegaudiodec.c well) fix it?
>>>>>> I've given a try of making mpegaudiodec.c use the same buffer for dct
>>>>>> input
>>>>>> and output and it is not trivial. It is much easier (and has no
>>>>>> measurable
>>>>>> slowdown) to make ff_dct_calc() take both an input and an output
>>>>>> pointer
>>>>>> as
>>>>>> in attached patch.
>>>>>> -Vitor
>>>>>>     avfft.c     |    2 +-
>>>>>>     binkaudio.c |    2 +-
>>>>>>     dct.c       |   40 +++++++++++++++++++++++-----------------
>>>>>>     fft-test.c  |    6 ++----
>>>>>>     fft.h       |   11 +++++++++--
>>>>>>     wmavoice.c  |    4 ++--
>>>>>>     6 files changed, 38 insertions(+), 27 deletions(-)
>>>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16  dct_inout.diff
>>>>> ok if tested and no slowdown is confirmed
>>>> I retested carefully and found a 3% slowdown. It is due to aliasing,
>>>> which
>>>> does not allow the compiler to unroll the loops. I tested unrolling by
>>>> hand
>>>> the loops and afterwards it is as fast as before.
>>>> Are you ok with the patch as is or ok if I apply another patch afterwards
>>>> unrolling the loops?
>>> i think that a 3% speedloss is significant so iam definitly not ok with
>>> something that leads to such speedloss.
>>> also if yu test this patch + unroll against svn, i wonder how
>>> svn+unroll performs
>>> as well as what code cache effects the unroll actually has in actual use
>> Ok, I took some time to test it really careful and I gave up making a code
>> as fast as in-place (to begin with, gcc always get register-starved). So I
>> propose the attached patch. At least the faster code can be used by the
>> common DCT framework and it makes easier to add ASM optimisations.
> looks like you created a mess
> The questions are
> 1. what is the ideal way speedwise to implement this for mp3?

Inlining either the DCT32 C code (or when available the ASM code) in 
mpegaudiodec.c since there is only speed gain and no loss in inlining - 
it is called in just one place. BTW, I do not believe that optimizing 
(simd asm or C) my general pre-process->FFT->post-process dct 
implementation might beat the 32-point specific code. Allowing for 
different input/output pairs or not is irrelevant for the current C or 
ASM implementation (nor could I imagine any difference for a future one).

> 2. Is what you implement a step toward this implementation


> 3. If 2. is NO, then how much speed would we loose in the end

A supplementary load of a function pointer and a call.

I'm pretty sure you also care about the following questions:

3. Does the patch introduce some enormous code duplication (having a 
float and a fixed-point copies of the dct32() C implementation)?
4. Is this much faster dct usable from our dct framework?
5. How messy is the code?

> The problem is that taking small steps toward correct code is fine even if
> the code is incomplete and disabled, but putting a dct() in place that
> is wrong from a highlevel optimization point of view and
> then optimizing this wrong code in asm is one of the most retarded ways
> by which one could waste time. All asm optimisations would be eventually
> thrown away once the dct() api is fixed and then we would also have to
> deal with the troll(s) who whine how terrible it is to have a temporary
> 10% speedloss due to this being fixed and once new asm is in place a 13%
> speedup

My 3% speed loss to make input != output in dct.c was not to allow some 
future speed gain but to allow code to be less messy (no special 
function pointer to the 32-point dct).


More information about the ffmpeg-devel mailing list