[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Vitor Sessak vitor1001
Sun Jun 20 00:24:45 CEST 2010


On 06/20/2010 12:03 AM, Michael Niedermayer wrote:
> On Sat, Jun 19, 2010 at 09:25:51PM +0200, Vitor Sessak wrote:
>> On 06/19/2010 08:47 PM, Michael Niedermayer wrote:
>>> On Fri, Jun 11, 2010 at 11:34:21PM +0200, Vitor Sessak wrote:
>>>> On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
>>>>> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>>>>>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>>>>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>>>>>> Moving discussion to -devel...
>>>>>>>>
>>>>>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM,
>>>>>>>>>>>>> michael<subversion at mplayerhq.hu>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Author: michael
>>>>>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>>>>>> New Revision: 23095
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Log:
>>>>>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>> :)
>>>>>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct
>>>>>>>>>>>> and
>>>>>>>>>>>> or
>>>>>>>>>>>> simd optimize it?
>>>>>>>>>>>
>>>>>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly
>>>>>>>>>>> hack
>>>>>>>>>>> to test
>>>>>>>>>>> it attached.
>>>>>>>>>>
>>>>>>>>>> if dct32() is faster then it should be used by our generic dct
>>>>>>>>>> code.
>>>>>>>>>> at least for the plain C case
>>>>>>>>>
>>>>>>>>> I've given a try at a SSE dct32(). It is much faster than current C
>>>>>>>>> code. The only problem is that current code in mpegaudiodec.c expect
>>>>>>>>> two
>>>>>>>>> arguments, one input (which is destructed) and one output. ITOH,
>>>>>>>>> ff_dct_calc() does everything in-place, what does not glue well with
>>>>>>>>> the
>>>>>>>>> current mpegaudiodec.c code. Can you (or anyone else that knows
>>>>>>>>> mpegaudiodec.c well) fix it?
>>>>>>>>
>>>>>>>> I've given a try of making mpegaudiodec.c use the same buffer for dct
>>>>>>>> input
>>>>>>>> and output and it is not trivial. It is much easier (and has no
>>>>>>>> measurable
>>>>>>>> slowdown) to make ff_dct_calc() take both an input and an output
>>>>>>>> pointer
>>>>>>>> as
>>>>>>>> in attached patch.
>>>>>>>>
>>>>>>>> -Vitor
>>>>>>>
>>>>>>>>      avfft.c     |    2 +-
>>>>>>>>      binkaudio.c |    2 +-
>>>>>>>>      dct.c       |   40 +++++++++++++++++++++++-----------------
>>>>>>>>      fft-test.c  |    6 ++----
>>>>>>>>      fft.h       |   11 +++++++++--
>>>>>>>>      wmavoice.c  |    4 ++--
>>>>>>>>      6 files changed, 38 insertions(+), 27 deletions(-)
>>>>>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16  dct_inout.diff
>>>>>>>
>>>>>>> ok if tested and no slowdown is confirmed
>>>>>>
>>>>>> I retested carefully and found a 3% slowdown. It is due to aliasing,
>>>>>> which
>>>>>> does not allow the compiler to unroll the loops. I tested unrolling by
>>>>>> hand
>>>>>> the loops and afterwards it is as fast as before.
>>>>>>
>>>>>> Are you ok with the patch as is or ok if I apply another patch
>>>>>> afterwards
>>>>>> unrolling the loops?
>>>>>
>>>>> i think that a 3% speedloss is significant so iam definitly not ok with
>>>>> something that leads to such speedloss.
>>>>>
>>>>> also if yu test this patch + unroll against svn, i wonder how
>>>>> svn+unroll performs
>>>>> as well as what code cache effects the unroll actually has in actual use
>>>>
>>>> Ok, I took some time to test it really careful and I gave up making a
>>>> code
>>>> as fast as in-place (to begin with, gcc always get register-starved). So
>>>> I
>>>> propose the attached patch. At least the faster code can be used by the
>>>> common DCT framework and it makes easier to add ASM optimisations.
>>>
>>> looks like you created a mess
>>> The questions are
>>> 1. what is the ideal way speedwise to implement this for mp3?
>>
>> Inlining either the DCT32 C code (or when available the ASM code) in
>> mpegaudiodec.c since there is only speed gain and no loss in inlining - it
>> is called in just one place. BTW, I do not believe that optimizing (simd
>> asm or C) my general pre-process->FFT->post-process dct implementation
>> might beat the 32-point specific code. Allowing for different input/output
>> pairs or not is irrelevant for the current C or ASM implementation (nor
>> could I imagine any difference for a future one).
>
> i didnt mean inlining i meant the in!= out

I don't remember seeing a big difference _for the dct32 code_ between in 
== out and in != out.

> which leads to
> 128 byte data cache difference,

Might be significant for some systems, but my patch does not get in the 
way of passing the same pointer as input and output.

> 1 additional pointer to init and

True, but not a very slow op.

> 1 register less

Yes, but this code uses only two pointers and float vars, the compiler 
should be very stupid to get register-starved.

I'll do some proper benchmarking tomorrow.

-Vitor



More information about the ffmpeg-devel mailing list