[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Vitor Sessak vitor1001
Sun Jun 20 11:09:20 CEST 2010


On 06/20/2010 12:35 AM, Michael Niedermayer wrote:
> On Sun, Jun 20, 2010 at 12:24:45AM +0200, Vitor Sessak wrote:
>> On 06/20/2010 12:03 AM, Michael Niedermayer wrote:
>>> On Sat, Jun 19, 2010 at 09:25:51PM +0200, Vitor Sessak wrote:
>>>> On 06/19/2010 08:47 PM, Michael Niedermayer wrote:
>>>>> On Fri, Jun 11, 2010 at 11:34:21PM +0200, Vitor Sessak wrote:
>>>>>> On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
>>>>>>> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>>>>>>>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>>>>>>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>>>>>>>> Moving discussion to -devel...
>>>>>>>>>>
>>>>>>>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>>>>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>>>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM,
>>>>>>>>>>>>>>> michael<subversion at mplayerhq.hu>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Author: michael
>>>>>>>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>>>>>>>> New Revision: 23095
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Log:
>>>>>>>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> simd optimize it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly
>>>>>>>>>>>>> hack
>>>>>>>>>>>>> to test
>>>>>>>>>>>>> it attached.
>>>>>>>>>>>>
>>>>>>>>>>>> if dct32() is faster then it should be used by our generic dct
>>>>>>>>>>>> code.
>>>>>>>>>>>> at least for the plain C case
>>>>>>>>>>>
>>>>>>>>>>> I've given a try at a SSE dct32(). It is much faster than current
>>>>>>>>>>> C
>>>>>>>>>>> code. The only problem is that current code in mpegaudiodec.c
>>>>>>>>>>> expect
>>>>>>>>>>> two
>>>>>>>>>>> arguments, one input (which is destructed) and one output. ITOH,
>>>>>>>>>>> ff_dct_calc() does everything in-place, what does not glue well
>>>>>>>>>>> with
>>>>>>>>>>> the
>>>>>>>>>>> current mpegaudiodec.c code. Can you (or anyone else that knows
>>>>>>>>>>> mpegaudiodec.c well) fix it?
>>>>>>>>>>
>>>>>>>>>> I've given a try of making mpegaudiodec.c use the same buffer for
>>>>>>>>>> dct
>>>>>>>>>> input
>>>>>>>>>> and output and it is not trivial. It is much easier (and has no
>>>>>>>>>> measurable
>>>>>>>>>> slowdown) to make ff_dct_calc() take both an input and an output
>>>>>>>>>> pointer
>>>>>>>>>> as
>>>>>>>>>> in attached patch.
>>>>>>>>>>
>>>>>>>>>> -Vitor
>>>>>>>>>
>>>>>>>>>>       avfft.c     |    2 +-
>>>>>>>>>>       binkaudio.c |    2 +-
>>>>>>>>>>       dct.c       |   40 +++++++++++++++++++++++-----------------
>>>>>>>>>>       fft-test.c  |    6 ++----
>>>>>>>>>>       fft.h       |   11 +++++++++--
>>>>>>>>>>       wmavoice.c  |    4 ++--
>>>>>>>>>>       6 files changed, 38 insertions(+), 27 deletions(-)
>>>>>>>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16  dct_inout.diff
>>>>>>>>>
>>>>>>>>> ok if tested and no slowdown is confirmed
>>>>>>>>
>>>>>>>> I retested carefully and found a 3% slowdown. It is due to aliasing,
>>>>>>>> which
>>>>>>>> does not allow the compiler to unroll the loops. I tested unrolling
>>>>>>>> by
>>>>>>>> hand
>>>>>>>> the loops and afterwards it is as fast as before.
>>>>>>>>
>>>>>>>> Are you ok with the patch as is or ok if I apply another patch
>>>>>>>> afterwards
>>>>>>>> unrolling the loops?
>>>>>>>
>>>>>>> i think that a 3% speedloss is significant so iam definitly not ok
>>>>>>> with
>>>>>>> something that leads to such speedloss.
>>>>>>>
>>>>>>> also if yu test this patch + unroll against svn, i wonder how
>>>>>>> svn+unroll performs
>>>>>>> as well as what code cache effects the unroll actually has in actual
>>>>>>> use
>>>>>>
>>>>>> Ok, I took some time to test it really careful and I gave up making a
>>>>>> code
>>>>>> as fast as in-place (to begin with, gcc always get register-starved).
>>>>>> So
>>>>>> I
>>>>>> propose the attached patch. At least the faster code can be used by the
>>>>>> common DCT framework and it makes easier to add ASM optimisations.
>>>>>
>>>>> looks like you created a mess
>>>>> The questions are
>>>>> 1. what is the ideal way speedwise to implement this for mp3?
>>>>
>>>> Inlining either the DCT32 C code (or when available the ASM code) in
>>>> mpegaudiodec.c since there is only speed gain and no loss in inlining -
>>>> it
>>>> is called in just one place. BTW, I do not believe that optimizing (simd
>>>> asm or C) my general pre-process->FFT->post-process dct implementation
>>>> might beat the 32-point specific code. Allowing for different
>>>> input/output
>>>> pairs or not is irrelevant for the current C or ASM implementation (nor
>>>> could I imagine any difference for a future one).
>>>
>>> i didnt mean inlining i meant the in!= out
>>
>> I don't remember seeing a big difference _for the dct32 code_ between in ==
>> out and in != out.
>
> now iam confused, i thought the 3% you quoted was about in ==out vs in!= out
> ?

No, the 3% slowdown was when converting our general code (using FFT) to 
have in != out.

-Vitor



More information about the ffmpeg-devel mailing list