[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Sun Jun 20 00:03:47 CEST 2010

On Sat, Jun 19, 2010 at 09:25:51PM +0200, Vitor Sessak wrote:
> On 06/19/2010 08:47 PM, Michael Niedermayer wrote:
>> On Fri, Jun 11, 2010 at 11:34:21PM +0200, Vitor Sessak wrote:
>>> On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
>>>> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>>>>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>>>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>>>>> Moving discussion to -devel...
>>>>>>>
>>>>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM, 
>>>>>>>>>>>> michael<subversion at mplayerhq.hu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Author: michael
>>>>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>>>>> New Revision: 23095
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log:
>>>>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>> :)
>>>>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct 
>>>>>>>>>>> and
>>>>>>>>>>> or
>>>>>>>>>>> simd optimize it?
>>>>>>>>>>
>>>>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly 
>>>>>>>>>> hack
>>>>>>>>>> to test
>>>>>>>>>> it attached.
>>>>>>>>>
>>>>>>>>> if dct32() is faster then it should be used by our generic dct 
>>>>>>>>> code.
>>>>>>>>> at least for the plain C case
>>>>>>>>
>>>>>>>> I've given a try at a SSE dct32(). It is much faster than current C
>>>>>>>> code. The only problem is that current code in mpegaudiodec.c expect
>>>>>>>> two
>>>>>>>> arguments, one input (which is destructed) and one output. ITOH,
>>>>>>>> ff_dct_calc() does everything in-place, what does not glue well with
>>>>>>>> the
>>>>>>>> current mpegaudiodec.c code. Can you (or anyone else that knows
>>>>>>>> mpegaudiodec.c well) fix it?
>>>>>>>
>>>>>>> I've given a try of making mpegaudiodec.c use the same buffer for dct
>>>>>>> input
>>>>>>> and output and it is not trivial. It is much easier (and has no
>>>>>>> measurable
>>>>>>> slowdown) to make ff_dct_calc() take both an input and an output
>>>>>>> pointer
>>>>>>> as
>>>>>>> in attached patch.
>>>>>>>
>>>>>>> -Vitor
>>>>>>
>>>>>>>     avfft.c     |    2 +-
>>>>>>>     binkaudio.c |    2 +-
>>>>>>>     dct.c       |   40 +++++++++++++++++++++++-----------------
>>>>>>>     fft-test.c  |    6 ++----
>>>>>>>     fft.h       |   11 +++++++++--
>>>>>>>     wmavoice.c  |    4 ++--
>>>>>>>     6 files changed, 38 insertions(+), 27 deletions(-)
>>>>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16  dct_inout.diff
>>>>>>
>>>>>> ok if tested and no slowdown is confirmed
>>>>>
>>>>> I retested carefully and found a 3% slowdown. It is due to aliasing,
>>>>> which
>>>>> does not allow the compiler to unroll the loops. I tested unrolling by
>>>>> hand
>>>>> the loops and afterwards it is as fast as before.
>>>>>
>>>>> Are you ok with the patch as is or ok if I apply another patch 
>>>>> afterwards
>>>>> unrolling the loops?
>>>>
>>>> i think that a 3% speedloss is significant so iam definitly not ok with
>>>> something that leads to such speedloss.
>>>>
>>>> also if yu test this patch + unroll against svn, i wonder how
>>>> svn+unroll performs
>>>> as well as what code cache effects the unroll actually has in actual use
>>>
>>> Ok, I took some time to test it really careful and I gave up making a 
>>> code
>>> as fast as in-place (to begin with, gcc always get register-starved). So 
>>> I
>>> propose the attached patch. At least the faster code can be used by the
>>> common DCT framework and it makes easier to add ASM optimisations.
>>
>> looks like you created a mess
>> The questions are
>> 1. what is the ideal way speedwise to implement this for mp3?
>
> Inlining either the DCT32 C code (or when available the ASM code) in 
> mpegaudiodec.c since there is only speed gain and no loss in inlining - it 
> is called in just one place. BTW, I do not believe that optimizing (simd 
> asm or C) my general pre-process->FFT->post-process dct implementation 
> might beat the 32-point specific code. Allowing for different input/output 
> pairs or not is irrelevant for the current C or ASM implementation (nor 
> could I imagine any difference for a future one).

i didnt mean inlining i meant the in!= out which leads to
128 byte data cache difference,1 additional pointer to init and 1 register
less

>
>> 2. Is what you implement a step toward this implementation
>
> No.
>
>> 3. If 2. is NO, then how much speed would we loose in the end
>
> A supplementary load of a function pointer and a call.
>
> I'm pretty sure you also care about the following questions:
>
> 3. Does the patch introduce some enormous code duplication (having a float 
> and a fixed-point copies of the dct32() C implementation)?

i didnt mean that it should be inline
the in != out leads to duplication though as the existing code is in==out

> 4. Is this much faster dct usable from our dct framework?

is the in!=out useable? or does it need an additional wraper function
or duplication in the object file?

> 5. How messy is the code?

inlining would be messy like this, it wasnt what i meant though

>
>> The problem is that taking small steps toward correct code is fine even if
>> the code is incomplete and disabled, but putting a dct() in place that
>> is wrong from a highlevel optimization point of view and
>> then optimizing this wrong code in asm is one of the most retarded ways
>> by which one could waste time. All asm optimisations would be eventually
>> thrown away once the dct() api is fixed and then we would also have to
>> deal with the troll(s) who whine how terrible it is to have a temporary
>> 10% speedloss due to this being fixed and once new asm is in place a 13%
>> speedup
>
> My 3% speed loss to make input != output in dct.c was not to allow some 
> future speed gain but to allow code to be less messy (no special function 
> pointer to the 32-point dct).

i dont think i understand what you say, and it seems you misunderstand what
i meant

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Avoid a single point of failure, be that a person or equipment.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100620/d00e7625/attachment.pgp>