[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Sat Jun 19 20:47:30 CEST 2010

On Fri, Jun 11, 2010 at 11:34:21PM +0200, Vitor Sessak wrote:
> On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
>> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>>> Moving discussion to -devel...
>>>>>
>>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM, michael<subversion at mplayerhq.hu>
>>>>>>>>>> wrote:
>>>>>>>>>>> Author: michael
>>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>>> New Revision: 23095
>>>>>>>>>>>
>>>>>>>>>>> Log:
>>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>> :)
>>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct and
>>>>>>>>> or
>>>>>>>>> simd optimize it?
>>>>>>>>
>>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly hack
>>>>>>>> to test
>>>>>>>> it attached.
>>>>>>>
>>>>>>> if dct32() is faster then it should be used by our generic dct code.
>>>>>>> at least for the plain C case
>>>>>>
>>>>>> I've given a try at a SSE dct32(). It is much faster than current C
>>>>>> code. The only problem is that current code in mpegaudiodec.c expect 
>>>>>> two
>>>>>> arguments, one input (which is destructed) and one output. ITOH,
>>>>>> ff_dct_calc() does everything in-place, what does not glue well with 
>>>>>> the
>>>>>> current mpegaudiodec.c code. Can you (or anyone else that knows
>>>>>> mpegaudiodec.c well) fix it?
>>>>>
>>>>> I've given a try of making mpegaudiodec.c use the same buffer for dct
>>>>> input
>>>>> and output and it is not trivial. It is much easier (and has no
>>>>> measurable
>>>>> slowdown) to make ff_dct_calc() take both an input and an output 
>>>>> pointer
>>>>> as
>>>>> in attached patch.
>>>>>
>>>>> -Vitor
>>>>
>>>>>    avfft.c     |    2 +-
>>>>>    binkaudio.c |    2 +-
>>>>>    dct.c       |   40 +++++++++++++++++++++++-----------------
>>>>>    fft-test.c  |    6 ++----
>>>>>    fft.h       |   11 +++++++++--
>>>>>    wmavoice.c  |    4 ++--
>>>>>    6 files changed, 38 insertions(+), 27 deletions(-)
>>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16  dct_inout.diff
>>>>
>>>> ok if tested and no slowdown is confirmed
>>>
>>> I retested carefully and found a 3% slowdown. It is due to aliasing, 
>>> which
>>> does not allow the compiler to unroll the loops. I tested unrolling by 
>>> hand
>>> the loops and afterwards it is as fast as before.
>>>
>>> Are you ok with the patch as is or ok if I apply another patch afterwards
>>> unrolling the loops?
>>
>> i think that a 3% speedloss is significant so iam definitly not ok with
>> something that leads to such speedloss.
>>
>> also if yu test this patch + unroll against svn, i wonder how
>> svn+unroll performs
>> as well as what code cache effects the unroll actually has in actual use
>
> Ok, I took some time to test it really careful and I gave up making a code 
> as fast as in-place (to begin with, gcc always get register-starved). So I 
> propose the attached patch. At least the faster code can be used by the 
> common DCT framework and it makes easier to add ASM optimisations.

looks like you created a mess
The questions are
1. what is the ideal way speedwise to implement this for mp3?
2. Is what you implement a step toward this implementation
3. If 2. is NO, then how much speed would we loose in the end

The problem is that taking small steps toward correct code is fine even if
the code is incomplete and disabled, but putting a dct() in place that
is wrong from a highlevel optimization point of view and
then optimizing this wrong code in asm is one of the most retarded ways
by which one could waste time. All asm optimisations would be eventually
thrown away once the dct() api is fixed and then we would also have to 
deal with the troll(s) who whine how terrible it is to have a temporary
10% speedloss due to this being fixed and once new asm is in place a 13%
speedup

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Republics decline into democracies and democracies degenerate into
despotisms. -- Aristotle
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100619/48d1534d/attachment.pgp>