[FFmpeg-devel] [PATCH] Move MLP's dot product to DSPContext

Wed Apr 22 09:01:26 CEST 2009

On Mon, Apr 20, 2009 at 8:32 PM, Ramiro Polla <ramiro.polla at gmail.com> wrote:
> On Tue, Apr 21, 2009 at 12:29 AM, Jason Garrett-Glaser
> <darkshikari at gmail.com> wrote:
>> 2009/4/20 Ramiro Polla <ramiro.polla at gmail.com>:
>>> On Mon, Apr 20, 2009 at 9:40 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>> On Mon, Apr 20, 2009 at 02:29:09AM -0300, Ramiro Polla wrote:
>>>>> On Mon, Apr 20, 2009 at 12:14 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>>> > On Sun, Apr 19, 2009 at 10:10:05PM -0300, Ramiro Polla wrote:
>>>>> >> The attached file moves MLP's dot product to DSPContext. The filter order
>>>>> >> is a maximum of 8, and in the rematrix stage it's a maximum of 5+2
>>>>> >> channels for MLP and 7+0 channels for TrueHD, so it all fits in 8
>>>>> >> (hopefully) optimized functions.
>>>>> >
>>>>> > the functions are too small, the call overhead is too much
>>>>> > 1-8 multiplications and 1-8 additions is not enough ...
>>>>>
>>>>> I thought that would happen too, but strangely there was a speedup.
>>>>
>>>> you wrote the whole function in asm() and that was slower?
>>>
>>> Attached are three asm variants: sse2, sse4, and altivec.
>>>
>>> Here are the benchmarks:
>
> [...]
>
>>> - on x86_64 (can't run sse4)
>>> current:  2070ms
>>> array of functions in dspcontext:
>>> c      :  2600ms (badly vectorized)
>>> c      :  1920ms (not vectorized)
>>> sse2   :  2450ms
>>> inlined in mlpdec.c:
>>> c      :  2800ms (badly vectorized)
>>> c      :  1980ms (not vectorized)
>>> sse2   :  2450ms
>>
>> Have you tried benching it on a 64-bit system with SSE4?
>
> No. I don't have access to any.

I strongly suspect that C code on 64-bit will outperform your SSE4
loop, because each 64-bit intermediate result fits in a single
64-bit register.
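
For illustration, here is a minimal sketch of the kind of scalar C loop
meant here. It is not taken from the patch: the function name
dot_product_64 and its parameters are hypothetical, and it assumes 32-bit
samples and coefficients whose products are summed in a 64-bit accumulator
and then shifted down. On a 64-bit target the accumulator can live in one
general-purpose register, whereas a 32-bit build needs a register pair.

```c
#include <stdint.h>

/* Hypothetical helper, not the code from the patch: dot product of
 * `order` 32-bit samples and coefficients (order is at most 8 in the
 * MLP case discussed above), accumulated in a single 64-bit integer. */
static int32_t dot_product_64(const int32_t *samples, const int32_t *coeffs,
                              int order, int shift)
{
    int64_t accum = 0;

    for (int i = 0; i < order; i++)
        accum += (int64_t)samples[i] * coeffs[i]; /* widen before multiplying */

    /* Scale the 64-bit sum back down to 32 bits. */
    return (int32_t)(accum >> shift);
}
```

Whether this actually beats a SIMD version would still have to be
benchmarked on a 64-bit machine with SSE4.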

Dark Shikari