[FFmpeg-devel] [PATCH] Move MLP's dot product to DSPContext
Thu Apr 30 14:31:41 CEST 2009
On Wed, Apr 29, 2009 at 9:58 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Wed, Apr 29, 2009 at 01:15:14AM -0300, Ramiro Polla wrote:
>> On Tue, Apr 21, 2009 at 12:31 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Mon, Apr 20, 2009 at 11:01:10PM -0300, Ramiro Polla wrote:
>> >> On Mon, Apr 20, 2009 at 9:40 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> > On Mon, Apr 20, 2009 at 02:29:09AM -0300, Ramiro Polla wrote:
>> >> >> On Mon, Apr 20, 2009 at 12:14 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >> > On Sun, Apr 19, 2009 at 10:10:05PM -0300, Ramiro Polla wrote:
>> >> >> >> Attached file move MLP's dot product to DSPContext. The filter order
>> >> >> >> is a maximum of 8, and in the rematrix stage it's a maximum of 5+2
>> >> >> >> channels for MLP and 7+0 channels for TrueHD, so it all fits in 8
>> >> >> >> (hopefully) optimized functions.
>> >> >> >
>> >> >> > the functions are too small, the call overhead is too much
>> >> >> > 1-8 multiplicatons and 1-8 additions is not enough ...
>> >> >>
>> >> >> I thought that would happen too, but strangely there was a speedup.
>> >> >
>> >> > you wrote the whole function in asm() and that was slower?
>> >> Attached are three asm variants: sse2, sse4, and altivec.
>> > 1. i meant non SIMD asm :)
>> > If one wanted to do this in SIMD, it should do several channels
>> > at once, or FIR & IIR at once or several blocks at once, then
>> > SIMD should be faster but as is its not SIMD friendly
>> > 2. i mean the whole outer function not the 1-8 arithemtic ops one
>> > this one as said is too small, the call overhead will kill it when
>> > you try the same code (that is asm not gcc deoptiranomized C)
>> > ahh and note, gcc and 64 operations -> very poor code, naive asm
>> > will be much faster at least it was that way in the past ...
>> After some of my latest commits (specially the one that cut down the
>> buffer sizes), mlp in general got faster, but gcc decided to vectorize
>> filter_channels() in x86_64, so the current speeds for 7.1.thd are
>> ppc ? : 9750 ms
>> x86_32: 3585 ms
>> x86_64: 2546 ms
>> x86_64: 2142 ms (-fno-tree-vectorize)
>> After 0001-mlpdec-Move-MLP-s-filter_channel-to-dsputils.patch:
>> ppc ? : 9220 ms
>> x86_32: 3046 ms
>> x86_64: 2504 ms
>> x86_64 with -fno-tree-vectorize was not measured because of a harddisk
>> failure (and I won't be able to test again for a while), but IIRC it
>> was something around 2100 ms.
>> After 0002-mlpdec-x86_-32-64-optimized-mlp_filter_channel.patch:
>> x86_32: 2951 ms
>> x86_64: 1897 ms
>> After 0003-Check-for-SSE4.patch and
>> x86_32: 2985 ms
>> The sse4 code is around the same speed as x86_32 (and slower than
>> x86_64). Unless it can be well optmized further I don't think it's
>> worth spending more time on it. Also I decided to drop the mmx2 code
>> since it was always slower. Unless someone has a good idea to optimize
>> it much more.
>> rematrix_channels() is slower when moved into dsputils, and more
>> annoying to optimize in assembly because of if (matrix_noise_shift).
>> It showed a great speedup when optimized with altivec though. (This
>> part is not done yet).
>> Ramiro Polla - looking for a way to bring a harddisk back to life
>> ?Makefile ?| ? ?4 ++--
>> ?dsputil.c | ? ?8 ++++++++
>> ?dsputil.h | ? ?6 ++++++
>> ?mlpdec.c ?| ? 34 ++++++++++------------------------
>> ?mlpdsp.c ?| ? 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> ?5 files changed, 82 insertions(+), 26 deletions(-)
>> 8a0d237b76f1e656c084d0d921fbd92a96fcf02b ?0001-mlpdec-Move-MLP-s-filter_channel-to-dsputils.patch
>> From d2cfa257f0c5a7a4b42314c23d3bdfe4be49d319 Mon Sep 17 00:00:00 2001
>> From: Ramiro Polla <ramiro at lisha.ufsc.br>
>> Date: Tue, 28 Apr 2009 23:54:49 -0300
>> Subject: [PATCH] mlpdec: Move MLP's filter_channel to dsputils.
I'll look at the rest later next week.
More information about the ffmpeg-devel