[FFmpeg-devel] [PATCH] Move MLP's dot product to DSPContext

Fri May 15 17:39:29 CEST 2009

On Fri, May 15, 2009 at 12:11 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Wed, May 13, 2009 at 05:03:03PM -0300, Ramiro Polla wrote:
>> Hi,
>>
>> On Wed, Apr 29, 2009 at 9:58 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Wed, Apr 29, 2009 at 01:15:14AM -0300, Ramiro Polla wrote:
>> [...]
>> >> +void ff_mlp_filter_channel_x86_64(int32_t *firbuf, const int32_t *fircoeff, int firorder,
>> >> +                                  int32_t *iirbuf, const int32_t *iircoeff, int iirorder,
>> >> +                                  unsigned int filter_shift, int32_t mask, int blocksize,
>> >> +                                  int32_t *sample_buffer)
>> >> +{
>> >> +    void *firjump = ff_mlp_firtable_x86_64[firorder];
>> >> +    void *iirjump = ff_mlp_iirtable_x86_64[iirorder];
>> >> +
>> >> +    blocksize = -blocksize;
>> >> +
>> >> +    __asm__ volatile(
>> >> +        "1:                        \n\t"
>> >> +        "xor     %%rsi      , %%rsi\n\t"
>> >> +        "jmp    *%[firjump]        \n\t"
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x1c, ff_mlp_firorder_x86_64_8)
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x18, ff_mlp_firorder_x86_64_7)
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x14, ff_mlp_firorder_x86_64_6)
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x10, ff_mlp_firorder_x86_64_5)
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x0c, ff_mlp_firorder_x86_64_4)
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x08, ff_mlp_firorder_x86_64_3)
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x04, ff_mlp_firorder_x86_64_2)
>> >> +        MUL64("%[firbuf]", "%[fircoeff]", 0x00, ff_mlp_firorder_x86_64_1)
>> >> +        MANGLE(ff_mlp_firorder_x86_64_0)":\n\t"
>> >> +        "jmp    *%[iirjump]        \n\t"
>> >> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x0c, ff_mlp_iirorder_x86_64_4)
>> >> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x08, ff_mlp_iirorder_x86_64_3)
>> >> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x04, ff_mlp_iirorder_x86_64_2)
>> >> +        MUL64("%[iirbuf]", "%[iircoeff]", 0x00, ff_mlp_iirorder_x86_64_1)
>> >
>> > you probably could put some of the coeffs in registers
>>
>> Added the 3 first FIR coeffs until gcc started complaining that there
>> were no more free regs.
>>
>> >> +        MANGLE(ff_mlp_iirorder_x86_64_0)":\n\t"
>> >
>> >> +        "mov     %%rsi      , %%rax\n\t"
>> >
>> > useless
>>
>> Removed.
>>
>> >> +        "shr     %%cl       , %%rax\n\t"
>> >> +
>> >> +        "mov     %%rax      , %%rdx\n\t"
>> >> +        "add    (%[sample]) , %%rax\n\t"
>> >> +        "and     %[mask]    , %%rax\n\t"
>> >> +        "sub              $4,  %[firbuf]\n\t"
>> >> +        "sub              $4,  %[iirbuf]\n\t"
>> >
>> > these 2 buffers can apparently be merged simplifying addressing
>>
>> Merged, and coeffs too.
>>
>> >> +        "mov     %%eax      , (%[firbuf])\n\t"
>> >> +        "mov     %%eax      , (%[sample])\n\t"
>> >
>> > this looks mildly redundant ...
>>
>> I tried removing firbuf and instead using *sample directly, but this
>> led to slower code.
>>
>> I also tried switching sample_buffer from
>> [MAX_BLOCKSIZE][MAX_CHANNELS] to [MAX_CHANNELS][MAX_BLOCKSIZE] so that
>> I could access the members more closely, but this also led to slower
>> code overall.
>>
>> I renamed the MUL macros as per Mans' suggestion, and reworked most of
>> the asm code (32-bit now keeps some pointers in registers and is
>> much faster). I also removed the attempt to manually schedule MUL32
>> because it led to uglier code and Dark_Shikari suggested it wouldn't
>> do much good because of out-of-order execution anyways.
>>
>> Order of patches:
>> include_mlp_h.diff
>> join_states_coeffs.diff
>> x86_filter.diff
>>
>> speedup:
>> 32-bit: 12.59%
>> 64-bit:  9.98%
>>
>> I haven't pursued sse4 anymore because the x86_32 code is very close
>> in speed, and I have other work to do.
>>
>> Ramiro Polla
>
>>  mlpdsp.c |    3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> ea3cc210c99e4f980a38ff26342846adae7e7dd6  include_mlp_h.diff
>
> ok

Applied.

> [...]
>
>>  dsputil.h |    4 ++--
>>  mlp.h     |    2 +-
>>  mlpdec.c  |   15 ++++++++-------
>>  mlpdsp.c  |    8 ++++++--
>>  4 files changed, 17 insertions(+), 12 deletions(-)
>> b4a586612c90d2e5430ac416f16dbaa12f282383  join_states_coeffs.diff
>
> ok

Applied.
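
For anyone following the thread without the asm in their head, the loop
quoted above can be sketched in plain C roughly as follows. This is only a
reading of the quoted fragment, not the committed code. The names
dot_by_order, mlp_filter_channel_ref and the sample_stride parameter are
made up for illustration (sample_stride stands in for the MAX_CHANNELS
stride of sample_buffer). The IIR state update (result minus the shifted
accumulator) is an assumption, since that part of the asm is not quoted.
The switch with fall-through plays the role of the jump into the unrolled
MUL64 chain: order N enters at tap N-1 and runs down to tap 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: dot product over `order` taps (0..8).
 * Entering the switch at `order` and falling through mirrors the
 * "jmp *%[firjump]" into the unrolled MUL64 chain of the quoted asm. */
int64_t dot_by_order(const int32_t *buf, const int32_t *coeff, int order)
{
    int64_t accum = 0;

    assert(order >= 0 && order <= 8);
    switch (order) {
    case 8: accum += (int64_t)buf[7] * coeff[7]; /* fall through */
    case 7: accum += (int64_t)buf[6] * coeff[6]; /* fall through */
    case 6: accum += (int64_t)buf[5] * coeff[5]; /* fall through */
    case 5: accum += (int64_t)buf[4] * coeff[4]; /* fall through */
    case 4: accum += (int64_t)buf[3] * coeff[3]; /* fall through */
    case 3: accum += (int64_t)buf[2] * coeff[2]; /* fall through */
    case 2: accum += (int64_t)buf[1] * coeff[1]; /* fall through */
    case 1: accum += (int64_t)buf[0] * coeff[0]; /* fall through */
    default: break;
    }
    return accum;
}

/* Hypothetical scalar reference for one channel. firbuf and iirbuf point
 * at the newest state entry and need room for `blocksize` entries below
 * them, because each new sample is pushed with a pre-decrement. */
void mlp_filter_channel_ref(int32_t *firbuf, const int32_t *fircoeff, int firorder,
                            int32_t *iirbuf, const int32_t *iircoeff, int iirorder,
                            unsigned int filter_shift, int32_t mask,
                            int blocksize, int32_t *sample_buffer,
                            int sample_stride)
{
    assert(iirorder >= 0 && iirorder <= 4 && filter_shift < 32);

    for (int i = 0; i < blocksize; i++) {
        int64_t accum = dot_by_order(firbuf, fircoeff, firorder)
                      + dot_by_order(iirbuf, iircoeff, iirorder);
        int32_t result;

        /* Note: >> on a negative signed value is implementation-defined in C. */
        accum >>= filter_shift;
        result  = (int32_t)((accum + *sample_buffer) & mask);

        *--firbuf = result;                    /* "sub $4, %[firbuf]" + store */
        *--iirbuf = (int32_t)(result - accum); /* assumption, not in the quote */

        *sample_buffer = result;
        sample_buffer += sample_stride;
    }
}
```

With a single FIR tap of 2, filter_shift of 1 and no IIR taps, the
prediction is simply the previous output, so a block of residuals 1, 1, 1
starting from zero state decodes to 1, 2, 3.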