[FFmpeg-devel] [PATCH v3] SH4 mpegaudio decoder optimizations

Guennadi Liakhovetski g.liakhovetski
Thu Jan 20 09:55:58 CET 2011


This has been a long delay...

On Mon, 27 Dec 2010, Michael Niedermayer wrote:

> On Sun, Dec 26, 2010 at 07:49:29PM +0100, Guennadi Liakhovetski wrote:
> > On Sun, 26 Dec 2010, Michael Niedermayer wrote:
> > 
> > > On Fri, Dec 17, 2010 at 05:54:56PM +0100, Guennadi Liakhovetski wrote:
> > > > Hi all
> > > > 
> > > > This is version 3 of the following patch:
> > > > 
> > > > This patch implements several mpegaudio optimizations for SH4 FPU-enabled 
> > > > SoCs. Verified to provide more than 40% acceleration, when decoding a 
> > > > 128kbps stereo mp3 audio file to /dev/null.
> > > > 
> > > > Changes listed in the patch header, thanks to all who commented!
> > > > 
> > > > Thanks
> > > > Guennadi
> > > > ---
> > > > Guennadi Liakhovetski, Ph.D.
> > > > Freelance Open-Source Software Developer
> > > > http://www.open-technology.de/
> > > >  mathops.h      |    2 +
> > > >  mpegaudiodec.c |   26 ++++++++++++---
> > > >  sh4/mathops.h  |   97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 120 insertions(+), 5 deletions(-)
> > > > ad802c254a9f25cf487eba7d3cb584112c6c59b9  ff_mpegaudio_sh4-v3.patch
> > > > From: Guennadi Liakhovetski <g.liakhovetski at gmx.de>
> > > > Subject: [PATCH v3] SH4 mpegaudio decoder optimizations
> > > > 
> > > > This patch implements several mpegaudio optimizations for SH4 FPU-enabled SoCs.
> > > > Verified to provide more than 40% acceleration, when decoding a 128kbps stereo
> > > > mp3 audio file to /dev/null.
> > > > ---
> > > > 
> > > > v3:
> > > > 1. added a "&" constraint to input-output assembly parameters
> > > > 
> > > > v2:
> > > > 1. fixed copyright year to 2010
> > > > 2. consistent lower-case register names
> > > > 3. joined inline assembly macros into one
> > > > 
> > > > diff --git a/libavcodec/mathops.h b/libavcodec/mathops.h
> > > > index 4d88ed1..b234ae4 100644
> > > > --- a/libavcodec/mathops.h
> > > > +++ b/libavcodec/mathops.h
> > > > @@ -34,6 +34,8 @@
> > > >  #   include "mips/mathops.h"
> > > >  #elif ARCH_PPC
> > > >  #   include "ppc/mathops.h"
> > > > +#elif ARCH_SH4
> > > > +#   include "sh4/mathops.h"
> > > >  #elif ARCH_X86
> > > >  #   include "x86/mathops.h"
> > > >  #endif
> > > > diff --git a/libavcodec/mpegaudiodec.c b/libavcodec/mpegaudiodec.c
> > > > index 769be89..513fa1b 100644
> > > > --- a/libavcodec/mpegaudiodec.c
> > > > +++ b/libavcodec/mpegaudiodec.c
> > > > @@ -584,6 +584,22 @@ static inline int round_sample(int64_t *sum)
> > > >      op(sum, (w)[7 * 64], (p)[7 * 64]);    \
> > > >  }
> > > >  
> > > > +#ifndef SUM8_MACS
> > > > +#define SUM8_MACS(sum, w, p) SUM8(MACS, sum, w, p)
> > > > +#endif
> > > > +
> > > > +#ifndef SUM8_MLSS
> > > > +#define SUM8_MLSS(sum, w, p) SUM8(MLSS, sum, w, p)
> > > > +#endif
> > > > +
> > > > +#ifndef SUM8P2_MACS_MLSS
> > > > +#define SUM8P2_MACS_MLSS(sum, sum2, w, w2, p) SUM8P2(sum, MACS, sum2, MLSS, w, w2, p)
> > > > +#endif
> > > > +
> > > > +#ifndef SUM8P2_MLSS_MLSS
> > > > +#define SUM8P2_MLSS_MLSS(sum, sum2, w, w2, p) SUM8P2(sum, MLSS, sum2, MLSS, w, w2, p)
> > > > +#endif
> > > > +
> > > >  #define SUM8P2(sum1, op1, sum2, op2, w1, w2, p) \
> > > >  {                                               \
> > > >      INTFLOAT tmp;\
> > > > @@ -666,9 +682,9 @@ static void apply_window_mp3_c(MPA_INT *synth_buf, MPA_INT *window,
> > > >  
> > > >      sum = *dither_state;
> > > >      p = synth_buf + 16;
> > > > -    SUM8(MACS, sum, w, p);
> > > > +    SUM8_MACS(sum, w, p);
> > > >      p = synth_buf + 48;
> > > > -    SUM8(MLSS, sum, w + 32, p);
> > > > +    SUM8_MLSS(sum, w + 32, p);
> > > >      *samples = round_sample(&sum);
> > > >      samples += incr;
> > > >      w++;
> > > 
> > > This is ugly
> > > You should implement the API not change it in one file and implement the
> > > modified API
> > 
> > This doesn't change any API. This just adds a couple of macros completely 
> > internal to this file, no existing users are affected.
> 
> We have SUM8() (existing API)
> You add SUM8_MACS() / SUM8_MLSS() which are a differnt API to do the same
> You optimnize your API and switch mpegaudiodec to the new API
> 
> Thus we have 2 different APIs and only one of them has your optimizations,
> this case is quite bad and not something i want to have to maintain

Ok, how about this: I copy this complete apply_window_mp3_c() function to 
an sh-specific file and use my macro as is without any further changes? I 
would prefer to keep the loop in C, and not convert it to asm as you 
suggest below. This just would produce too much asm to my taste. But now 
that will all be only inside one single arch - sh4. Would this be 
acceptable?

Thanks
Guennadi

> > > [...]
> > > > +
> > > > +#define SH_SUM8_MACS(sum, w, p)                      \
> > > > +do {                                                 \
> > > > +    register const int32_t *wp = (w), *pp = (p);     \
> > > > +    register int32_t tmp = 63 * 4;                   \
> > > > +    union {int64_t x; int32_t u32[2];} u =           \
> > > > +                    {.x = (sum),};                   \
> > > > +    __asm__ volatile(                                \
> > > > +    "    lds %[hi], mach\n"                          \
> > > > +    "    lds %[lo], macl\n"                          \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 0 */           \
> > > > +    "    add %[tmp], %[wp]\n"                        \
> > > > +    "    add %[tmp], %[pp]\n"                        \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 1 */           \
> > > > +    "    add %[tmp], %[wp]\n"                        \
> > > > +    "    add %[tmp], %[pp]\n"                        \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 2 */           \
> > > > +    "    add %[tmp], %[wp]\n"                        \
> > > > +    "    add %[tmp], %[pp]\n"                        \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 3 */           \
> > > > +    "    add %[tmp], %[wp]\n"                        \
> > > > +    "    add %[tmp], %[pp]\n"                        \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 4 */           \
> > > > +    "    add %[tmp], %[wp]\n"                        \
> > > > +    "    add %[tmp], %[pp]\n"                        \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 5 */           \
> > > > +    "    add %[tmp], %[wp]\n"                        \
> > > > +    "    add %[tmp], %[pp]\n"                        \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 6 */           \
> > > > +    "    add %[tmp], %[wp]\n"                        \
> > > > +    "    add %[tmp], %[pp]\n"                        \
> > > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 7 */           \
> > > > +    "    sts mach, %[hi]\n"                          \
> > > > +    "    sts macl, %[lo]"                            \
> > > > +    : [hi] "+&r" (u.u32[1]),                         \
> > > > +      [lo] "+&r" (u.u32[0]),                         \
> > > > +      [wp] "+&r" (wp),                               \
> > > > +      [pp] "+&r" (pp)                                \
> > > > +    : [tmp] "r" (tmp)                                \
> > > > +    : "mach", "macl");                               \
> > > > +    sum = u.x;                                       \
> > > > +} while (0)
> > > 
> > > Write the whole loop in asm, leaving it to gcc is not a good idea it can just
> > > mess up.
> > 
> > Sorry, what do you mean? how can it "just mess up?"
> 
> Compilers turn C code into asm,they arent very good at it, which is why you can
> make things 40% faster. They arent better at doing loops thus speed critical
> ones should better be written inside asm instead of C loop around asm
> 
> 
> > Actually, what loop do
> > you mean? You don't mean the "do {...} while (0)" construct, do you?
> 
> this: (iam pretty sure you can do better than gcc)
>     for(j=1;j<16;j++) {
>         sum2 = 0;
>         p = synth_buf + 16 + j;
>         SUM8P2(sum, MACS, sum2, MLSS, w, w2, p);
>         p = synth_buf + 48 - j;
>         SUM8P2(sum, MLSS, sum2, MLSS, w + 32, w2 + 32, p);
> 
>         *samples = round_sample(&sum);
>         samples += incr;
>         sum += sum2;
>         *samples2 = round_sample(&sum);
>         samples2 -= incr;
>         w++;
>         w2--;
>     }
> 
> [...]
> -- 
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> If you really think that XML is the answer, then you definitly missunderstood
> the question -- Attila Kinali
> 

---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/



More information about the ffmpeg-devel mailing list