[FFmpeg-devel] [FFmpeg-cvslog] r12171 - trunk/doc/optimization.txt

Thu Feb 21 20:28:10 CET 2008

On Thu, Feb 21, 2008 at 09:16:39PM +0200, İsmail Dönmez wrote:
> Hi,
> 
> On Thu, Feb 21, 2008 at 9:11 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > On Thu, Feb 21, 2008 at 08:52:17PM +0200, İsmail Dönmez wrote:
> >  > Hi,
> >  >
> >  > >Author: melanson
> >  > >Date: Thu Feb 21 19:46:49 2008
> >  > >New Revision: 12171
> >  > >
> >  > >Log:
> >  > >minor English corrections
> >  > >
> >  > >
> >  > >Modified:
> >  > >  trunk/doc/optimization.txt
> >  > [...]
> >  > >  -Use asm() instead of intrinsics. Later requires a good optimizing compiler
> >  > >  +Use asm() instead of intrinsics. The latter requires a good optimizing compiler
> >  > >   which gcc is not.
> >  >
> >  > We all know this is FUD now. I know Michael still uses gcc 2.95, but
> >  > the world has moved on. GCC 4.3 is about to be released.
> >  > So please either back up these claims or note that this is not true for
> >  > recent GCCs.
> >
> >  I use gcc r132072 ATM. I admit it's a few days old; do you claim that gcc
> >  was rewritten yesterday?
> >
> >  Also, to back up the claim, the following was suggested to me a few days ago:
> >  -static inline void diff_pixels_mmx(DCTELEM *block, const uint8_t *s1, const uint8_t *s2, int stride)
> >  +static void diff_pixels_mmx(DCTELEM *block, const uint8_t *s1, const uint8_t *s2, long stride)
> >   {
> >  -    asm volatile(
> >  -        "pxor %%mm7, %%mm7              \n\t"
> >  -        "mov $-128, %%"REG_a"           \n\t"
> >  -        ASMALIGN(4)
> >  -        "1:                             \n\t"
> >  -        "movq (%0), %%mm0               \n\t"
> >  -        "movq (%1), %%mm2               \n\t"
> >  -        "movq %%mm0, %%mm1              \n\t"
> >  -        "movq %%mm2, %%mm3              \n\t"
> >  -        "punpcklbw %%mm7, %%mm0         \n\t"
> >  -        "punpckhbw %%mm7, %%mm1         \n\t"
> >  -        "punpcklbw %%mm7, %%mm2         \n\t"
> >  -        "punpckhbw %%mm7, %%mm3         \n\t"
> >  -        "psubw %%mm2, %%mm0             \n\t"
> >  -        "psubw %%mm3, %%mm1             \n\t"
> >  -        "movq %%mm0, (%2, %%"REG_a")    \n\t"
> >  -        "movq %%mm1, 8(%2, %%"REG_a")   \n\t"
> >  -        "add %3, %0                     \n\t"
> >  -        "add %3, %1                     \n\t"
> >  -        "add $16, %%"REG_a"             \n\t"
> >  -        "jnz 1b                         \n\t"
> >  -        : "+r" (s1), "+r" (s2)
> >  -        : "r" (block+64), "r" ((long)stride)
> >  -        : "%"REG_a
> >  -    );
> >  +    long offset = -128;
> >  +    MOVQ_ZERO(mm7);
> >  +    do {
> >  +        asm volatile(
> >  +            "movq (%0), %%mm0         \n\t"
> >  +            "movq (%1), %%mm2         \n\t"
> >  +            "movq %%mm0, %%mm1        \n\t"
> >  +            "movq %%mm2, %%mm3        \n\t"
> >  +            "punpcklbw %%mm7, %%mm0   \n\t"
> >  +            "punpckhbw %%mm7, %%mm1   \n\t"
> >  +            "punpcklbw %%mm7, %%mm2   \n\t"
> >  +            "punpckhbw %%mm7, %%mm3   \n\t"
> >  +            "psubw %%mm2, %%mm0       \n\t"
> >  +            "psubw %%mm3, %%mm1       \n\t"
> >  +            "movq %%mm0, (%2, %4)     \n\t"
> >  +            "movq %%mm1, 8(%2, %4)    \n\t"
> >  +            : : "r" (s1), "r" (s2), "r" (block+64), "r" (stride), "r" (offset)
> >  +            : "memory");
> >  +        s1 += stride;
> >  +        s2 += stride;
> >  +        offset += 16;
> >  +    } while (offset < 0);
> >   }
> >
> >  the effect that has on the generated asm is:
> >  .L143:
> >         .loc 3 241 0
> >         leaq    (%rsi,%r8), %rdx
> >         leaq    (%r10,%r8), %rax
> >  #APP
> >  # 241 "dsputil_mmx.c" 1
> >         movq (%rdx), %mm0
> >         movq (%rax), %mm2
> >         movq %mm0, %mm1
> >         movq %mm2, %mm3
> >         punpcklbw %mm7, %mm0
> >         punpckhbw %mm7, %mm1
> >         punpcklbw %mm7, %mm2
> >         punpckhbw %mm7, %mm3
> >         psubw %mm2, %mm0
> >         psubw %mm3, %mm1
> >         movq %mm0, (%rdi, %r9)
> >         movq %mm1, 8(%rdi, %r9)
> >
> >  # 0 "" 2
> >         .loc 3 258 0
> >  #NO_APP
> >         addq    %rcx, %r8
> >         .loc 3 259 0
> >         addq    $16, %r9
> >         jne     .L143
> >  -------------
> >
> >  As you can see, gcc injects 2 unneeded lea instructions in the innermost loop.
> >  And I think this is very simple asm; if you want, you can try this with some
> >  complex code, but I recommend that you have a few bags for vomit ready ...
> 
> If you can give an example based on complex asm, we can report a bug to
> gcc. Just saying gcc is not a good optimizer
> does not help anyone. Do we have another, better open source compiler?
> No. So if you have a better example of bad asm being produced, we can ask
> the gcc developers.

I'll mail the next case I stumble across to you, but as I don't convert
asm to intrinsics or do-asm-while, I probably won't stumble across one
soon.
Also, you can just keep your eyes open; various asm snippets tend to be
posted once every few weeks ...
Reimar did post a ridiculous one just a few days ago, which had alignments
in the wrong places; it wasn't the latest gcc, though ...
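
For readers following the quoted patch, here is a rough scalar sketch of what
both asm versions appear to compute: an 8x8 block of 16-bit differences
s1[x] - s2[x]. It also mirrors the loop shape, where a byte offset runs from
-128 up to 0 relative to block+64 so the loop can end on a zero test. The
names diff_pixels_ref, diff_pixels_negoff and diff_pixels_selfcheck are
hypothetical, and the int16_t typedef for DCTELEM is an assumption inferred
from the 16-byte row stores, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int16_t DCTELEM; /* assumption: 16-bit elements, 8 per 16-byte row */

/* Plain reference: block[8*y + x] = s1[x] - s2[x] over an 8x8 region. */
static void diff_pixels_ref(DCTELEM *block, const uint8_t *s1,
                            const uint8_t *s2, long stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            block[8 * y + x] = (DCTELEM)(s1[x] - s2[x]);
        s1 += stride;
        s2 += stride;
    }
}

/* Same result with the loop shape of the asm: the byte offset starts at
 * -128 and counts up by 16 until it reaches 0, addressed from block + 64. */
static void diff_pixels_negoff(DCTELEM *block, const uint8_t *s1,
                               const uint8_t *s2, long stride)
{
    DCTELEM *end = block + 64;
    long offset = -128; /* in bytes, as in the asm */
    do {
        DCTELEM *row = end + offset / (long)sizeof(DCTELEM);
        for (int x = 0; x < 8; x++)
            row[x] = (DCTELEM)(s1[x] - s2[x]);
        s1 += stride;
        s2 += stride;
        offset += 16;
    } while (offset < 0);
}

/* Hypothetical helper: returns 1 if both variants agree on test data. */
static int diff_pixels_selfcheck(void)
{
    enum { STRIDE = 16 };
    uint8_t a[STRIDE * 8], b[STRIDE * 8];
    DCTELEM out_ref[64], out_neg[64];

    assert(sizeof(DCTELEM) == 2); /* the 16-byte row step relies on this */

    for (int i = 0; i < STRIDE * 8; i++) {
        a[i] = (uint8_t)(i * 7 + 3);
        b[i] = (uint8_t)(i * 13 + 250);
    }
    diff_pixels_ref(out_ref, a, b, STRIDE);
    diff_pixels_negoff(out_neg, a, b, STRIDE);

    return memcmp(out_ref, out_neg, sizeof(out_ref)) == 0 &&
           out_ref[0] == (DCTELEM)(a[0] - b[0]);
}
```

This only illustrates the arithmetic and the counting-up-to-zero loop; it says
nothing about what code any particular compiler generates for it.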

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Why not whip the teacher when the pupil misbehaves? -- Diogenes of Sinope