[Ffmpeg-devel] [PATCH] (4) building with --disable-opts on i386

Michael Niedermayer michaelni
Sun Aug 13 21:52:10 CEST 2006


Hi

On Sun, Aug 13, 2006 at 05:25:41PM +0200, Marco Manfredini wrote:
> On Sunday 13 August 2006 00:50, Michael Niedermayer wrote:
> 
> > > the optimizer
> > > should remove this.
> >
> > can you check that it really does? gcc -S should produce compiled but not
> > assembled output ...
> 
> I've written a transpose8x8 routine using transpose4x4 to study the output. 
> Assembly revealed that none of my compilers removed the temporaries. I found 
> that strange and replaced the "m" constraint with the "X" constraint and 
> found completely different code, that performed much better. Replacing the 
> input constraints of the original routine with "X" turns out to produce 
> *worser* code.
> 
> I compared the runtimes of 1 Billion transpose8x8 on in-cache data. Tests were 
> done on two Suse 10.0 installations. "Modified Routine" means the patch + "X" 
> constraints instead of "m" for in_*.
> 
> Athlon XP 2000+ 
> Original Routine: 20.59 sec (gcc-3.4.6 -O3)
> Modified Routine: 16.03 sec (gcc-3.4.6 -O3)
> Original Routine: 17.00 sec (gcc-4.0.4 -O3)
> Modified Routine: 19.75 sec (gcc-4.0.3 -O3)
> Original Routine: 16.80 sec (gcc-4.1.1 -O3)
> Modified Routine: 16.80 sec (gcc-4.1.1 -O3)
> 
> Pentium-4 (2800MHZ) with EM64T
> Original Routine: 27.48 sec (gcc-4.0.3 -O3)
> Original Routine: 20.51 sec (gcc-4.0.3 -O3 -m32)
> Modified Routine: 22.36 sec (gcc-4.0.3 -O3) (!!)
> Modified Routine: 20.80 sec (gcc-4.0.3 -m32 -O3)
> 
> Original Routine: 27.56 sec (gcc-4.1.1 -O3)
> Original Routine: 20.59 sec (gcc-4.1.1 -O3 -m32)
> Modified Routine: 22.42 sec (gcc-4.1.1 -O3) (!!)
> Modified Routine: 20.84 sec (gcc-4.1.1 -m32 -O3)
> 
> - The sentence "The optimizer should remove this" is true for the gcc-4 
> release series and even truer for the case of the P4
> - I tried to get numbers for a dual-core MacTel and Apples gcc-4.0.1, but it 
> turned out that in the context of my routine the original transpose4x4 
> suffered register starvation even *with* -O3!
> - I checked transcoding a 30 Meg 10 times with "ffmpeg -y -flags +bitexact 
> -dct fastint -idct simple -y -qscale 10 -i monty.avi -vcodec rv20 -an 
> monty.rm". This is 5% slower with the original patch and has the same 
> runtimes with the modified patch - but only on a 4.* compiler. The 3.* and 
> 2.* series perform faster with the original routine.
> 
> This is a bit discouraging. I see no way to fix this, without either risk 
> performance degradation or switching between optimized and unoptimized 
> builds.
> 
> An observation is, that the register spill does not happen, if the passed 
> values are return values from function call. This would make the following 
> pattern possible: 
> 
> #ifdef DEBUG_BUILD
> static inline uint32_t fix_uint32_t(uint32_t t) { return t; }
> #else 
> #define fix_uint32(X) X
> #endif
> 
> int src_stride){
>     asm volatile( //FIXME could save 1 instruction if done as 8x4 ...
>         "movd  %4, %%mm0                \n\t"
>         "movd  %5, %%mm1                \n\t"
>         "movd  %6, %%mm2                \n\t"
>         "movd  %7, %%mm3                \n\t"
>         "punpcklbw %%mm1, %%mm0         \n\t"
>         "punpcklbw %%mm3, %%mm2         \n\t"
>         "movq %%mm0, %%mm1              \n\t"
>         "punpcklwd %%mm2, %%mm0         \n\t"
>         "punpckhwd %%mm2, %%mm1         \n\t"
>         "movd  %%mm0, %0                \n\t"
>         "punpckhdq %%mm0, %%mm0         \n\t"
>         "movd  %%mm0, %1                \n\t"
>         "movd  %%mm1, %2                \n\t"
>         "punpckhdq %%mm1, %%mm1         \n\t"
>         "movd  %%mm1, %3                \n\t"
> 
>         : "=m" (*(uint32_t*)(dst + 0*dst_stride)),
>           "=m" (*(uint32_t*)(dst + 1*dst_stride)),
>           "=m" (*(uint32_t*)(dst + 2*dst_stride)),
>           "=m" (*(uint32_t*)(dst + 3*dst_stride))
>         :  "m" (fix_uint32(*(uint32_t*)(src + 0*src_stride))), /* "fix" input 
> if it doesn't compile in -O0 */
>            "m" (fix_uint32(*(uint32_t*)(src + 1*src_stride))),
>            "m" (fix_uint32(*(uint32_t*)(src + 2*src_stride))),
>            "m" (fix_uint32(*(uint32_t*)(src + 3*src_stride)))
>     );
> }
> 
> What do you think? 

looks funny, but ive no strong objection against it if it works and isnt
slower for recent gcc with optimizations on

some other alternative would be to rewrite the code so as to do the addresing
manually, instead of letting gcc play with it:
"r" (dst), "r"(dst_stride), "r" (dst + 2*dst_stride), 
"r" (src), "r"(src_stride), "r" (src + 2*src_stride)

and use (%0), (%0,%1), (%2), (%2,%1), ...

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In the past you could go to a library and read, borrow or copy any book
Today you'd get arrested for mere telling someone where the library is




More information about the ffmpeg-devel mailing list