[FFmpeg-devel] Inline ASM vs. Intrinsics

Zuxy Meng zuxy.meng
Fri May 11 14:30:16 CEST 2007


2007/5/11, Michael Niedermayer <michaelni at gmx.at>:
> Hi
> On Fri, May 11, 2007 at 09:25:38AM +0200, Guillaume POIRIER wrote:
> [...]
> > > > My question is if they are not used because of performance or if they
> > > > are a big NoNo because of some other reason.
> > > >
> > > > I know that by using inline asm one has most control over what is going
> > > > on. However with intrinsics the code is sometimes shorter and easier to
> > > > read,
> >
> > That's true for Altivec intrinsics, but x86 intrinsics are really
> > horrible IMHO. They encode the data type in the intrinsic name rather
> > than in the vector types.
> > That means that with Altivec, you have vec_add() and vec_adds() to
> > respectively do vector add, and vector saturated add, and on x86,
> > you'd have _mm_add8(), _mm_add16(), _mm_add32(), _mm_add64(),
> > _mm_adds8(), _mm_adds16(), _mm_adds32(), _mm_adds64().
> > I think that this certainly isn't more readable, and that it's rather
> > ugly to have a "typeless" extension to a C language, which is a
> > strongly typed language.
> >
> > Of course, when you have a SIMD ISA that evolves with each new CPU
> > model, you have a harder time keeping things as clean as the Altivec
> > intrinsics.
> The whole intrinsic thing is really nothing more than a different syntax
> for asm. gcc could reorder instructions and allocate registers
> optimally for the target CPU, but in practice it fails at both, and
> hand-optimized code will generally beat what gcc generates on all CPUs.
> There's also the issue you mention, that different CPUs support different
> instruction sets (3DNow! vs. SSE2, SSE3, ...), so in the end you have to
> write the code multiple times anyway if you want it to be perfect, even
> with intrinsics ...

Two exceptions here: x86-64 builds may benefit from the additional
registers without any rewriting, and in some cases gcc can replace several
instructions with fewer but equivalent ones, e.g. SSE3's movsldup for
SSE's movaps+shufps. (I'm not sure whether gcc trunk can replace pxor, psub
and pmaxs with a single pabs when SSSE3 is available, though I guess it
should be able to.)
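
To make the pxor/psub/pmaxs versus pabs case concrete, here is a minimal
sketch in C. The helper name abs16x8 is hypothetical, and the #if branches
are only my way of showing the two instruction sequences side by side: the
SSE2 branch spells out the three-instruction form with intrinsics, and the
SSSE3 branch uses the single pabsw intrinsic. Whether a given gcc version
performs this substitution on its own is not something this sketch shows.
A scalar fallback is included so the file still builds on non-x86 targets.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_setzero_si128, _mm_sub_epi16, _mm_max_epi16 */
#endif
#if defined(__SSSE3__)
#include <tmmintrin.h>   /* SSSE3: _mm_abs_epi16 */
#endif

/* Hypothetical helper: absolute value of 8 signed 16-bit lanes.
 * src and dst must each point to at least 8 elements. */
static void abs16x8(const int16_t *src, int16_t *dst)
{
#if defined(__SSSE3__)
    /* One instruction: pabsw */
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_abs_epi16(v));
#elif defined(__SSE2__)
    /* Three instructions: pxor (zero), psubw (negate), pmaxsw (pick larger) */
    __m128i v   = _mm_loadu_si128((const __m128i *)src);
    __m128i neg = _mm_sub_epi16(_mm_setzero_si128(), v);
    _mm_storeu_si128((__m128i *)dst, _mm_max_epi16(v, neg));
#else
    /* Scalar fallback for targets without SSE2 */
    for (int i = 0; i < 8; i++)
        dst[i] = (int16_t)(src[i] < 0 ? -src[i] : src[i]);
#endif
}
```

Building the same source once with -msse2 and once with -mssse3 and
comparing the generated assembly (gcc -S) is one way to see which form
ends up in the output.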

> What gcc should rather do is analyze C code and compile it to SIMD:
> 100% portable, no silly language extensions, and gcc can generate the
> optimal code

In my experience gcc does worse at autovectorization than with
intrinsics, at least currently :-)
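
As an illustration of the two approaches being compared, here is a small
sketch of a saturating unsigned byte add written twice. The function names
add_sat_u8_c and add_sat_u8_simd are hypothetical. The first version is
plain C that a compiler may or may not autovectorize (with gcc that would
typically mean -O3 or -ftree-vectorize); the second uses the real SSE2
intrinsic _mm_adds_epu8 (paddusb), with a scalar loop for the tail and for
non-SSE2 targets. It makes no claim about which one gcc compiles better.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_adds_epu8, _mm_storeu_si128 */
#endif

/* Plain C version: a candidate for autovectorization.
 * restrict tells the compiler the buffers do not overlap. */
static void add_sat_u8_c(uint8_t *restrict dst, const uint8_t *restrict a,
                         const uint8_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);   /* clamp to 255 */
    }
}

/* Intrinsics version: 16 bytes per iteration where SSE2 is available. */
static void add_sat_u8_simd(uint8_t *dst, const uint8_t *a,
                            const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Scalar tail (and the whole loop on non-SSE2 targets) */
    for (; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}
```

Note how the intrinsic name itself carries the operation, signedness and
lane width (adds, epu8), which is the naming style criticized in the quoted
text above.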
Beauty is truth,
While truth is beauty.
PGP KeyID: E8555ED6