[FFmpeg-devel] Inline ASM vs. Intrinsics

Fri May 11 14:06:11 CEST 2007

Hi,

On 5/11/07, Michael Niedermayer <michaelni at gmx.at> wrote:

> On Fri, May 11, 2007 at 09:25:38AM +0200, Guillaume POIRIER wrote:
> [...]
> > > > My question is if they are not used because of performance or if they
> > > > are a big NoNo because of some other reason.
> > > >
> > > > I know that by using inline asm one has most control over what is going
> > > > on. However with intrinsics the code is sometimes shorter and easier to
> > > > read,
> >
> > That's true for Altivec intrinsics, but x86 intrinsics are really
> > horrible IMHO. They encode the type of data in the intrinsic name rather
> > than by typing vectors.
> > That means that with Altivec, you have vec_add() and vec_adds() to
> > respectively do vector add, and vector saturated add, and on x86,
> > you'd have _mm_add_epi8(), _mm_add_epi16(), _mm_add_epi32(),
> > _mm_add_epi64(), _mm_adds_epi8(), _mm_adds_epi16(), and so on.
> > I think that this certainly isn't more readable, and that it's rather
> > ugly to have a "typeless" extension to the C language, which is a
> > strongly typed language.
> >
> > Of course, when you have a SIMD ISA that evolves with each new CPU
> > model, you have a harder time doing things cleanly, as with Altivec
> > intrinsics.
>
> the whole intrinsic thing is really nothing else than a different syntax
> for asm, gcc could reorder instructions and it could allocate registers
> optimally for the target CPU but in practice it fails at both and
> hand optimized code will generally beat what gcc generated on all cpus
> also there's the issue you mention that different cpus support different
> instruction sets (3DNow! vs. SSE2, SSE3, ...) so in the end you have to
> write the code multiple times anyway if you want it to be perfect even
> with intrinsics ...

Sorry, that's not what I meant. I meant that when you have a complete
SIMD ISA like Altivec is, you can have a clean intrinsic language
extension.

What I mean by clean is that every computation (add, sub, xor, shift,
mul, madd, permutation, shuffle, unpack, pack, ...) exists for
_every_ vector type.

Let me take an example: horizontal add was introduced in SSE3 for
floats and doubles, but was only introduced with SSSE3 for 16-bit and
32-bit integers (and there is still none for chars).

With this kind of constraint, it's quite hard to design a decent
intrinsic language extension such as the one specified for Altivec.
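
To make the naming point concrete, here is a minimal sketch, assuming an
x86 target with SSE2 (a plain C fallback is compiled otherwise). The
function names add_wrap_s16() and add_sat_s16() are made up for this
illustration. Both work on the same untyped __m128i; only the intrinsic
name says "signed 16-bit lanes" and "wrap" or "saturate". With Altivec the
same two operations would be vec_add() and vec_adds() on a typed vector.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Wrapping add of 8 signed 16-bit lanes: the "_epi16" suffix carries the type. */
void add_wrap_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(va, vb));
}

/* Saturating add of the same lanes: same __m128i type, different name. */
void add_sat_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epi16(va, vb));
}
#else
/* Scalar fallback so the sketch still builds on non-SSE2 targets. */
void add_wrap_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (int16_t)(uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
}

void add_sat_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 8; i++) {
        int s = a[i] + b[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        dst[i] = (int16_t)s;
    }
}
#endif
```

Both functions expect arrays of exactly 8 int16_t values (one 128-bit
vector).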

> what gcc should rather do is analyze C code and compile it to SIMD
> 100% portable, no silly language extensions and gcc can generate the ideal
> optimal code

You mean generating SIMD code out of straight C code? Well, it's not
as if that's a simple task! Take H264 deblocking, for instance. Vectorizing
code that has conditionals is very difficult for a human; I doubt that
you can expect a compiler to do a better job there...
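
A rough sketch of why conditionals get in the way, in plain C. This is a
made-up toy filter, NOT the real H264 deblocking filter; the names
smooth_branchy() and smooth_masked() and the alpha threshold are
assumptions for illustration only. The first version branches per pixel;
the second computes the result for every pixel and selects with a mask,
which is roughly the shape the code has to take before it maps onto SIMD
compare-and-select instructions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy filter: where p[i] and q[i] differ by less than alpha,
 * replace both with their rounded average. Branch per element. */
void smooth_branchy(uint8_t *p, uint8_t *q, int n, int alpha)
{
    for (int i = 0; i < n; i++) {
        if (abs(p[i] - q[i]) < alpha) {
            uint8_t avg = (uint8_t)((p[i] + q[i] + 1) >> 1);
            p[i] = avg;
            q[i] = avg;
        }
    }
}

/* Same result without a branch: always compute the average, then
 * pick old or new value per element with an all-ones/all-zeros mask. */
void smooth_masked(uint8_t *p, uint8_t *q, int n, int alpha)
{
    for (int i = 0; i < n; i++) {
        int avg  = (p[i] + q[i] + 1) >> 1;
        int mask = -(abs(p[i] - q[i]) < alpha);   /* -1 if filtered, else 0 */
        p[i] = (uint8_t)((avg & mask) | (p[i] & ~mask));
        q[i] = (uint8_t)((avg & mask) | (q[i] & ~mask));
    }
}
```

The real filter has several nested conditions of this kind, which is
where doing the transformation by hand (or by compiler) gets painful.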

Maybe I misunderstood what you meant here.

> > I've experimented a bit with ICC-9.1 (not with GCC though), and
> > analysed the quality of the code generation. I'm pleased to say that
> > it generates really good code in general, but in some cases, it does
> > some stupid things that a human who has a tiny bit of ASM expertise
> > would never write.
> >
> > But in general, ICC did a really good job at generating code out of intrinsics.
> >
> > I don't know about GCC, but I read a paper some months ago where the
> > bleeding edge versions of GCC were able to beat ICC on synthetic
> > benchmarks. I expect that on code that has a rather large data set,
> > GCC will screw up its register allocation, where ICC should do better.
>
> one problem with intrinsics also is that if the compiler screws up you
> have to rewrite the code to asm,

Yep. I'd like to note though that, inline ASM syntax being so
horrible, I bet many people just never bothered to write any SIMD code
because they hit their "brain wall" ;-)
Intrinsics have the advantage that you don't have to think about too
many things at once to get them to work (register allocation,
scheduling, defining the inputs/outputs of your inline asm block,
defining the clobbers...).
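
For comparison, a minimal sketch of the same 8 x 16-bit add written both
ways, assuming GCC (or a compatible compiler) and an x86 target with SSE2;
a scalar fallback is compiled otherwise. The function names
add8x16_intrin() and add8x16_asm() are hypothetical. In the asm version
the registers, the operand list and the clobber list are all chosen by
hand; in the intrinsic version the compiler handles them.

```c
#include <stdint.h>

#if defined(__GNUC__) && defined(__SSE2__)
#include <emmintrin.h>

/* Intrinsics: register allocation and scheduling are left to the compiler. */
void add8x16_intrin(int16_t *dst, const int16_t *a, const int16_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(va, vb));
}

/* GCC extended inline asm (AT&T syntax): registers are picked by hand,
 * and the inputs and clobbers must be declared explicitly. */
void add8x16_asm(int16_t *dst, const int16_t *a, const int16_t *b)
{
    __asm__ volatile(
        "movdqu (%1), %%xmm0    \n\t"   /* load 8 words from a        */
        "movdqu (%2), %%xmm1    \n\t"   /* load 8 words from b        */
        "paddw  %%xmm1, %%xmm0  \n\t"   /* xmm0 += xmm1, per 16 bits  */
        "movdqu %%xmm0, (%0)    \n\t"   /* store 8 words to dst       */
        :                                /* no register outputs        */
        : "r"(dst), "r"(a), "r"(b)       /* inputs: three pointers     */
        : "xmm0", "xmm1", "memory");     /* clobbers                   */
}
#else
/* Scalar fallback so the sketch still builds elsewhere. */
void add8x16_intrin(int16_t *dst, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (int16_t)(uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
}

void add8x16_asm(int16_t *dst, const int16_t *a, const int16_t *b)
{
    add8x16_intrin(dst, a, b);
}
#endif
```

Forgetting the "memory" clobber or one of the xmm registers here would
still compile, which is exactly the kind of thing you have to keep in
your head with inline asm.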

In the end, of course, inline ASM beats the pants off intrinsics, but you
have to consider that intrinsics are faster and easier to write and
read (better maintainability).
Inline ASM had to be altered for GCC4 support, for MacIntel support, etc.

I don't recall intrinsics requiring that much maintenance effort...

> there's no working way to give it hints
> which variables should be when in a register, it's the same with c code

Exactly. I wrongly assumed that the "register" keyword was honored
with xmm/mm intrinsics, but it's simply ignored by ICC. I
don't know about GCC.

I don't mean to advocate for or against intrinsics;
I just mean to point out that when you have a well thought-out ISA
(Altivec, 3-operand), with a decent number of registers (32), you can
have clean intrinsics and efficient code generation.

When you have an ISA that evolves with each new generation of CPU, few
registers, a 2-operand ISA, and intrinsic names that suck, there are far
fewer reasons to use intrinsics when you want to optimize code for your
hot spot.

Again, of course inline ASM allows for faster code, but it _does_
have downsides.

Guillaume
-- 
Rich, you're forgetting one thing here: *everybody* except you is
stupid.
    Måns Rullgård