[FFmpeg-devel] [flamefest-start] A little something on MMX/SSE intrinsics

Michael Niedermayer michaelni
Thu Feb 28 20:37:21 CET 2008

On Thu, Feb 28, 2008 at 02:44:51PM +0100, Luca Barbato wrote:
> > 
> > I guess you can always try, though, but don't do anything to
> > discourage people who know altivec from adding more. There's still a
> > lot missing from H.264.
> I know, no time to hack that stuff lately...
> That said, I think it won't hurt to have some people settle this issue
> between me (liking intrinsics for powerpc/spu development) and michael
> (disparaging them since his arch has them uglier and apparently slower)

I feel like I am talking to a brick wall. The point is that intrinsics
are flawed because they are unpredictable: gcc could generate efficient
code from them, but it can just as well (and in current versions on x86
does) generate completely dismal code. This unpredictability does not go
away if gcc becomes better at generating code.

We write asm/intrinsics because gcc did NOT compile the C code to something
efficient in at least some cases. Asm is optimized once and will then always
be efficient for the cpu class for which it has been optimized. That is, it's
a write-once-and-forget thing. Intrinsics OTOH are at the mercy of the
current compiler version and require constant maintenance to ensure that they
don't get compiled to something inefficient.

You can disagree that a 5% speed difference matters, in which case one
might get away with intrinsics.

But the key advantage asm() has IMO is that the compiler can NOT second-guess
what the programmer wanted, it can NOT reorder the instructions behind the
programmer's back and it can NOT silently put unneeded load+stores between
them. It's a fundamental difference, not something which will go away as gcc
becomes better at compiling intrinsics (if that ever happens ...).
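To make the difference concrete, here is a minimal sketch (not code from
ffmpeg) of one trivial operation, a saturating add of 16 unsigned bytes,
written once with SSE2 intrinsics and once as a GCC-style asm() block. The
function names are made up for illustration, and it assumes GCC or a
compatible compiler; on non-x86 targets both variants fall back to the
scalar reference so the file still builds.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: dst[i] = min(a[i] + b[i], 255) for 16 bytes. */
static void add_sat_u8x16_ref(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    assert(dst && a && b);
    for (int i = 0; i < 16; i++) {
        int s = a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h>

/* Intrinsics: the compiler picks the registers, the instruction order and
 * whether anything gets spilled to the stack in between. */
static void add_sat_u8x16_intrin(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
}

/* asm(): exactly these four instructions, in exactly this order.
 * The compiler only chooses the general-purpose registers for %0..%2. */
static void add_sat_u8x16_asm(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __asm__ volatile(
        "movdqu  (%1), %%xmm0    \n\t"
        "movdqu  (%2), %%xmm1    \n\t"
        "paddusb %%xmm1, %%xmm0  \n\t"
        "movdqu  %%xmm0, (%0)    \n\t"
        :
        : "r"(dst), "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory");
}
#else
/* Non-x86 fallback so the sketch compiles everywhere. */
static void add_sat_u8x16_intrin(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    add_sat_u8x16_ref(dst, a, b);
}

static void add_sat_u8x16_asm(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    add_sat_u8x16_ref(dst, a, b);
}
#endif
```

Both variants compute the same result; what differs is who decides the final
instruction sequence. Comparing the output of gcc -S for the intrinsics
version across compiler versions and flags is one way to see how much it
varies.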

Also, just in case anyone is curious about ICC performance with intrinsics:
Intel's application note about the SSE2 IDCT (AP-945 Using SSE2 to implement an
Inverse Discrete Cosine Transform) contains a plain asm and an intrinsics
version, both with benchmarks:

SSE2 ASM        0.255 microseconds
SSE2 intrinsics 0.277 microseconds

That's an 8% speed loss.
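For anyone who wants to check that figure, a small sketch of the arithmetic
follows. The helper names are made up for illustration. Depending on whether
one counts the extra time per call or the lost throughput, the two timings
above give roughly 8.6% or 7.9%, so "8%" is a fair round number either way.

```c
#include <assert.h>

/* Extra time the slower version needs, relative to the faster one:
 * (slow - fast) / fast, in percent. */
static double extra_time_percent(double fast, double slow)
{
    assert(fast > 0.0 && slow > 0.0);
    return (slow - fast) / fast * 100.0;
}

/* Throughput lost by the slower version: 1 - fast / slow, in percent. */
static double throughput_loss_percent(double fast, double slow)
{
    assert(fast > 0.0 && slow > 0.0);
    return (1.0 - fast / slow) * 100.0;
}
```

With fast = 0.255 and slow = 0.277 (microseconds, from the table above) the
first helper gives about 8.6 and the second about 7.9.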

As far as I can see, the only people supporting intrinsics either
A. can't code asm
B. never properly compared asm and intrinsics

If I am wrong, please show me an example with altivec asm which you
hand-tuned (instructions optimally selected and ordered by hand based on read
and understood datasheets for the target cpu and the final instruction
ordering selected by benchmark trial and error) and benchmark results against
the equivalent intrinsic code.
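A comparison like that needs nothing fancy. The following is a minimal
sketch of a timing harness, assuming a POSIX system with clock_gettime();
the names kernel_fn, best_ns_per_call and count_calls are hypothetical and
this is not ffmpeg's own timing code. It runs a kernel many times, repeats
that a few times, and keeps the best run to reduce scheduler noise.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Kernel under test; ctx carries whatever buffers it needs. */
typedef void (*kernel_fn)(void *ctx);

/* Returns the best (lowest) average time per call in nanoseconds
 * over `runs` runs of `calls_per_run` calls each. */
static double best_ns_per_call(kernel_fn fn, void *ctx, int runs, int calls_per_run)
{
    double best = -1.0;

    assert(fn && runs > 0 && calls_per_run > 0);
    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < calls_per_run; i++)
            fn(ctx);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        double per_call = ns / calls_per_run;
        if (best < 0.0 || per_call < best)
            best = per_call;
    }
    return best;
}

/* Trivial stand-in kernel: counts how often it was called. */
static void count_calls(void *ctx)
{
    ++*(int *)ctx;
}
```

Wrap the asm version and the intrinsics version in one kernel_fn each, feed
both the same input, and compare the two numbers. Note that the indirect call
adds a fixed overhead to both, so for very small kernels the relative
difference will look smaller than it really is.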

It seems our disagreement is not about whether intrinsics or asm are better
but about the minimum quality and performance of the code. A 5% speed loss is
not acceptable! Even much smaller speed losses need some justification.
Yes, asm is harder to write, but for that you get 5% more speed.
And code quality standards in ffmpeg are high; writing 5% slower code is
plain unacceptable.

Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No snowflake in an avalanche ever feels responsible. -- Voltaire