[FFmpeg-devel] [PATCH rfc] use bswap builtins where available

Måns Rullgård mans
Sat Aug 15 00:55:22 CEST 2009


Alexander Strange <astrange at ithinksw.com> writes:

> gcc 4.2+ provides __builtin_bswap32/64. Since it's usually a good idea

My gcc 4.2.4 seems to be missing them.

> to use these instead of asm (they can be optimized more, don't clobber
> flags, their size is known, etc) I tried using them for bswap_32/64.
>
> The resulting binary is ~32kb smaller on x86-32; it actually has fewer
> bswap instructions (3658 vs 4072) but this is likely due to more
> optimizations.
>
> H.264 CABAC:
> old: avg 4.274 min 4.274 max 4.274 std.dev. 0.0
> new: avg 4.25 min 4.25 max 4.25 std.dev. 0.0
>
> MPEG4:
> old: avg 0.599 min 0.599 max 0.599 std.dev. 0.0
> new: avg 0.598 min 0.598 max 0.598 std.dev. 0.0
>
> Unfortunately the code for __builtin_bswap64+gcc 4.2+x86-32 is
> terrible, although fine in later versions, so it's under
> HAVE_FAST_64BIT for now.
> And there's no __builtin_bswap16; (x>>8)|(x<<8) generates rotates on
> its own even with gcc2.95, but I ended up with a slightly larger
> binary when I tried it here.

Any figures for x86-64?

> Any different numbers for other architectures?
>
>
> Index: libavutil/bswap.h
> ===================================================================
> --- libavutil/bswap.h	(revision 19639)
> +++ libavutil/bswap.h	(working copy)
> @@ -30,7 +30,23 @@
>  #include "config.h"
>  #include "common.h"
>  
> -#if   ARCH_ARM
> +#if   AV_GCC_VERSION_AT_LEAST(4,2)
> +
> +#define bswap_32 bswap_32
> +static av_always_inline av_const uint32_t bswap_32(uint32_t x)
> +{
> +    return __builtin_bswap32(x);
> +}
> +
> +#if HAVE_FAST_64BIT
> +#define bswap_64 bswap_64
> +static av_always_inline av_const uint64_t bswap_64(uint64_t x)
> +{
> +    return __builtin_bswap64(x);
> +}
> +#endif
> +
> +#elif ARCH_ARM
>  #   include "arm/bswap.h"
>  #elif ARCH_BFIN
>  #   include "bfin/bswap.h"
>

All else aside, this should go *after* per-arch stuff and be
conditional on the macros being undefined.  The arch-specific code
needs to be able to override gcc's mess.
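
To illustrate the ordering, here is a minimal sketch and not a tested
patch. It assumes the per-arch headers keep the existing
"#define bswap_32 bswap_32" convention, so the generic code only fills in
whatever is still undefined. The version check is spelled out by hand to
keep the example self-contained, and the 4.3 threshold is an assumption
based on 4.2.4 apparently lacking the builtins. Building bswap_64 from two
32-bit swaps is likewise just one possible fallback.

```c
#include <stdint.h>

/* Per-arch headers (arm/bswap.h, bfin/bswap.h, ...) would be included
 * here first.  An arch with a good implementation defines the function
 * and a macro of the same name, which suppresses the fallback below. */

#ifndef bswap_32
#define bswap_32 bswap_32
static inline uint32_t bswap_32(uint32_t x)
{
    /* Assumed threshold: builtins available from gcc 4.3 onwards. */
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
    return __builtin_bswap32(x);
#else
    /* Plain C fallback for compilers without the builtin. */
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
#endif
}
#endif

#ifndef bswap_64
#define bswap_64 bswap_64
static inline uint64_t bswap_64(uint64_t x)
{
    /* Swap each half with whichever bswap_32 was selected above,
     * then exchange the halves. */
    return ((uint64_t)bswap_32((uint32_t)x) << 32) |
            (uint64_t)bswap_32((uint32_t)(x >> 32));
}
#endif
```

With this layout an arch header that defines only bswap_32 still gets a
generic bswap_64 built on top of its own 32-bit version.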

The builtins are useless on ARM, where gcc generates calls to
__bswapdi2 and __bswapsi2, and also fails to do anything clever with
the 16-bit case (there is a REV16 instruction).

Same thing on AVR32, Blackfin, MIPS, and SH4.

On PPC32 it does reasonably well, even for bswap64.  On PPC64, it makes a
total mess of bswap64, grabbing 112 bytes of stack and calling
__bswapdi2.  It should be noted that most of the bswap uses are in
conjunction with reading or writing memory, and PPC has special
byte-swapping load/store instructions which we use there.
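
As a rough illustration of why the swap usually belongs with the memory
access, here is a portable sketch of fixed-endian 32-bit reads.  The names
read_be32 and read_le32 are hypothetical and are not the accessors we
actually use.  Written this way the byte order is a property of the load
itself, so an arch-specific version is free to replace the whole function
with a single byte-reversed load where the hardware has one.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit big-endian value from memory,
 * independent of host byte order and alignment. */
static inline uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/* Hypothetical helper: read a 32-bit little-endian value from memory. */
static inline uint32_t read_le32(const uint8_t *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] <<  8) |  (uint32_t)p[0];
}
```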

-- 
Måns Rullgård
mans at mansr.com


