[FFmpeg-devel] [PATCH rfc] use bswap builtins where available
Måns Rullgård
mans at mansr.com
Sat Aug 15 00:55:22 CEST 2009
Alexander Strange <astrange at ithinksw.com> writes:
> gcc 4.2+ provides __builtin_bswap32/64. Since it's usually a good idea
My gcc 4.2.4 seems to be missing them.
> to use these instead of asm (they can be optimized more, don't clobber
> flags, their size is known, etc) I tried using them for bswap_32/64.
>
> The resulting binary is ~32kb smaller on x86-32; it actually has fewer
> bswap instructions (3658 vs 4072) but this is likely due to more
> optimizations.
>
> H.264 CABAC:
> old: avg 4.274 min 4.274 max 4.274 std.dev. 0.0
> new: avg 4.25 min 4.25 max 4.25 std.dev. 0.0
>
> MPEG4:
> old: avg 0.599 min 0.599 max 0.599 std.dev. 0.0
> new: avg 0.598 min 0.598 max 0.598 std.dev. 0.0
>
> Unfortunately the code for __builtin_bswap64+gcc 4.2+x86-32 is
> terrible, although fine in later versions, so it's under
> HAVE_FAST_64BIT for now.
> And there's no __builtin_bswap16; (x>>8)|(x<<8) generates rotates on
> its own even with gcc2.95, but I ended up with a slightly larger
> binary when I tried it here.
Any figures for x86-64?
> Any different numbers for other architectures?
>
>
> Index: libavutil/bswap.h
> ===================================================================
> --- libavutil/bswap.h (revision 19639)
> +++ libavutil/bswap.h (working copy)
> @@ -30,7 +30,23 @@
> #include "config.h"
> #include "common.h"
>
> -#if ARCH_ARM
> +#if AV_GCC_VERSION_AT_LEAST(4,2)
> +
> +#define bswap_32 bswap_32
> +static av_always_inline av_const uint32_t bswap_32(uint32_t x)
> +{
> + return __builtin_bswap32(x);
> +}
> +
> +#if HAVE_FAST_64BIT
> +#define bswap_64 bswap_64
> +static av_always_inline av_const uint64_t bswap_64(uint64_t x)
> +{
> + return __builtin_bswap64(x);
> +}
> +#endif
> +
> +#elif ARCH_ARM
> # include "arm/bswap.h"
> #elif ARCH_BFIN
> # include "bfin/bswap.h"
>
All else aside, this should go *after* per-arch stuff and be
conditional on the macros being undefined. The arch-specific code
needs to be able to override gcc's mess.
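
A minimal sketch of that ordering, not the actual bswap.h: the per-arch section comes first and may define bswap_32 together with a same-named macro, and the builtin is used only if the macro is still undefined afterwards. DEMO_ARCH_HAS_BSWAP32 and bswap_32_selfcheck are hypothetical names used for illustration, plain "static inline" stands in for av_always_inline av_const, and the 4.3 version check is an assumption based on 4.2.4 lacking the builtins.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a per-arch header such as arm/bswap.h. An arch with a
 * better implementation defines the function and a same-named macro.
 * DEMO_ARCH_HAS_BSWAP32 is a hypothetical switch for this sketch. */
#ifdef DEMO_ARCH_HAS_BSWAP32
#define bswap_32 bswap_32
static inline uint32_t bswap_32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0xff00) | ((x << 8) & 0xff0000) | (x << 24);
}
#endif

/* Generic section, placed after the per-arch part. It only fills in
 * what the arch code left undefined. */
#ifndef bswap_32
#define bswap_32 bswap_32
static inline uint32_t bswap_32(uint32_t x)
{
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
    /* Assumed version threshold; adjust to whatever the compiler supports. */
    return __builtin_bswap32(x);
#else
    /* Plain C fallback for compilers without the builtin. */
    return (x >> 24) | ((x >> 8) & 0xff00) | ((x << 8) & 0xff0000) | (x << 24);
#endif
}
#endif

/* Small self-check: all paths must agree on the result. */
static inline void bswap_32_selfcheck(void)
{
    assert(bswap_32(0x11223344u) == 0x44332211u);
}
```

With this layout an arch header overrides the builtin simply by defining the macro; nothing in the generic section has to know which architectures do so.
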
The builtins are useless on ARM, where gcc generates calls to
__bswapdi2 and __bswapsi2, and also fails to do anything clever with
the 16-bit case (there is a REV16 instruction).
Same thing on AVR32, Blackfin, MIPS, and SH4.
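
For reference, the 16-bit idiom from the quoted text, (x>>8)|(x<<8), written out as a function. This is only a sketch with hypothetical names (bswap_16_generic, bswap_16_selfcheck); whether a given compiler turns it into a rotate or a REV16 is not guaranteed, which is why arch-specific code may still want to override it.

```c
#include <assert.h>
#include <stdint.h>

/* Generic 16-bit byte swap. x is promoted to int before shifting, so the
 * cast back to uint16_t discards the bits shifted above bit 15. */
static inline uint16_t bswap_16_generic(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

/* Small self-check of the idiom. */
static inline void bswap_16_selfcheck(void)
{
    assert(bswap_16_generic(0x1234) == 0x3412);
}
```
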
On PPC32 it does reasonably, even for bswap64. On PPC64, it makes a
total mess of bswap64, grabbing 112 bytes of stack and calling
__bswapdi2. It should be noted that most of the bswap uses are in
conjunction with reading or writing memory, and PPC has special
byte-swapping load/store instructions which we use there.
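
To illustrate the "bswap in conjunction with memory access" pattern, here is a portable sketch of reading a 32-bit value of fixed byte order from a buffer. The names read_le32, read_be32 and read32_selfcheck are hypothetical and this is not the code FFmpeg uses; it only shows the operation that an arch-specific byte-swapping load would replace.

```c
#include <assert.h>
#include <stdint.h>

/* Read 4 bytes as a little-endian 32-bit value, independent of host order. */
static inline uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]       | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Read 4 bytes as a big-endian 32-bit value, independent of host order. */
static inline uint32_t read_be32(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/* Small self-check: the two readers differ exactly by a byte swap. */
static inline void read32_selfcheck(void)
{
    const uint8_t buf[4] = { 0x11, 0x22, 0x33, 0x44 };
    assert(read_le32(buf) == 0x44332211u);
    assert(read_be32(buf) == 0x11223344u);
}
```

On a host whose native order differs from the data, one of these is a load followed by a swap, which is the case the load/store instructions mentioned above handle in a single step.
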
--
Måns Rullgård
mans at mansr.com