[FFmpeg-devel] [PATCH] h264: assembly version of get_cabac for x86_64 with PIC
michaelni at gmx.at
Fri Apr 13 10:34:08 CEST 2012
On Fri, Apr 13, 2012 at 05:13:46AM +0000, Loren Merritt wrote:
> On Fri, 13 Apr 2012, Roland Scheidegger wrote:
> > This adds a hand-optimized assembly version for get_cabac much like the
> > existing one, but it works if the table offsets are RIP-relative.
> > Compared to the non-RIP-relative version this adds 4 instructions
> > (3 RIP-relative movs, 1 lea) and needs one extra register, two of the
> > RIP-relative movs could get eliminated by using a single table and using offsets
> > instead.
> > Since x86_64 cpus always support cmov also always use this (I don't care
> > if you have a P4 Prescott whose cmov implementation is useless).
> > There is a surprisingly large performance improvement over the c version (more
> > so than the generated assembly seems to suggest) just in get_cabac, I measured
> > roughly 40% faster for get_cabac on a K8.
> > There are similar functions which could get the same treatment but they
> > are used less frequently, and since this isn't very nice (we can't use the
> > same assembly template), focus on this function alone for now.
> > mov ff_h264_lps_range@GOTPCREL(%%rip), "tmp2q"
> > movzbl ("tmp2q", %%rcx), "range"
> > mov ff_h264_norm_shift@GOTPCREL(%%rip), "tmp2q"
> > movzbl ("tmp2q", "rangeq"), %%ecx
> > mov ff_h264_mlps_state@GOTPCREL(%%rip), "tmpq"
> > movzbl 128("tmpq", "retq"), "tmp"
> @GOTPCREL isn't actually necessary unless you want the application to be
> able to override those symbols (which we don't).
> lea ff_h264_lps_range(%%rip), "tmp2q"
> movzbl ("tmp2q", %%rcx), "range"
> movzbl ff_h264_norm_shift-ff_h264_lps_range("tmp2q", "rangeq"), %%ecx
> movzbl ff_h264_mlps_state-ff_h264_lps_range+128("tmpq", "retq"), "tmp"
> ...Which fails to compile. Well, you can do something like that in yasm,
> but I don't know how to subtract one symbol from another in inline asm.
If the symbols are defined in a single yasm file/object then it can
get the difference, otherwise I don't see how, but I might be missing
something.
The same would work with an inline asm block, that is, having all the
tables in a single one; then their relative addresses are constant
and can be hardcoded (as literal numbers).
Or all the data could be put in a single array in C, in which case the
relative locations would be known too.
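A minimal sketch of the single-array idea, in portable C. The table sizes, contents, and every name below (combined_tables_sketch, the *_OFFSET constants, lookup_sketch) are made up for illustration and are not the real CABAC tables; the point is only that once everything lives in one object, each sub-table's offset is a compile-time constant that could be written into an asm displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: three small sub-tables packed back to back.
 * Real tables would be much larger; sizes here are illustrative. */
enum {
    LPS_RANGE_OFFSET  = 0,
    LPS_RANGE_SIZE    = 4,
    NORM_SHIFT_OFFSET = LPS_RANGE_OFFSET + LPS_RANGE_SIZE,
    NORM_SHIFT_SIZE   = 4,
    MLPS_STATE_OFFSET = NORM_SHIFT_OFFSET + NORM_SHIFT_SIZE,
    MLPS_STATE_SIZE   = 4,
    TABLES_SIZE       = MLPS_STATE_OFFSET + MLPS_STATE_SIZE
};

/* One object, so only one base address ever needs to be loaded. */
static const uint8_t combined_tables_sketch[TABLES_SIZE] = {
    10, 11, 12, 13,   /* stand-in for the lps_range data  */
    20, 21, 22, 23,   /* stand-in for the norm_shift data */
    30, 31, 32, 33,   /* stand-in for the mlps_state data */
};

/* Index a sub-table through the shared base plus a constant offset. */
static uint8_t lookup_sketch(int table_offset, int idx)
{
    assert(table_offset >= 0 && idx >= 0 && table_offset + idx < TABLES_SIZE);
    return combined_tables_sketch[table_offset + idx];
}
```

With this layout, lookup_sketch(NORM_SHIFT_OFFSET, 1) reads the second entry of the second sub-table, and the asm equivalent would need a single lea for the base with the offsets appearing as plain displacements.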
In theory one could calculate the offset in C and pass this as an
argument to the asm, but gcc isn't smart enough to create sane C code.
None of these are pretty; not sure what is best.
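As a rough illustration of handing a C-computed offset to inline asm, here is a sketch that assumes the tables are members of one struct, so that offsetof() yields a constant usable with the "i" constraint. This is a variation on the options above, not something proposed in the thread; struct tables_sketch, its contents, and load_norm_shift_sketch are hypothetical, and the asm path is only compiled for GCC-compatible compilers targeting LP64 x86_64, with a plain C fallback elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: both tables in one object, made-up values. */
struct tables_sketch {
    uint8_t lps_range[4];
    uint8_t norm_shift[4];
};

static const struct tables_sketch tables = {
    { 10, 11, 12, 13 },
    { 20, 21, 22, 23 },
};

/* Distance from the start of the object to norm_shift, known at compile time. */
#define NORM_SHIFT_DELTA offsetof(struct tables_sketch, norm_shift)

/* Load norm_shift[idx] using a base pointer to the start of the object. */
static unsigned load_norm_shift_sketch(size_t idx)
{
    const uint8_t *base = (const uint8_t *)&tables;

    assert(idx < sizeof(tables.norm_shift));
#if defined(__GNUC__) && defined(__x86_64__) && defined(__LP64__)
    unsigned out;
    /* %c prints the constant without the '$' prefix, so it can be used
     * as a displacement: movzbl delta(base, idx), out */
    __asm__ ("movzbl %c[delta](%[base], %[idx]), %[out]"
             : [out] "=r" (out)
             : [base] "r" (base), [idx] "r" (idx),
               [delta] "i" (NORM_SHIFT_DELTA),
               "m" (tables));
    return out;
#else
    /* Portable fallback doing the same address computation in C. */
    return base[NORM_SHIFT_DELTA + idx];
#endif
}
```

Because the offset is an immediate here, nothing is computed at run time; with two separate global arrays the difference is not a constant expression in C, which is where the awkward generated code would come from.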
Michael
GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
When you are offended at any man's fault, turn to yourself and study your
own failings. Then you will forget your anger. -- Epictetus