[FFmpeg-devel] [PATCH] Add x86-optimized versions of exponent_min().
Fri Feb 4 01:13:39 CET 2011
On 02/03/2011 06:47 PM, Loren Merritt wrote:
> On Thu, 3 Feb 2011, Justin Ruggles wrote:
>> On 02/03/2011 12:05 AM, Loren Merritt wrote:
>>> On Wed, 2 Feb 2011, Justin Ruggles wrote:
>>>> Thanks for the suggestion. Below is a chart of the results for
>>>> adding ALIGN 8 and ALIGN 16 before each of the 2 loops.
>>>> LOOP1/LOOP2    MMX    MMX2   SSE2
>>>> NONE/NONE  :   5270   5283   2757
>>>> NONE/8     :   5200   5077   2644
>>>> NONE/16    :   5723   3961   2161
>>>> 8/NONE     :   5214   5339   2787
>>>> 8/8        :   5198*  5083   2722
>>>> 8/16       :   5936   3902   2128
>>>> 16/NONE    :   6613   4788   2580
>>>> 16/8       :   5490   3702   2020
>>>> 16/16      :   5474   3680*  2000*
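For context, the function being benchmarked takes the element-wise minimum of exponent values across a block and the blocks that reuse it, using two nested loops (the LOOP1/LOOP2 in the table). A scalar C sketch of that operation follows; the function name, prototype, and `stride` parameter here are assumptions for illustration, not FFmpeg's actual code:

```c
#include <stdint.h>

/* Sketch of the scalar operation: for each of nb_coefs coefficients,
 * take the minimum exponent across the base block and num_reuse_blocks
 * following blocks, writing the result back into the base block.
 * Blocks are assumed to be laid out `stride` bytes apart. */
static void exponent_min_c(uint8_t *exp, int num_reuse_blocks,
                           int nb_coefs, int stride)
{
    for (int blk = 1; blk <= num_reuse_blocks; blk++) { /* outer loop */
        const uint8_t *exp1 = exp + blk * stride;
        for (int i = 0; i < nb_coefs; i++)              /* inner loop */
            if (exp1[i] < exp[i])
                exp[i] = exp1[i];
    }
}
```

The SSE2 version in the patch does the same thing with `pminub` on 16 bytes at a time, which is why its timings in the table are roughly half the MMX ones.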
>>> Other things that affect instruction size/count and therefore alignment
>>> * compiling for x86_32 vs x86_64-unix vs win64
>>> * register size (d vs q as per my previous patch)
>>> * whether PIC is enabled (not relevant this time because this function
>>> doesn't use any static consts)
>> Doesn't yasm take these into account when using ALIGN?
> ALIGN computes the number of NOPs to add, in order to result in some
> address aligned by the requested amount. But that isn't necessarily
> solving the right problem. If align16 is in some cases slower than align8,
> then clearly it isn't just a case of being slow when it doesn't have
> "enough" alignment.
Indeed. I thought that was strange.
> One possible cause of such effects is that which instructions are packed
> into a 16byte aligned window affects the number of instructions that can
> be decoded at once. This applies to every instruction everywhere (if
> decoding is the bottleneck), not just at branch targets. Adding alignment
> at one place can bump some later instruction across a decoding window, and
> whether it does so depends on all of the size factors I mentioned.
Ok, that makes sense.
>>> * and sometimes not only the mod16 or mod64 alignment matters, but also
>>> the difference in memory address between this function and the rest of the
>>> code.
>>> While this isn't as bad as gcc's random code generator, don't assume
>>> that the optimum you found in one configuration will be non-pessimal in
>>> the others.
>>> If there is a single optimal place to add a single optimal number of NOPs,
>>> great. But often when I run into alignment weirdness, there is no such
>>> solution, and the best I can do is poke it with a stick until I find some
>>> combination of instructions that isn't so sensitive to alignment.
>> I don't have much to poke around with as far as using different
>> instructions in this case.
> One stick to poke with is unrolling.
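To make that suggestion concrete: unrolling the inner loop changes the instruction count and layout, which can make the loop less sensitive to where its first instruction lands relative to a 16-byte decode window. A hypothetical 4x-unrolled C version of the inner element-wise min loop (a sketch, not the patch's actual code, which is in yasm):

```c
#include <stdint.h>

/* Hypothetical 4x unroll of the inner min loop. Fewer loop-control
 * instructions per element, and a different code size, so the loop
 * body straddles decode-window boundaries differently. */
static void exponent_min_unroll4(uint8_t *exp, const uint8_t *exp1, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        if (exp1[i]     < exp[i])     exp[i]     = exp1[i];
        if (exp1[i + 1] < exp[i + 1]) exp[i + 1] = exp1[i + 1];
        if (exp1[i + 2] < exp[i + 2]) exp[i + 2] = exp1[i + 2];
        if (exp1[i + 3] < exp[i + 3]) exp[i + 3] = exp1[i + 3];
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 4 */
        if (exp1[i] < exp[i])
            exp[i] = exp1[i];
}
```

In the asm version the same idea applies with a wider register step (e.g. two or four mmx/xmm loads per iteration instead of one).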
>> So should we just accept what is an obvious bad case on one
>> configuration because there is a chance that fixing it is worse
>> in another?
> My expectation of the effect of this fix on the performance of the
> configurations you haven't benchmarked, is positive. If you don't want to
> benchmark them, I won't reject this patch on those grounds.
> I am merely saying that as long as you haven't identified the actual
> cause of the slowdowns, as long as performance is still random unto you,
> making decisions based on a thorough benchmark of only one compiler
> configuration is generalizing from one data point.
>> Even the worst case versions are 80-90% faster than the C version in the
>> tested configuration (x86_64 unix). Is it likely that the worst case
>> will be much slower in another?
> Not more than 40% slower. (Some confidence since on this question your
> benchmark counts as 24 data points, not 1.)
I can recompile with "--extra-cflags=-m32 --extra-ldflags=-m32" and add
24 more data points if you think this would be useful.