[FFmpeg-devel] [PATCH] Add x86-optimized versions of exponent_min().

Fri Feb 4 00:47:12 CET 2011

On Thu, 3 Feb 2011, Justin Ruggles wrote:
> On 02/03/2011 12:05 AM, Loren Merritt wrote:
>> On Wed, 2 Feb 2011, Justin Ruggles wrote:
>>>
>>> Thanks for the suggestion.  Below is a chart of the results for
>>> adding ALIGN 8 and ALIGN 16 before each of the 2 loops.
>>>
>>> LOOP1/LOOP2   MMX   MMX2   SSE2
>>> -------------------------------
>>> NONE/NONE :  5270   5283   2757
>>>    NONE/8 :  5200   5077   2644
>>>   NONE/16 :  5723   3961   2161
>>>    8/NONE :  5214   5339   2787
>>>       8/8 :  5198*  5083   2722
>>>      8/16 :  5936   3902   2128
>>>   16/NONE :  6613   4788   2580
>>>      16/8 :  5490   3702   2020
>>>     16/16 :  5474   3680*  2000*
>>
>> Other things that affect instruction size/count and therefore alignment
>> include:
>> * compiling for x86_32 vs x86_64-unix vs win64
>> * register size (d vs q as per my previous patch)
>> * whether PIC is enabled (not relevant this time because this function
>> doesn't use any static consts)
>
> Doesn't yasm take these into account when using ALIGN?

ALIGN computes the number of NOPs to add in order to reach an address
aligned to the requested amount. But that isn't necessarily solving the
right problem. If align16 is in some cases slower than align8, then
clearly it isn't just a case of being slow when it doesn't have "enough"
alignment.
One possible cause of such effects is that the set of instructions packed
into each 16-byte aligned window affects how many instructions can be
decoded at once. This applies to every instruction everywhere (if
decoding is the bottleneck), not just at branch targets. Adding alignment
at one place can bump some later instruction across a decoding window, and
whether it does so depends on all of the size factors I mentioned.
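
To make the window effect concrete, here is a minimal sketch of a toy
model, not of any real decoder. The instruction lengths and the start
address are made up for illustration and are not taken from the patch.
It computes the padding an ALIGN directive would insert, then lists which
instructions end up straddling a 16-byte window boundary for each choice
of alignment.

```python
def align_padding(addr, align):
    """Bytes of padding needed to move addr up to the next multiple of align."""
    return (-addr) % align


def straddlers(start, sizes, window=16):
    """Indices of instructions whose bytes span two aligned windows."""
    result = []
    addr = start
    for index, size in enumerate(sizes):
        first_window = addr // window
        last_window = (addr + size - 1) // window
        if first_window != last_window:
            result.append(index)
        addr += size
    return result


# Hypothetical instruction lengths in bytes (assumption, not from the patch).
sizes = [4, 5, 3, 5, 4, 2, 5, 3]
# Hypothetical address of the loop label before any ALIGN (assumption).
start = 0x1006

for align in (1, 8, 16):
    pad = align_padding(start, align)
    crossing = straddlers(start + pad, sizes)
    print(f"ALIGN {align:2d}: {pad:2d} padding bytes, straddling instructions {crossing}")
```

In this toy model every alignment choice leaves some instruction split
across a window, just a different one each time. Changing any entry in
`sizes` (as a different target or register size would) shifts the result
again.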

>> * and sometimes not only the mod16 or mod64 alignment matters, but also
>> the difference in memory address between this function and the rest of the
>> library.
>>
>> While this isn't as bad as gcc's random code generator, don't assume 
>> that the optimum you found in one configuration will be non-pessimal in 
>> the others.
>> If there is a single optimal place to add a single optimal number of NOPs, 
>> great. But often when I run into alignment weirdness, there is no such 
>> solution, and the best I can do is poke it with a stick until I find some 
>> combination of instructions that isn't so sensitive to alignment.
>
> I don't have much to poke around with as far as using different 
> instructions in this case.

One stick to poke with is unrolling.

> So should we just accept what is an obvious bad case on one 
> configuration because there is a chance that fixing it is worse 
> in another?

I expect this fix to have a positive effect on the performance of the
configurations you haven't benchmarked. If you don't want to benchmark
them, I won't reject this patch on those grounds.

I am merely saying that as long as you haven't identified the actual 
cause of the slowdowns, as long as performance is still random unto you, 
making decisions based on a thorough benchmark of only one compiler 
configuration is generalizing from one data point.

> Even the worst case versions are 80-90% faster than the C version in the 
> tested configuration (x86_64 unix). Is it likely that the worst case 
> will be much slower in another?

Not more than 40% slower. (Some confidence since on this question your 
benchmark counts as 24 data points, not 1.)
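
For reference, the spread that alignment alone produced in the quoted
table can be summarized with a short script. This is only a sketch over
the numbers already posted for x86_64 unix. It is not a prediction for
other configurations, and it is not necessarily how the figure above was
arrived at.

```python
# Timings copied from the quoted table, one row per LOOP1/LOOP2 alignment.
# Each tuple holds the MMX, MMX2 and SSE2 results in that order.
timings = {
    "NONE/NONE": (5270, 5283, 2757),
    "NONE/8":    (5200, 5077, 2644),
    "NONE/16":   (5723, 3961, 2161),
    "8/NONE":    (5214, 5339, 2787),
    "8/8":       (5198, 5083, 2722),
    "8/16":      (5936, 3902, 2128),
    "16/NONE":   (6613, 4788, 2580),
    "16/8":      (5490, 3702, 2020),
    "16/16":     (5474, 3680, 2000),
}
columns = ("MMX", "MMX2", "SSE2")


def spread(column):
    """Return (best, worst, relative slowdown of worst vs best) for one column."""
    values = [row[column] for row in timings.values()]
    best, worst = min(values), max(values)
    return best, worst, worst / best - 1.0


for index, name in enumerate(columns):
    best, worst, slowdown = spread(index)
    print(f"{name}: best {best}, worst {worst}, worst is {slowdown:.0%} slower")
```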

--Loren Merritt