[FFmpeg-devel] [PATCH][RFC] Lagarith Decoder.

Sat Aug 15 05:51:49 CEST 2009

2009/8/14 Måns Rullgård <mans at mansr.com>:
> Nathan Caldwell <saintdev at gmail.com> writes:
>
>> On Wed, Aug 12, 2009 at 7:54 AM, Reimar
>> Döffinger<Reimar.Doeffinger at gmx.de> wrote:
>>> On Wed, Aug 12, 2009 at 02:12:55PM +0200, Michael Niedermayer wrote:
>>>> On Mon, Aug 10, 2009 at 11:42:19PM -0600, Nathan Caldwell wrote:
>>>> > On Sat, Aug 8, 2009 at 6:32 AM, Michael Niedermayer<michaelni at gmx.at> wrote:
>>>> > >> +/* Fast round up to least power of 2 >= to x */
>>>> > >> +static inline uint32_t clp2(uint32_t x)
>>>> > >> +{
>>>> > >> +    x--;
>>>> > >> +    x |= (x >> 1);
>>>> > >> +    x |= (x >> 2);
>>>> > >> +    x |= (x >> 4);
>>>> > >> +    x |= (x >> 8);
>>>> > >> +    x |= (x >> 16);
>>>> > >> +    return x+1;
>>>> > >> +}
>>>> > >
>>>> > > is 1<<av_log2(x) faster?
>>>> >
>>>> > Might be, but it gives different results, so it's a moot point.
>>>>
>>>> 2<<av_log2(x-1)
>>>> or whatever
>>>
>>> Well, that all depends on what input range is needed.
>>> E.g. for 0 the documentation does not match the behaviour
>>> for the original function (returns 0 which is not even a
>>> power of 2).
>>> In the worst case, you'd have to do
>>> return x > 1 ? 2 << av_log2(x - 1) : x;
>>> I think, which has a small but still existing chance of
>>> being faster.
>>
>> Well, that went OT rather quickly, lol.
>> 0 input doesn't really matter. If we have a cumulative probability of
>> 0, then that means all probabilities are 0 and we have larger problems
>> than nearest power of 2 being incorrect.
>> Anyway, in my tests clp2 was faster than av_log2 by quite a large
>> margin: ~2000 dezicycles for av_log2 vs. ~400 dezicycles for clp2
>> (tested on both Core2 and lolAtom, with the same results). However
>> this is only run once per plane, and av_log2 looks cleaner, so I'll
>> just use it instead.
>
> Did you try using an av_log2() implementation using CLZ, BSR or
> similar instructions?  The shift/or sequence above may or may not be
> faster than the current av_log2().  I timed a few variants on ARM, and
> got these numbers:
>
> 2<<av_log2(x-1) w/ gcc:                 14 cycles
> 2<<av_log2(x-1) naively hand-assembled: 11
> clp2() above w/ gcc (doesn't mess up):  12
> hand-written asm using CLZ:              5

No, I didn't. I just benchmarked av_log2() vs. clp2(), then replaced
the code with av_log2(). As I said this is only run once per plane, so
I'm not too concerned that it's not the absolute fastest.
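
For reference, here is a rough sketch of the variants discussed in this
thread, side by side. It is not the code from the patch: ilog2_floor() is
a hypothetical stand-in for av_log2() (assumed here to return
floor(log2(v)), and 0 for v == 0), and clp2_clz() assumes a GCC/Clang
compiler with __builtin_clz() and a 32-bit unsigned int. The three
functions should agree for every 32-bit input, including the x == 0 and
overflow (x > 2^31) cases, where they all return 0. Note that the plain
1<<av_log2(x) rounds down rather than up (5 gives 4, not 8), which is why
it gives different results.

```c
#include <stdint.h>

/* Hypothetical stand-in for av_log2(): floor(log2(v)), 0 for v == 0. */
static int ilog2_floor(uint32_t v)
{
    int n = 0;
    while (v >>= 1)
        n++;
    return n;
}

/* Shift/or version as posted in the patch. */
static uint32_t clp2_shift_or(uint32_t x)
{
    x--;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}

/* Guarded log2 form: 0 and 1 are passed through unchanged. */
static uint32_t clp2_log2(uint32_t x)
{
    return x > 1 ? (uint32_t)2 << ilog2_floor(x - 1) : x;
}

/* Same idea using the compiler's count-leading-zeros builtin.
 * x - 1 is nonzero in the guarded branch, so __builtin_clz is defined,
 * and the shift count stays in 0..31. */
static uint32_t clp2_clz(uint32_t x)
{
    return x > 1 ? (uint32_t)2 << (31 - __builtin_clz(x - 1)) : x;
}
```

Whether the builtin actually compiles to a single CLZ/BSR instruction
depends on the target and compiler flags, so any timing comparison would
need to be redone per platform.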

-- 
-Nathan Caldwell