[FFmpeg-devel] [PATCH][RFC] Lagarith Decoder.

Wed Aug 12 16:53:08 CEST 2009

On Wed, Aug 12, 2009 at 03:41:01PM +0100, Måns Rullgård wrote:
> Reimar Döffinger <Reimar.Doeffinger at gmx.de> writes:
> 
> > On Wed, Aug 12, 2009 at 02:12:55PM +0200, Michael Niedermayer wrote:
> >> On Mon, Aug 10, 2009 at 11:42:19PM -0600, Nathan Caldwell wrote:
> >> > On Sat, Aug 8, 2009 at 6:32 AM, Michael Niedermayer<michaelni at gmx.at> wrote:
> >> > >> +/* Fast round up to least power of 2 >= to x */
> >> > >> +static inline uint32_t clp2(uint32_t x)
> >> > >> +{
> >> > >> +    x--;
> >> > >> +    x |= (x >> 1);
> >> > >> +    x |= (x >> 2);
> >> > >> +    x |= (x >> 4);
> >> > >> +    x |= (x >> 8);
> >> > >> +    x |= (x >> 16);
> >> > >> +    return x+1;
> >> > >> +}
> >> > >
> >> > > is 1<<av_log2(x) faster?
> >> > 
> >> > Might be, but it gives different results, so it's a moot point.
> >> 
> >> 2<<av_log2(x-1)
> >> or whatever
> >
> > Well, that all depends on what input range is needed.
> > E.g. for 0 the documentation does not match the behaviour
> > for the original function (returns 0 which is not even a
> > power of 2).
> > In the worst case, you'd have to do
> > return x > 1 ? 2 << av_log(x - 1) : x;
> > I think, which has a small but still existing chance of
> > being faster.
> 
> That's still easy to optimise, at least for ARM:
> 
> subs  r1, r0, #1
> clz   r1, r1
> movgt r0, #2
> rsb   r1, r1, #31
> lslgt r0, r0, r1
> 
> This should be about twice as fast as the shift/or version.

Well, you still have to teach the compiler to use clz for av_log2
at least; I think you haven't done that yet ;-)
It would be even nicer if that could be extended to get rid of the
ff_log2_tab table, assuming clz is fast enough...
(And of course I meant av_log2, not av_log, in the code above.)
PPC has such an instruction, too...
Even x86 has the BSR instruction; it's just too slow on too many
implementations...
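
For reference, here is a minimal sketch of the two variants discussed above,
side by side. log2_floor() is only a hypothetical stand-in for av_log2()
(it is not the FFmpeg implementation) and assumes a GCC/Clang-style
__builtin_clz; clp2_log2() and check_clp2_variants() are likewise names made
up for this illustration. As far as I can tell both versions agree over the
whole 32-bit range, including returning 0 for x == 0 and for x > 2^31.

```c
#include <assert.h>
#include <stdint.h>

/* Shift/or version from the patch: least power of 2 >= x.
 * Returns 0 for x == 0 and for x > 2^31 (the result wraps). */
static inline uint32_t clp2(uint32_t x)
{
    x--;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}

/* Hypothetical stand-in for av_log2(): floor(log2(x)), with 0 for x == 0.
 * Assumes a compiler that provides __builtin_clz (GCC, Clang). */
static inline int log2_floor(uint32_t x)
{
    return 31 - __builtin_clz(x | 1);
}

/* The variant suggested in the thread, with av_log2 replaced by the
 * stand-in above. Inputs 0 and 1 are passed through unchanged. */
static inline uint32_t clp2_log2(uint32_t x)
{
    return x > 1 ? 2u << log2_floor(x - 1) : x;
}

/* Compare both variants on small inputs and around every power of 2. */
static inline void check_clp2_variants(void)
{
    for (uint32_t x = 0; x < (1u << 16); x++)
        assert(clp2(x) == clp2_log2(x));

    for (int k = 0; k < 32; k++) {
        uint32_t p = 1u << k;
        assert(clp2(p - 1) == clp2_log2(p - 1));
        assert(clp2(p)     == clp2_log2(p));
        assert(clp2(p + 1) == clp2_log2(p + 1));
    }
}
```

Calling check_clp2_variants() once from a test program is enough to confirm
the two agree on those inputs; whether the log2 form is actually faster
still depends on how av_log2 ends up being compiled on the target.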