[FFmpeg-devel] [PATCH][RFC] Lagarith Decoder.

Reimar Döffinger Reimar.Doeffinger
Wed Aug 12 17:27:36 CEST 2009


On Wed, Aug 12, 2009 at 04:12:25PM +0100, Måns Rullgård wrote:
> Reimar Döffinger <Reimar.Doeffinger at gmx.de> writes:
> 
> > On Wed, Aug 12, 2009 at 03:41:01PM +0100, Måns Rullgård wrote:
> >> Reimar Döffinger <Reimar.Doeffinger at gmx.de> writes:
> >> 
> >> > On Wed, Aug 12, 2009 at 02:12:55PM +0200, Michael Niedermayer wrote:
> >> >> On Mon, Aug 10, 2009 at 11:42:19PM -0600, Nathan Caldwell wrote:
> >> >> > On Sat, Aug 8, 2009 at 6:32 AM, Michael Niedermayer<michaelni at gmx.at> wrote:
> >> >> > >> +/* Fast round up to least power of 2 >= to x */
> >> >> > >> +static inline uint32_t clp2(uint32_t x)
> >> >> > >> +{
> >> >> > >> +    x--;
> >> >> > >> +    x |= (x >> 1);
> >> >> > >> +    x |= (x >> 2);
> >> >> > >> +    x |= (x >> 4);
> >> >> > >> +    x |= (x >> 8);
> >> >> > >> +    x |= (x >> 16);
> >> >> > >> +    return x+1;
> >> >> > >> +}
> >> >> > >
> >> >> > > is 1<<av_log2(x) faster?
> >> >> > 
> >> >> > Might be, but it gives different results, so it's a moot point.
> >> >> 
> >> >> 2<<av_log2(x-1)
> >> >> or whatever
> >> >
> >> > Well, that all depends on what input range is needed.
> >> > E.g. for 0 the documentation does not match the behaviour
> >> > for the original function (returns 0 which is not even a
> >> > power of 2).
> >> > In the worst case, you'd have to do
> >> > return x > 1 ? 2 << av_log2(x - 1) : x;
> >> > I think, which has a small but still existing chance of
> >> > being faster.
> >> 
> >> That's still easy to optimise, at least for ARM:
> >> 
> >> subs  r1, r0, #1
> >> clz   r1, r1
> >> movgt r0, #2
> >> rsb   r1, r1, #31
> >> lslgt r0, r0, r1
> >> 
> >> This should be about twice as fast as the shift/or version.
> >
> > Well, you still have to teach the compiler at least to use clz for
> > av_log2, I think you haven't yet ;-)
> 
> I can't because it's in common.h, which is installed.  We really
> should find a way to fix that.

Just put the optimizations under HAVE_AV_CONFIG_H, like everything else
in there that is messy?
Btw. it might be worth investigating on x86, too: on most Intel Core CPUs
BSR has 1 cycle latency, and on AMD 64 it is still 4 cycles.
Of course Intel Atom, P4, Athlon XP etc. are quite an issue,
with 11 - 18 or so cycles latency (ignoring the 486 with 30+3*n or so
cycles)...
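
To make the comparison concrete, here is a minimal sketch of the two
variants discussed above: the shift/or clp2() from the patch and the
log2-based replacement. log2_floor() is a hypothetical stand-in for
av_log2() (floor(log2(v)), 0 for v == 0), not the libavutil code, and
the names clp2_shift_or/clp2_log2 are only for illustration. The
__builtin_clz path assumes GCC or Clang and a 32-bit unsigned int.

```c
#include <stdint.h>

/* Stand-in for av_log2(): floor(log2(v)), returning 0 for v == 0.
 * Hypothetical helper for this sketch only. */
static inline int log2_floor(uint32_t v)
{
#if defined(__GNUC__)
    /* GCC/Clang builtin, usually compiled to clz or bsr.
     * v | 1 avoids the undefined v == 0 case without changing
     * the result for any other input. */
    return 31 - __builtin_clz(v | 1);
#else
    int n = 0;
    while (v >>= 1)
        n++;
    return n;
#endif
}

/* Shift/or version from the patch: smear the top set bit of x - 1
 * into all lower bits, then add 1. */
static inline uint32_t clp2_shift_or(uint32_t x)
{
    x--;
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    return x + 1;
}

/* log2-based version. 0 and 1 are passed through unchanged, which
 * matches what the shift/or version returns for them. */
static inline uint32_t clp2_log2(uint32_t x)
{
    return x > 1 ? 2U << log2_floor(x - 1) : x;
}
```

Both return 0 for x == 0 and for x > 0x80000000, since the result does
not fit in 32 bits and wraps around. Whether the second form is actually
faster depends on how the log2 step gets compiled on the target.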


