[FFmpeg-devel] [PATCH] avutil/mathematics: speed up av_gcd by using Stein's binary GCD algorithm

Michael Niedermayer michael at niedermayer.cc
Sun Oct 11 02:22:21 CEST 2015


On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote:
> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote:
> > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer
> > <michael at niedermayer.cc> wrote:
> > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote:
> > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde
> > >> <gajjanagadde at gmail.com> wrote:
> > >> > This uses Stein's binary GCD algorithm:
> > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm
> > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard architectures.
> > >> > Have not benchmarked, so I can't comment
> > >>
> > >> Before you submit a patch that's supposed to make something faster,
> > >> you should benchmark it to verify that it is in fact faster. Do this
> > >> with inputs of various sizes on both 32- and 64-bit architectures and
> > >> both with and without compilers that support __builtin_ctzll(v).
> > >
> > > without __builtin_ctzll() the old code seems faster in a simple test
> > > like:
> > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov
> > 
> > >
> > > also it seems a simple
> > >     while (u != v) {
> > >         if (u > v)
> > >             FFSWAP(int64_t, v, u);
> > >         v -= u;
> > >         do {
> > >             v >>= 1;
> > >         } while(!(v & 1));
> > >     }
> > > is faster than the non builtin ctzll in its place but still not as
> > > fast as the current code
> > 
> > So I checked out the various implementation of (non built in) ctzll at
> > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch.
> > The funny thing is, among the looping ones, the fastest is the simplest :):
> > static inline int ff_ctzll_c(long long v)
> > {
> >     int c = 0;
> >     while (!(v & 1)) {
> >         c++;
> >         v >>= 1;
> >     }
> >     return c;
> > }

tested this with the matrixbench_mpeg2 ->mov test and it seems
20% or so slower than the LUT based version


> 
> its possible gcc/clang recognizes this and replaces it by the
> instruction for the builtin
> this is perfectly fine when such an instruction is available
> but when not this might be slower than what we would expect from
> just testing on architectures with such instructions

gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
fails to replace the loop by anything smart

you can use
make libavutil/mathematics.s

to compile the C code to asm and see how the functions get compiled

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

What does censorship reveal? It reveals fear. -- Julian Assange
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20151011/f85bda03/attachment.sig>


More information about the ffmpeg-devel mailing list