[FFmpeg-devel] [PATCH] avutil/mathematics: speed up av_gcd by using Stein's binary GCD algorithm

Ganesh Ajjanagadde gajjanag at mit.edu
Sun Oct 11 02:34:02 CEST 2015


On Sat, Oct 10, 2015 at 8:22 PM, Michael Niedermayer
<michael at niedermayer.cc> wrote:
> On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote:
>> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote:
>> > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer
>> > <michael at niedermayer.cc> wrote:
>> > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote:
>> > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde
>> > >> <gajjanagadde at gmail.com> wrote:
>> > >> > This uses Stein's binary GCD algorithm:
>> > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm
>> > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard architectures.
>> > >> > Have not benchmarked, so I can't comment
>> > >>
>> > >> Before you submit a patch that's supposed to make something faster,
>> > >> you should benchmark it to verify that it is in fact faster. Do this
>> > >> with inputs of various sizes on both 32- and 64-bit architectures and
>> > >> both with and without compilers that support __builtin_ctzll(v).
>> > >
>> > > without __builtin_ctzll() the old code seems faster in a simple test
>> > > like:
>> > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov
>> >
>> > >
>> > > also it seems a simple
>> > >     while (u != v) {
>> > >         if (u > v)
>> > >             FFSWAP(int64_t, v, u);
>> > >         v -= u;
>> > >         do {
>> > >             v >>= 1;
>> > >         } while(!(v & 1));
>> > >     }
>> > > is faster than the non builtin ctzll in its place but still not as
>> > > fast as the current code
>> >
>> > So I checked out the various implementation of (non built in) ctzll at
>> > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch.
>> > The funny thing is, among the looping ones, the fastest is the simplest :):
>> > static inline int ff_ctzll_c(long long v)
>> > {
>> >     int c = 0;
>> >     while (!(v & 1)) {
>> >         c++;
>> >         v >>= 1;
>> >     }
>> >     return c;
>> > }
>
> tested this with the matrixbench_mpeg2 ->mov test and it seems
> 20% or so slower than the LUT based version

If the LUT seems to be a good idea, I think the De Bruijn one should
be even better - no branching, only a multiply. and some shifts etc
Cache effects should be similar for the LUT and the De Bruijn, perhaps
slightly in favor of De Bruijn due to a smaller table (64 vs 256
entries).

Most places I see only the 32 bit version, so I can't quickly check this.

>
>
>>
>> its possible gcc/clang recognizes this and replaces it by the
>> instruction for the builtin
>> this is perfectly fine when such an instruction is available
>> but when not this might be slower than what we would expect from
>> just testing on architectures with such instructions
>
> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
> fails to replace the loop by anything smart
>
> you can use
> make libavutil/mathematics.s
>
> to compile the C code to asm and see how the functions get compiled
>
> [...]
>
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> What does censorship reveal? It reveals fear. -- Julian Assange
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>


More information about the ffmpeg-devel mailing list