[FFmpeg-devel] [PATCHv2] lavc/cbrt_tablegen: speed up tablegen
Ganesh Ajjanagadde
gajjanag at mit.edu
Mon Jan 11 23:21:22 CET 2016
On Fri, Jan 8, 2016 at 6:52 AM, Michael Niedermayer
<michael at niedermayer.cc> wrote:
> On Thu, Jan 07, 2016 at 05:20:55PM -0800, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 7, 2016 at 4:48 PM, Michael Niedermayer
>> <michael at niedermayer.cc> wrote:
>> > On Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde wrote:
>> >> This exploits an approach based on the sieve of Eratosthenes, a popular
>> >> method for generating prime numbers.
>> >>
>> >> Tables are identical to previous ones.
>> >>
>> >> Tested with FATE with/without --enable-hardcoded-tables.
>> >>
>> >> Sample benchmark (Haswell, GNU/Linux+gcc):
>> >> prev:
>> >> 7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> >> 7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> >> [...]
>> >> 7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> >> 7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips
>> >>
>> >> new:
>> >> 2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> >> 2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> >> [...]
>> >> 1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> >> 1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips
>> >>
>> >> Both small and large run counts are given: the function is called only
>> >> once, so a small run count may give the more realistic picture. The
>> >> small-count numbers are fairly consistent, and there is a steady downward
>> >> trend from small to large run counts, after which the value stabilizes.
>> >>
>> >> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>> >> ---
>> >> libavcodec/aacdec_fixed.c | 4 +--
>> >> libavcodec/aacdec_template.c | 2 +-
>> >> libavcodec/cbrt_tablegen.h | 53 ++++++++++++++++++++++++++-----------
>> >> libavcodec/cbrt_tablegen_template.c | 12 ++++++++-
>> >> 4 files changed, 51 insertions(+), 20 deletions(-)
>> >>
>> >> diff --git a/libavcodec/aacdec_fixed.c b/libavcodec/aacdec_fixed.c
>> >> index 396a874..f7b882b 100644
>> >> --- a/libavcodec/aacdec_fixed.c
>> >> +++ b/libavcodec/aacdec_fixed.c
>> >> @@ -155,9 +155,9 @@ static void vector_pow43(int *coefs, int len)
>> >> for (i=0; i<len; i++) {
>> >> coef = coefs[i];
>> >> if (coef < 0)
>> >> - coef = -(int)cbrt_tab[-coef];
>> >> + coef = -(int)cbrt_tab[-coef].i;
>> >> else
>> >> - coef = (int)cbrt_tab[coef];
>> >> + coef = (int)cbrt_tab[coef].i;
>> >> coefs[i] = coef;
>> >> }
>> >> }
>> >> diff --git a/libavcodec/aacdec_template.c b/libavcodec/aacdec_template.c
>> >> index d819958..1380510 100644
>> >> --- a/libavcodec/aacdec_template.c
>> >> +++ b/libavcodec/aacdec_template.c
>> >> @@ -1791,7 +1791,7 @@ static int decode_spectrum_and_dequant(AACContext *ac, INTFLOAT coef[1024],
>> >> v = -v;
>> >> *icf++ = v;
>> >> #else
>> >> - *icf++ = cbrt_tab[n] | (bits & 1U<<31);
>> >> + *icf++ = cbrt_tab[n].i | (bits & 1U<<31);
>> >> #endif /* USE_FIXED */
>> >> bits <<= 1;
>> >> } else {
>> >> diff --git a/libavcodec/cbrt_tablegen.h b/libavcodec/cbrt_tablegen.h
>> >> index 59b5a1d..e3d6634 100644
>> >> --- a/libavcodec/cbrt_tablegen.h
>> >> +++ b/libavcodec/cbrt_tablegen.h
>> >> @@ -26,14 +26,13 @@
>> >> #include <stdint.h>
>> >> #include <math.h>
>> >> #include "libavutil/attributes.h"
>> >> +#include "libavutil/intfloat.h"
>> >> #include "libavcodec/aac_defines.h"
>> >>
>> >> -#if USE_FIXED
>> >> -#define CBRT(x) lrint((x).f * 8192)
>> >> -#else
>> >> -#define CBRT(x) x.i
>> >> -#endif
>> >> -
>> >
>> >> +union ff_int32float64 {
>> >> + uint32_t i;
>> >> + double f;
>> >> +};
>> >> #if CONFIG_HARDCODED_TABLES
>> >> #if USE_FIXED
>> >> #define cbrt_tableinit_fixed()
>> >> @@ -43,20 +42,42 @@
>> >> #include "libavcodec/cbrt_tables.h"
>> >> #endif
>> >> #else
>> >> -static uint32_t cbrt_tab[1 << 13];
>> >> +static union ff_int32float64 cbrt_tab[1 << 13];
>> >
>> > this doubles the size of the cpu cache needed at runtime to store
>> > the same number of elements
>>
>> Yes, it does; it was a tradeoff I made and forgot to list. One can of
>> course use floats instead, but that loses a significant amount of
>> accuracy.
>>
>> So one could malloc and free a double precision array (for temporary
>> storage) at the cost of some code complexity, possible heap
>> fragmentation, and the possibility of allocation failure (which may be
>> ok, since aac_decode_init is not guaranteed to succeed anyway; it
>> allocates memory for the dsp context). Malloc/free is AFAIK on the order
>> of hundreds of cycles, dwarfed by the table generation cost.
>>
>> The problem is that it is impossible to give an answer as to precisely
>> what impact that will have on decoding/encoding performance, and
>> results of course vary based on hardware. This is the same problem
>> that plagues static/dynamic table performance analysis.
>>
>> I don't have a measurable performance regression on my machine for aac
>> decoding because of this. But then, my Haswell setup is not exactly
>> representative.
>
> you can use 2 separate arrays without union or maybe make the arrays
> part of the union instead of the array elements
Chose the first option (two separate arrays, no union) for lower code
complexity; this is what I meant by a static double array.
Pushed, thanks.
>
> [...]
>
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> He who knows, does not speak. He who speaks, does not know. -- Lao Tsu