[FFmpeg-devel] [PATCH] avcodec/aac_tablegen: speed up table initialization

Thu Nov 26 23:11:15 CET 2015

On Thu, Nov 26, 2015 at 4:31 PM, Ganesh Ajjanagadde
<gajjanagadde at gmail.com> wrote:
> This speeds up aac_tablegen to a ludicrous degree (~97%), i.e. to the point
> where it can be argued that runtime initialization can always be done instead of
> hard-coded tables. The only cost is essentially a trivial increase in
> the stack size.
>
> Even if one does not care about this, the patch also improves accuracy
> as detailed below.
>
> Performance:
> Benchmark obtained by looping 10^4 times over ff_aac_tableinit.
>
> Sample benchmark (x86-64, Haswell, GNU/Linux):
> old:
> 1295292 decicycles in ff_aac_tableinit,     512 runs,      0 skips
> 1275981 decicycles in ff_aac_tableinit,    1024 runs,      0 skips
> 1272932 decicycles in ff_aac_tableinit,    2048 runs,      0 skips
> 1262164 decicycles in ff_aac_tableinit,    4096 runs,      0 skips
> 1256720 decicycles in ff_aac_tableinit,    8192 runs,      0 skips
>
> new:
> 25691 decicycles in ff_aac_tableinit,     505 runs,      7 skips
> 25130 decicycles in ff_aac_tableinit,    1016 runs,      8 skips
> 25973 decicycles in ff_aac_tableinit,    2036 runs,     12 skips
> 25911 decicycles in ff_aac_tableinit,    4078 runs,     18 skips
> 25816 decicycles in ff_aac_tableinit,    8154 runs,     38 skips
>
> Accuracy:
> The previous code lost accuracy needlessly because pow was called
> twice in succession, with the second call operating on the
> already-rounded float result of the first. As an illustration of this:
> ff_aac_pow34sf_tab[3]
> old : 0.000000000007598092294225
> new : 0.000000000007598091426864
> real: 0.000000000007598091778545
>
> truncated to float
> old : 0.000000000007598092294225
> new : 0.000000000007598091426864
> real: 0.000000000007598091426864
>
> showing that the old value was not correctly rounded. This affects a
> large number of elements of the array.
>
> Patch tested with FATE.
>
> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
> ---
>  libavcodec/aac_tablegen.h | 38 ++++++++++++++++++++++++++++++++++++--
>  1 file changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/libavcodec/aac_tablegen.h b/libavcodec/aac_tablegen.h
> index 8b223f9..255723b 100644
> --- a/libavcodec/aac_tablegen.h
> +++ b/libavcodec/aac_tablegen.h
> @@ -35,9 +35,43 @@ float ff_aac_pow34sf_tab[428];
>  av_cold void ff_aac_tableinit(void)
>  {
>      int i;
> +
> +    /* 2^(i/16) for 0 <= i <= 15 */
> +    const double exp2_lut[] = {
> +        1.00000000000000000000,
> +        1.04427378242741384032,
> +        1.09050773266525765921,
> +        1.13878863475669165370,
> +        1.18920711500272106672,
> +        1.24185781207348404859,
> +        1.29683955465100966593,
> +        1.35425554693689272830,
> +        1.41421356237309504880,
> +        1.47682614593949931139,
> +        1.54221082540794082361,
> +        1.61049033194925430818,
> +        1.68179283050742908606,
> +        1.75625216037329948311,
> +        1.83400808640934246349,
> +        1.91520656139714729387,
> +    };
> +    double t1 = 8.8817841970012523233890533447265625e-16; // 2^(-50)
> +    double t2 = 3.63797880709171295166015625e-12; // 2^(-38)
> +    int t1_inc_cur, t2_inc_cur;
> +    int t1_inc_prev = 0;
> +    int t2_inc_prev = 8;
> +
>      for (i = 0; i < 428; i++) {
> -        ff_aac_pow2sf_tab[i] = pow(2, (i - POW_SF2_ZERO) / 4.0);
> -        ff_aac_pow34sf_tab[i] = pow(ff_aac_pow2sf_tab[i], 3.0/4.0);
> +        t1_inc_cur = 4 * (i % 4);
> +        t2_inc_cur = (8 + 3*i) % 16;
> +        if (t1_inc_cur < t1_inc_prev)
> +            t1 *= 2;
> +        if (t2_inc_cur < t2_inc_prev)
> +            t2 *= 2;
> +        ff_aac_pow2sf_tab[i] = t1 * exp2_lut[t1_inc_cur];
> +        ff_aac_pow34sf_tab[i] = t2 * exp2_lut[t2_inc_cur];
> +        t1_inc_prev = t1_inc_cur;
> +        t2_inc_prev = t2_inc_cur;
>      }
>  }
>  #endif /* CONFIG_HARDCODED_TABLES */
> --
> 2.6.2
>

BTW, a further speedup (from ~25000 to ~20000 decicycles) can be
obtained by changing t1, t2, and the LUT to float. It turns out that
this does not change any of the table values (i.e. they stay identical
to the more accurate new ones), and it also reduces the stack size.