[FFmpeg-devel] [PATCH] avutil/lls: speed up performance of solve_lls

Tue Nov 24 04:28:07 CET 2015

On Mon, Nov 23, 2015 at 8:00 PM, Ganesh Ajjanagadde
<gajjanagadde at gmail.com> wrote:
> This is a trivial rewrite of the loops that results in better cache
> efficiency. Essentially, iterating backwards over an array is a bad
> idea, since it leads to many cache misses: forward iteration fetches a
> new cache line that gets subsequently used, while backwards iteration fetches the
> cache line thinking that the indexing would move forwards (the common behavior),
> resulting in many cache misses.
>
> Speedup is nontrivial. Benchmarks obtained by 10^6 iterations within
> solve_lls, with START/STOP_TIMER. File is tests/data/fate/flac-16-lpc-cholesky.err.
> Hardware: x86-64, Haswell, GNU/Linux.
>
> new:
>   12598 decicycles in solve_lls, 2096871 runs,    281 skips
>   12393 decicycles in solve_lls, 4193889 runs,    415 skips
>   12250 decicycles in solve_lls, 8387253 runs,   1355 skips
>   12585 decicycles in solve_lls,16775089 runs,   2127 skips
>   12785 decicycles in solve_lls,33550272 runs,   4160 skips
>   12483 decicycles in solve_lls,67101734 runs,   7130 skips
>   12610 decicycles in solve_lls,134202614 runs,  15114 skips
>
> old:
>   18101 decicycles in solve_lls, 2096659 runs,    493 skips
>   17850 decicycles in solve_lls, 4193609 runs,    695 skips
>   17757 decicycles in solve_lls, 8387458 runs,   1150 skips
>   17746 decicycles in solve_lls,16775207 runs,   2009 skips
>   17906 decicycles in solve_lls,33550820 runs,   3612 skips
>   17931 decicycles in solve_lls,67102891 runs,   5973 skips
>   17907 decicycles in solve_lls,134206693 runs,  11035 skips
>
> -------------------------------------------------------------------------------
> Barring asm for this function, there are two (more interesting) potential
> optimizations for the Cholesky decomposition here:
> 1. Notice that update_lls is doing a rank one update of the matrix via the var
> vector. Rank one update of the Cholesky decomposition can be done much more
> efficiently than a full blown decomposition, see e.g Wikipedia. This will almost
> certainly require some API thought.
>
> 2. LDL' form of the Cholesky factorization offers 2 main advantages:
> a) Avoiding sqrt and its associated cost, trading it off with extra multiplications.
> This may or may not be worth the computational cost, though it seems like in
> FFmpeg, the Cholesky operates on small matrices of dim ~ 3-50, resulting in
> not too many extra multiplies.
> b) It provides benefits especially in the poorly conditioned
> case since one does not have to worry about nan's and associated tainting due
> to the sqrt.
> This may or may not require API thought.
>
> I personally view 1 as being more unambiguously worthy of exploration at this stage.
>
> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
> ---
>  libavutil/lls.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/libavutil/lls.c b/libavutil/lls.c
> index f77043b..9fd756e 100644
> --- a/libavutil/lls.c
> +++ b/libavutil/lls.c
> @@ -55,7 +55,7 @@ void avpriv_solve_lls(LLSModel *m, double threshold, unsigned short min_order)
>          for (j = i; j < count; j++) {
>              double sum = covar[i][j];
>
> -            for (k = i - 1; k >= 0; k--)
> +            for (k = 0; k <= i-1; k++)
>                  sum -= factor[i][k] * factor[j][k];
>
>              if (i == j) {
> @@ -71,14 +71,14 @@ void avpriv_solve_lls(LLSModel *m, double threshold, unsigned short min_order)
>      for (i = 0; i < count; i++) {
>          double sum = covar_y[i + 1];
>
> -        for (k = i - 1; k >= 0; k--)
> +        for (k = 0; k <= i-1; k++)
>              sum -= factor[i][k] * m->coeff[0][k];
>
>          m->coeff[0][i] = sum / factor[i][i];
>      }
>
>      for (j = count - 1; j >= min_order; j--) {
> -        for (i = j; i >= 0; i--) {
> +        for (i = 0; i <= j; i++) {
>              double sum = m->coeff[0][i];
>
>              for (k = i + 1; k <= j; k++)
> --
> 2.6.2
>

Cache explanation in the message above is wrong, a far more complex
interaction regarding software/hardware prefetching is likely here.
Trouble is my laptop lacks the entry in the BIOS for this, so I don't
know my own settings.  I also don't know whether servers tend to have
this enabled or disabled due to it being ambiguous. Furthermore, these
change with time, compilers, etc.

It may be helpful if someone else can reproduce this speedup, find a
performance regression, or find a correct explanation.