[FFmpeg-devel] [PATCH] avutil/lls: speed up performance of solve_lls

Wed Nov 25 02:19:51 CET 2015

On Mon, Nov 23, 2015 at 08:00:12PM -0500, Ganesh Ajjanagadde wrote:
> This is a trivial rewrite of the loops that results in better cache
> efficiency. Essentially, iterating backwards over an array is a bad
> idea, since it leads to many cache misses: forward iteration fetches a
> new cache line that gets subsequently used, while backwards iteration fetches the
> cache line thinking that the indexing would move forwards (the common behavior),
> resulting in many cache misses.
> 
> Speedup is nontrivial. Benchmarks obtained by 10^6 iterations within
> solve_lls, with START/STOP_TIMER. File is tests/data/fate/flac-16-lpc-cholesky.err.
> Hardware: x86-64, Haswell, GNU/Linux.
> 
> new:
>   12598 decicycles in solve_lls, 2096871 runs,    281 skips
>   12393 decicycles in solve_lls, 4193889 runs,    415 skips
>   12250 decicycles in solve_lls, 8387253 runs,   1355 skips
>   12585 decicycles in solve_lls,16775089 runs,   2127 skips
>   12785 decicycles in solve_lls,33550272 runs,   4160 skips
>   12483 decicycles in solve_lls,67101734 runs,   7130 skips
>   12610 decicycles in solve_lls,134202614 runs,  15114 skips
> 
> old:
>   18101 decicycles in solve_lls, 2096659 runs,    493 skips
>   17850 decicycles in solve_lls, 4193609 runs,    695 skips
>   17757 decicycles in solve_lls, 8387458 runs,   1150 skips
>   17746 decicycles in solve_lls,16775207 runs,   2009 skips
>   17906 decicycles in solve_lls,33550820 runs,   3612 skips
>   17931 decicycles in solve_lls,67102891 runs,   5973 skips
>   17907 decicycles in solve_lls,134206693 runs,  11035 skips
> 
> -------------------------------------------------------------------------------
> Barring asm for this function, there are two (more interesting) potential
> optimizations for the Cholesky decomposition here:
> 1. Notice that update_lls is doing a rank one update of the matrix via the var
> vector. Rank one update of the Cholesky decomposition can be done much more
> efficiently than a full blown decomposition, see e.g Wikipedia. This will almost
> certainly require some API thought.
> 
> 2. LDL' form of the Cholesky factorization offers 2 main advantages:
> a) Avoiding sqrt and its associated cost, trading it off with extra multiplications.
> This may or may not be worth the computational cost, though it seems like in
> FFmpeg, the Cholesky operates on small matrices of dim ~ 3-50, resulting in
> not too many extra multiplies.
> b) It provides benefits especially in the poorly conditioned
> case since one does not have to worry about nan's and associated tainting due
> to the sqrt.
> This may or may not require API thought.
> 
> I personally view 1 as being more unambiguously worthy of exploration at this stage.
> 
> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
> ---
>  libavutil/lls.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

this significantly changes the output for
libavutil/lls-test

is that intended ?

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have often repented speaking, but never of holding my tongue.
-- Xenocrates
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20151125/ac9c7194/attachment.sig>