[FFmpeg-devel] [PATCH] avutil/lls: speed up performance of solve_lls
gajjanag at mit.edu
Wed Nov 25 02:32:05 CET 2015
On Tue, Nov 24, 2015 at 8:19 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Mon, Nov 23, 2015 at 08:00:12PM -0500, Ganesh Ajjanagadde wrote:
>> This is a trivial rewrite of the loops that results in better cache
>> efficiency. Essentially, iterating backwards over an array is a bad
>> idea, since it leads to many cache misses: forward iteration fetches a
>> new cache line that gets subsequently used, while backwards iteration fetches the
>> cache line thinking that the indexing would move forwards (the common behavior),
>> resulting in many cache misses.
>> Speedup is nontrivial. Benchmarks obtained by 10^6 iterations within
>> solve_lls, with START/STOP_TIMER. File is tests/data/fate/flac-16-lpc-cholesky.err.
>> Hardware: x86-64, Haswell, GNU/Linux.
>> 12598 decicycles in solve_lls, 2096871 runs, 281 skips
>> 12393 decicycles in solve_lls, 4193889 runs, 415 skips
>> 12250 decicycles in solve_lls, 8387253 runs, 1355 skips
>> 12585 decicycles in solve_lls,16775089 runs, 2127 skips
>> 12785 decicycles in solve_lls,33550272 runs, 4160 skips
>> 12483 decicycles in solve_lls,67101734 runs, 7130 skips
>> 12610 decicycles in solve_lls,134202614 runs, 15114 skips
>> 18101 decicycles in solve_lls, 2096659 runs, 493 skips
>> 17850 decicycles in solve_lls, 4193609 runs, 695 skips
>> 17757 decicycles in solve_lls, 8387458 runs, 1150 skips
>> 17746 decicycles in solve_lls,16775207 runs, 2009 skips
>> 17906 decicycles in solve_lls,33550820 runs, 3612 skips
>> 17931 decicycles in solve_lls,67102891 runs, 5973 skips
>> 17907 decicycles in solve_lls,134206693 runs, 11035 skips
>> Barring asm for this function, there are two (more interesting) potential
>> optimizations for the Cholesky decomposition here:
>> 1. Notice that update_lls is doing a rank one update of the matrix via the var
>> vector. Rank one update of the Cholesky decomposition can be done much more
>> efficiently than a full blown decomposition, see e.g Wikipedia. This will almost
>> certainly require some API thought.
>> 2. LDL' form of the Cholesky factorization offers 2 main advantages:
>> a) Avoiding sqrt and its associated cost, trading it off with extra multiplications.
>> This may or may not be worth the computational cost, though it seems like in
>> FFmpeg, the Cholesky operates on small matrices of dim ~ 3-50, resulting in
>> not too many extra multiplies.
>> b) It provides benefits especially in the poorly conditioned
>> case since one does not have to worry about nan's and associated tainting due
>> to the sqrt.
>> This may or may not require API thought.
>> I personally view 1 as being more unambiguously worthy of exploration at this stage.
>> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>> libavutil/lls.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
> this significantly changes the output for
> is that intended ?
Wait, I thought there was only one line changed in a diff between the
old and new output, and it was trivial involving a nan. Maybe I posted
the wrong patch, in which case sorry.
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> I have often repented speaking, but never of holding my tongue.
> -- Xenocrates
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
More information about the ffmpeg-devel