[FFmpeg-devel] [PATCH 13/39] lavc/ffv1: drop redundant PlaneContext.quant_table

Sat Jul 20 12:22:43 EEST 2024

Quoting Michael Niedermayer (2024-07-18 19:40:04)
> On Thu, Jul 18, 2024 at 10:20:09AM +0200, Anton Khirnov wrote:
> > Quoting Michael Niedermayer (2024-07-18 00:32:38)
> > > the data for each decoder task should be together and not scattered around
> > > more than needed, reducing cache efficiency
> > > 
> > > putting all this extra code in the inner per pixel loop is not ok
> > > especially not for the sake of avoiding a memcpy of a few hundred bytes multiple levels of loops outside
> > 
> > A nice theory, but in practice this patchset makes single-threaded
> > decoding about 4% faster overall, on a 1920x1080 10-bit sample. That's
> > just the ffv1 parts (up to patch 28), full set also improves frame
> > threading performance as follows:
> > threads         improvement
> > ---------------------------
> > 2                  52% (yes really)
> > 4                  16%
> > 8                  12%
> 
> I do want the speed improvements, yes.
> 
> But
> you compare frame threading when slice threading performed
> much better than frame threading prior to the patch

If that were true in general, there'd be no reason for frame threading
support in ffv1, as it has a higher latency and uses more memory; higher
performance is its only advantage.

However you added frame threading in
a0c0900e470fde0d6db360e555620476c2323895 claiming it is faster, which I
can partially confirm even with current master - slice threading
saturates at thread count = slice count, while frame threading scales
beyond it. Frame threading also improves significantly after this set:

threads | slice   | frame/before | frame/after
--------+---------+--------------+------------
2       | 22.6124 | 43.738       | 22.0354
4       | 14.3367 | 15.115       | 13.1964
6       | 14.3850 | 11.974       | 10.9745
8       | 14.3472 | 9.7229       | 8.76617
10      | 14.3579 | 8.4638       | 8.6499
12      | 14.3665 | 8.4636       | 8.5735
16      | 14.2960 | 7.6926       | 7.1696
--------+---------+--------------+------------
(values are total decode time in seconds)

Note that after this set frame threading is ALWAYS faster than slice
threading, for any thread count.
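
For anyone who wants to check that claim against the numbers, here is a
minimal Python sketch. The values are copied from the table; the names
TIMES, frame_vs_slice and frame_after_always_faster are made up for this
illustration and are not part of FFmpeg or of the benchmark setup.

```python
# Total decode times in seconds, copied from the table in this mail.
# Layout: thread count -> (slice, frame_before, frame_after)
TIMES = {
    2: (22.6124, 43.738, 22.0354),
    4: (14.3367, 15.115, 13.1964),
    6: (14.3850, 11.974, 10.9745),
    8: (14.3472, 9.7229, 8.76617),
    10: (14.3579, 8.4638, 8.6499),
    12: (14.3665, 8.4636, 8.5735),
    16: (14.2960, 7.6926, 7.1696),
}


def frame_vs_slice(times):
    """Map thread count to slice_time / frame_after_time.

    A ratio above 1.0 means frame threading (after the set) took less
    time than slice threading at that thread count.
    """
    return {n: slice_t / after_t for n, (slice_t, _before_t, after_t) in times.items()}


def frame_after_always_faster(times):
    """True if frame threading after the set beats slice threading in every row."""
    return all(ratio > 1.0 for ratio in frame_vs_slice(times).values())


if __name__ == "__main__":
    for n, ratio in frame_vs_slice(TIMES).items():
        print(f"{n:2d} threads: slice time / frame-after time = {ratio:.2f}")
    print("frame/after faster in every row:", frame_after_always_faster(TIMES))
```

The ratio stays close to 1 at 2 threads and grows with the thread count,
which is the saturation of slice threading mentioned above.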

> also I'd like to see the individual changes which look like they should
> make the code slower, to be tested individually. If they make the code slower
> they should be dropped

I don't think it's meaningful to individually benchmark the patches
moving per-slice data into the new per-slice context. I split them to
simplify testing and review, but it only makes sense to apply all of
them or none, otherwise the code gets more complex.

-- 
Anton Khirnov