[FFmpeg-devel] [PATCH 13/39] lavc/ffv1: drop redundant PlaneContext.quant_table
Anton Khirnov
anton at khirnov.net
Sat Jul 20 12:22:43 EEST 2024
Quoting Michael Niedermayer (2024-07-18 19:40:04)
> On Thu, Jul 18, 2024 at 10:20:09AM +0200, Anton Khirnov wrote:
> > Quoting Michael Niedermayer (2024-07-18 00:32:38)
> > > the data for each decoder task should be together and not scattered around
> > > more than needed, reducing cache efficiency
> > >
> > > putting all this extra code in the inner per pixel loop is not ok
> > > especially not for the sake of avoiding a memcpy of a few hundred bytes multiple levels of loops outside
> >
> > A nice theory, but in practice this patchset makes single-threaded
> > decoding about 4% faster overall, on a 1920x1080 10bit sample. That's
> > just the ffv1 parts (up to patch 28), full set also improves frame
> > threading performance as follows:
> > > threads | improvement
> > > ---------------------------
> > > 2       | 52% (yes really)
> > > 4       | 16%
> > > 8       | 12%
>
> I do want the speed improvements, yes.
>
> But you compare frame threading when slice threading performed
> much better than frame threading prior to the patch
If that were true in general, there'd be no reason for frame threading
support in ffv1, as it has a higher latency and uses more memory; higher
performance is its only advantage. However, you added frame threading in
a0c0900e470fde0d6db360e555620476c2323895 claiming it is faster, which I
can partially confirm even with current master: slice threading
saturates at thread count = slice count, while frame threading scales
beyond it. Frame threading also improves significantly after this set:
threads | slice   | frame/before | frame/after
-----------------------------------------------
2       | 22.6124 | 43.738       | 22.0354
4       | 14.3367 | 15.115       | 13.1964
6       | 14.3850 | 11.974       | 10.9745
8       | 14.3472 | 9.7229       | 8.76617
10      | 14.3579 | 8.4638       | 8.6499
12      | 14.3665 | 8.4636       | 8.5735
16      | 14.2960 | 7.6926       | 7.1696
-----------------------------------------------
(values are total decode time in seconds)
Note that after this set frame threading is ALWAYS faster than slice
threading, for any thread count.
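
For anyone who wants to re-check the comparison, here is a minimal
Python sketch that takes the timings from the table above and tests
whether frame/after beats slice threading at every thread count. The
names `timings` and `time_reduction` are illustrative only and are not
part of the patchset or of any FFmpeg tooling; the numbers are copied
from the table, not re-measured.

```python
# Total decode time in seconds, copied from the table above.
# Layout: threads -> (slice, frame_before, frame_after)
timings = {
    2:  (22.6124, 43.738, 22.0354),
    4:  (14.3367, 15.115, 13.1964),
    6:  (14.3850, 11.974, 10.9745),
    8:  (14.3472, 9.7229, 8.76617),
    10: (14.3579, 8.4638, 8.6499),
    12: (14.3665, 8.4636, 8.5735),
    16: (14.2960, 7.6926, 7.1696),
}


def time_reduction(before, after):
    """Fractional reduction in decode time going from `before` to `after`.

    Positive means `after` is faster, negative means it is slower.
    """
    return (before - after) / before


# True only if frame threading after the set is faster than slice
# threading for every thread count listed in the table.
frame_beats_slice = all(after < slc for slc, _, after in timings.values())

for threads, (slc, before, after) in sorted(timings.items()):
    print(
        f"{threads:2d} threads: "
        f"frame/after vs slice {time_reduction(slc, after):+.1%}, "
        f"frame/after vs frame/before {time_reduction(before, after):+.1%}"
    )

print("frame/after faster than slice everywhere:", frame_beats_slice)
```

The sketch reports time reduction rather than a throughput ratio; that
is a choice of metric made here for illustration, so its percentages
need not match figures computed another way or taken from a different
run.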
> also I'd like to see the individual changes which look like they should
> make the code slower, to be tested individually. If they make the code slower
> they should be dropped
I don't think it's meaningful to individually benchmark the patches
moving per-slice data into the new per-slice context. I split them to
simplify testing and review, but it only makes sense to apply all of
them or none, otherwise the code gets more complex.
--
Anton Khirnov