[FFmpeg-devel] [PATCH] lavc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

Wed Jan 20 15:27:17 CET 2016

On Wed, 20 Jan 2016 13:26:05 +0100, you wrote:

>Hi,
>
>2016-01-19 13:46 GMT+01:00 John Cox <jc at kynesim.co.uk>:
>> I've just done a fair bit of work on hevc_cabac decode for the Raspberry
>> Pi2 and I think that the patch is generally applicable.  Patch is
>> attached but you may prefer to take it from git:
>
>This work is certainly impressive, and most people would have come up
>with only some of the "tricks" you used.
>Although it already represents quite a bit of work, I echo others'
>suggestions to have more incremental changes.
>
>> I have not yet run fate over it as I haven't yet finished downloading
>> the samples (the internet connection here isn't wildly fast), but I have
>> run it against the H265.1 conformance streams on both x86 and ARM and it
>> causes no regressions.
>
>Your patch fails on the later fate tests linked to range extensions
>(RExt sequences) on Win64. I didn't investigate why. Random thoughts:
>transform_skip, cross-channel residual, some bypass-coded elements (eg
>SAO).

Yeah - that does fail (and I'm not sure why either at the moment) - I
only tested against the published H.265.1 conformance suite and that
doesn't contain the RExt tests.

Do you believe that master ffmpeg produces the right answer for these
tests?  I didn't spot any RExt logic in the scale code when I rewrote it
(RExt does affect how numbers are processed there) and ffmpeg warns at
run time that it isn't supported.  Having said that, I would still have
expected my code to produce the same result as the old code, so I'll look
into it.

>> 3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though
>> ctz does)
>
>That could be a patch in and by itself.

Apparently ff_clz is now on master - but it wasn't in 2.8 (which is what
RPi needs).
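
For anyone stuck on a tree without a clz helper, a minimal sketch of a
fallback is below.  The name sketch_clz32 is made up for illustration
and is not the ff_clz from master; it assumes a 32-bit unsigned int and
uses the GCC/Clang builtin where available, with a plain loop otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not FFmpeg's ff_clz): count the leading zero bits
 * of a non-zero 32-bit value.  The result is undefined for 0, so reject it. */
static inline int sketch_clz32(uint32_t x)
{
    assert(x != 0);
#if defined(__GNUC__) || defined(__clang__)
    /* Assumes unsigned int is 32 bits wide, as on x86 and ARM. */
    return __builtin_clz(x);
#else
    /* Portable fallback: shift left until the top bit is set. */
    int n = 0;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}
```

On ARM the builtin normally compiles to a single CLZ instruction, which
is why it is attractive in the inner decode loop.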

>So, referring to your changes, it would be nice to have the following
>changes split in their own patches:
>1) significant coeff flag decoding, which probably is the largest gain
>(and therefore would be even nicer if further sliced):
>  a) for instance, you avoid an indirection by flattening/merging
>context tables;
>  b) other parts, which I fear may not translate that well for other
>platforms (at least without matching x86 code), or sequences
>2) you use native sized integers in some places (not sure if that can
>cause issues);
>3) bypass-coded stuff is a fairly large change (both in terms of code,
>review and impacting the cabac struct also used by h264); it would be
>nice knowing how much you gain here
>4) the replacing of !!something by something when the flag is already 0/1
>5) coefficient saturation
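
To make items 4) and 5) above concrete, here is a small illustrative
sketch.  It is not taken from the patch: the function names are
hypothetical, and the 16-bit clamp is an assumption about what
"coefficient saturation" means for coefficients stored as int16_t.

```c
#include <assert.h>
#include <stdint.h>

/* Item 4: if a flag is already known to be 0 or 1, then !!flag == flag,
 * so the normalisation can simply be dropped. */
static inline int sketch_ctx_inc(int flag_left, int flag_above)
{
    assert((flag_left & ~1) == 0 && (flag_above & ~1) == 0);
    return flag_left + flag_above; /* instead of !!flag_left + !!flag_above */
}

/* Item 5: clamp a wider intermediate value into the int16_t range
 * rather than letting it wrap on the narrowing store. */
static inline int16_t sketch_clip_int16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}
```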

I don't have formal numbers for everything but from the profiling I did
in development:

The by22 code gained me an overall factor of two in the abs level decode
- the gains do depend a lot on the quantity of residual - you gain a lot
more on I-frames than you do otherwise as they tend to have much longer
residuals.  The higher the bitrate the more useful this code is.  But as
you note it didn't use vast amounts of time relative to everything else
anyway.
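
As a rough illustration of why pulling several bypass bins at once can
pay off (this is not the by22 code itself, just the general idea in a
simplified model): n consecutive bypass decodes are the same thing as
one long division of the shifted offset by the range.  All names below
are hypothetical, and the bit reader is a toy.

```c
#include <assert.h>
#include <stdint.h>

/* Toy MSB-first bit reader (hypothetical, for the sketch only). */
typedef struct SketchBitReader {
    const uint8_t *buf;
    unsigned pos; /* position in bits */
} SketchBitReader;

static unsigned sketch_read_bit(SketchBitReader *br)
{
    unsigned bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return bit;
}

/* Simplified arithmetic decoder state: invariant offset < range. */
typedef struct SketchCabac {
    uint32_t range;
    uint32_t offset;
} SketchCabac;

/* One bypass bin: double the offset, pull in a bit, compare with range. */
static unsigned sketch_bypass_1(SketchCabac *c, SketchBitReader *br)
{
    c->offset = (c->offset << 1) | sketch_read_bit(br);
    if (c->offset >= c->range) {
        c->offset -= c->range;
        return 1;
    }
    return 0;
}

/* n bypass bins, one at a time. */
static unsigned sketch_bypass_n_loop(SketchCabac *c, SketchBitReader *br, int n)
{
    unsigned v = 0;
    while (n-- > 0)
        v = (v << 1) | sketch_bypass_1(c, br);
    return v;
}

/* n bypass bins in one step: quotient gives the bins, remainder the
 * new offset.  n is limited so the shifted offset fits in 32 bits. */
static unsigned sketch_bypass_n_div(SketchCabac *c, SketchBitReader *br, int n)
{
    uint32_t x = c->offset;
    unsigned v;
    int i;

    assert(n >= 1 && n <= 16);
    for (i = 0; i < n; i++)
        x = (x << 1) | sketch_read_bit(br);
    v = x / c->range;
    c->offset = x % c->range;
    return v;
}
```

A real decoder would replace the division with a multiply by a
precomputed reciprocal of the range and read the bits in bulk; the
sketch only shows that the two forms return the same bins and leave the
decoder in the same state.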

The reworking / simplification of the loop(s) around the abs level
decode and the scaling gave me the biggest single improvement.

After that the reworking of get_sig_coeff_flag_idxs was a useful gain.

Special-casing the single coeff path gave a similar gain.

After that the scale rework - now probably 75% faster than it was
previously but it wasn't taking a huge amount of time.

And after that all the other bits - my experience with optimising this
sort of code (I did a lot of work on a TI H.264 implementation in the
past) is that no single change is going to do everything, you just have
to polish everything until it goes fast enough.

>3) is indeed the largest chunk. I don't know what your profiling
>indicated, but the original code didn't seem that high-profile. But I
>haven't split it to see what it actually provided, but overall numbers
>look good:
>
>I quickly hacked (quickly being the keyword as it also means poor and
>potentially resulting in faulty conclusions) something that is close to
>2) + 4) for reference.
>Benching REF+1)a) vs REF+1), it did seem slower on Win64/Haswell for
>significant flag decoding by a few cycles (around 1% of the codeblock)
>Benching REF+1)a) vs your patch, I see around 3% improvement with
>something that is fairly more optimized overall than ffmpeg's master,
>ie ff_hevc_hls_residual_coding is a lot more prevalent, which is
>probably also the case in your rpi2 benchmarks.

Sorry - I don't quite understand what you've said here.

>Note: I don't think I'll review next iterations of the patch(set) with
>any shape of diligence, but some of the above parts (1.a, 4 and 5) are
>ok if not the cause of the fate issues.
>
>Best regards,

Thanks

JC