[FFmpeg-devel] [PATCH]levc/hevc_cabac Optimise ff_hevc_hls_residual_coding (especially ARM)

Wed Jan 20 13:26:05 CET 2016

Hi,

2016-01-19 13:46 GMT+01:00 John Cox <jc at kynesim.co.uk>:
> I've just done a fair bit of work on hevc_cabac decode for the Rasberry
> Pi2 and I think that the patch is generally applicable.  Patch is
> attached but you may prefer to take it from git:

This work is certainly impressive, and most people would have come
only with some of the "tricks" you used.
Although it already represents quite a bit of work, I echo others'
suggestions to have more incremental changes.

> I have not yet run fate over it as I haven't yet finished downloading
> the samples (the internet connection here isn't wildly fast), but I have
> run it against the H265.1 conformance streams on both x86 and ARM and it
> causes no regressions.

Your patch fails on the later fate tests linked to range extensions
(RExt sequences) on Win64. I didn't investigate why. Random thoughts:
transform_skip, cross-channel residual, some bypass-coded elements (eg
SAO).

> 3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though
> ctz does)

That could be a patch in and by itself.

So, referring to your changes, it would be nice to have the following
changes split in their own patches:
1) significant coeff flag decoding, which probably is the largest gain
(and therefore would be even nicer if further sliced):
  a) for instance, you avoid an indirection by flattening/merging
context tables;
  b) other parts, which I fear may not translate that well for other
platforms (at least without matching x86 code), or sequences
2) you use native sized integers in some places (not sure if that can
cause issues);
3) bypass-coded stuff is a fairly large change (both in terms of code,
review and impacting the cabac struct also used by h264); it would be
nice knowing how much you gain here
4) the replacing of !!something by something when the flag is already 0/1
5) coefficient saturation

3) is indeed the largest chunk. I don't know what your profiling
indicated, but the original code didn't seem that high-profile. But I
haven't split it to see what it actually provided, but overall numbers
look good:

I quickly hacked (quickly being the keyword as it also means poor and
potentially resulting in faulty conclusion) something that is close to
2) + 4) for reference.
Benching REF+1)a) vs REF+1), it did seem slower on Win64/Haswell for
significant flag decoding by a few cycles (around 1% of the codeblock)
Benching REF+1)a) vs your patch, I see around 3% improvement with
something that is fairly more optimized overall than ffmpeg's master,
ie ff_hevc_hls_residual_coding is a lot more prevalent, which is
probably also the case in your rpi2 benchmarks.

Note: I don't think I'll review next iterations of the patch(set) with
any shape of diligence, but some of the above parts (1.a, 4 and 5) are
ok if not the cause of the fate issues.

Best regards,
-- 
Christophe