[FFmpeg-devel] [HACK] 50% faster H.264 decoding

Thu Aug 19 00:44:33 CEST 2010

Hi,

On Wed, Aug 18, 2010 at 6:28 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Wed, Aug 18, 2010 at 12:42:11PM -0400, Ronald S. Bultje wrote:
>> On Tue, Aug 17, 2010 at 1:35 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Tue, Aug 17, 2010 at 11:01:03AM -0400, Ronald S. Bultje wrote:
>> >> On Mon, Aug 16, 2010 at 6:40 PM, Jason Garrett-Glaser
>> >> <darkshikari at gmail.com> wrote:
>> >> > On Mon, Aug 16, 2010 at 3:35 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> >> >> Hi,
>> >> >>
>> >> >> On Wed, Aug 11, 2010 at 5:32 PM, Jason Garrett-Glaser
>> >> >> <darkshikari at gmail.com> wrote:
>> >> >>> 13. Use MPEG-2 MC for chroma MC, since we know that MVs are
>> >> >>> fullpel-only.  Simplify edge emulation stuff accordingly too.
>> >> >>
>> >> >> Does h264 chroma subpel actually use a memcpy shortcut if it's
>> >> >> fullpel? I don't remember exactly, but I don't think it has such a
>> >> >> shortcut for chroma, only for luma.
>> >> >
>> >> > It doesn't.  It should at least have a shortcut for the 0,0 motion
>> >> > vector because it's very high probability (relative to other fullpel
>> >> > motion vectors that result in no chroma interpolation).  For other
>> >> > cases, it might or might not be worthwhile to add a branch in the asm
>> >> > to the 1D-only case.
>> >>
>> >> Attached sets up framework for that. The [0] functions can be copied
>> >> straight from VP8 (they are pixel_copy functions, with very fast
>> >> aligned implementations for all relevant archs) and others, and should
>> >> make VC-1, RV3/4, h264, H264/MPEG etc. significantly faster for the
>> >> MVxy==0 case. The [1]/[2] functions are probably going to be faster as
>> >> well but that would need some testing to see how big the effect is.
>> >> [3] is the function as-is now, which should obviously stay the way it
>> >> is.
>> >>
>> >> Michael, OK to apply this? It's mostly just changing all kind of files
>> >
>> > if it's not slower ...
>>
>> Same speed. Attached is an updated version that fixes a bug in one of
>> the fate samples where mx gets changed and thus we called the wrong
>> version.
>>
>> I've tested this version with a semi-finished patch that splits up the
>> h264 chroma MC functions (particularly the mc8 ones) into smaller
>> ones, thus having cleaner (and unbranched) handling of mx==0/my==0.
>> This will remove most (if not all) of the branching, which might give
>> a minor speedup, and also removes a little duplicate code (in the
>> binary, not source), e.g. the fullpel handling between
>> mmx/3dnow/mmx2/ssse3 rv40/h264/vc1 mc8 is identical (it's all
>> put_pixels8_mmx) and only needs a single function. I'm only doing this
>> for the C and x86 ones because I can't test any of the others.
>>
>> After that's done, I plan to do a third patch which will add fullpel
>> or 1D-filter versions for mc4/mc2 as well, which should actually
>> provide a speedup for code on our desktops, as we saw for Jason's
>> hackpatch.
>>
>> Ronald
>
>>  arm/dsputil_init_neon.c |   32 ++++++++++---
>>  cavs.c                  |   13 ++---
>>  dsputil.c               |   40 +++++++++++++---
>>  dsputil.h               |   12 ++--
>>  h264.c                  |   24 +++++----
>>  mpegvideo.c             |   28 ++++++-----
>>  ppc/h264_altivec.c      |   20 ++++++--
>>  rv34.c                  |    9 ++-
>>  rv40dsp.c               |   20 ++++++--
>>  sh4/dsputil_align.c     |   30 +++++++++---
>>  vc1dec.c                |   33 +++++++------
>>  vp6.c                   |    6 +-
>>  x86/dsputil_mmx.c       |  118 +++++++++++++++++++++++++++++++++++++-----------
>>  13 files changed, 272 insertions(+), 113 deletions(-)
>> 183027123a1213b2e037504a01d87c9c0678c1db  h264-chroma-mvzero-shortcut.patch
>
> no objections

Attached are the follow-up patches, C-only for now (still working on the asm).

Patch #1 splits the H264 function-creation macros into two, and
makes vc1_no_rnd use this macro instead of re-doing its own version of
it. The diff somehow shows mc2 changed into mc8, mc4 into mc2 and mc8
into mc4, rather than mc8 moved up from below, but the patch
should be readable nevertheless.

Patch #2 then splits the C functions into 3: one for x=0, one for y=0,
and the remaining one for 2D bilinear filtering. It also adds one for
the case where x=0 AND y=0 (direct copy). Running make fate shows no
regressions.
There is no speed change for 1D/2D. The direct copy would be expected
to be faster but I didn't test because the C code isn't that relevant.
I can test if you prefer, but I'd rather focus on the asm functions
and make sure every change there is speed-tested. If you want, I can
move the adding of the direct copy functions to a separate patch, but
I didn't think that was necessary.
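
For illustration, here is a minimal sketch of the kind of split described
above. It is not the attached patch. All function and table names are
hypothetical, and the table index (mx != 0) | ((my != 0) << 1) is an
assumption: it matches [0] = direct copy and [3] = 2D bilinear, but which
of [1]/[2] is horizontal and which is vertical is a guess. The 1D filters
use the fact that with one fraction zero, the 2D weights all carry a
factor of 8, so (8*a + 32) >> 6 equals (a + 4) >> 3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical common signature: x/y are 1/8-pel chroma fractions (0..7). */
typedef void (*chroma_mc_func)(uint8_t *dst, const uint8_t *src,
                               int stride, int h, int x, int y);

/* x == 0 && y == 0: plain 8-wide block copy, no filtering. */
static void put_chroma_mc8_copy(uint8_t *dst, const uint8_t *src,
                                int stride, int h, int x, int y)
{
    (void)x; (void)y;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        memcpy(dst, src, 8);
}

/* y == 0: horizontal-only 2-tap filter. */
static void put_chroma_mc8_h(uint8_t *dst, const uint8_t *src,
                             int stride, int h, int x, int y)
{
    (void)y;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        for (int j = 0; j < 8; j++)
            dst[j] = ((8 - x) * src[j] + x * src[j + 1] + 4) >> 3;
}

/* x == 0: vertical-only 2-tap filter. */
static void put_chroma_mc8_v(uint8_t *dst, const uint8_t *src,
                             int stride, int h, int x, int y)
{
    (void)x;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        for (int j = 0; j < 8; j++)
            dst[j] = ((8 - y) * src[j] + y * src[j + stride] + 4) >> 3;
}

/* General case: 2D bilinear filter with H.264-style rounding. */
static void put_chroma_mc8_2d(uint8_t *dst, const uint8_t *src,
                              int stride, int h, int x, int y)
{
    const int A = (8 - x) * (8 - y), B = x * (8 - y);
    const int C = (8 - x) * y,       D = x * y;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        for (int j = 0; j < 8; j++)
            dst[j] = (A * src[j] + B * src[j + 1] +
                      C * src[j + stride] + D * src[j + stride + 1] + 32) >> 6;
}

/* Assumed layout: [0] copy, [1] horizontal, [2] vertical, [3] 2D. */
static const chroma_mc_func put_chroma_mc8_tab[4] = {
    put_chroma_mc8_copy, put_chroma_mc8_h, put_chroma_mc8_v, put_chroma_mc8_2d
};

/* Pick the specialised function once per block instead of branching inside it. */
static void put_chroma_mc8(uint8_t *dst, const uint8_t *src,
                           int stride, int h, int mx, int my)
{
    assert(mx >= 0 && mx < 8 && my >= 0 && my < 8);
    put_chroma_mc8_tab[(mx != 0) | ((my != 0) << 1)](dst, src, stride, h, mx, my);
}
```

The caller selects the table entry from the fractional parts of the motion
vector, so each specialised function runs without internal mx/my branches,
and the copy case never touches the extra row or column of source pixels.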

I will do similar splits to the asm code, and add direct copy or 1D
filter functions for mc4/mc2 (currently, these only exist for mc8).

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dsputil-make-vc1_no_rnd-use-generic-macro.patch
Type: application/octet-stream
Size: 7902 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100818/4577e123/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264-split-zeromv-c.patch
Type: application/octet-stream
Size: 16352 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100818/4577e123/attachment-0001.obj>