[FFmpeg-devel] [HACK] 50% faster H.264 decoding

Luca Barbato lu_zero
Sun Aug 22 03:50:23 CEST 2010


On 08/19/2010 10:46 PM, Ronald S. Bultje wrote:
> Hi,
> 
> On Thu, Aug 19, 2010 at 4:34 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> On Thu, Aug 19, 2010 at 12:55 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>> On Thu, Aug 19, 2010 at 9:56 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>>> On Wed, Aug 18, 2010 at 6:44 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>>>> On Wed, Aug 18, 2010 at 6:28 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>>>> On Wed, Aug 18, 2010 at 12:42:11PM -0400, Ronald S. Bultje wrote:
>>>>>>> On Tue, Aug 17, 2010 at 1:35 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>>>>>> On Tue, Aug 17, 2010 at 11:01:03AM -0400, Ronald S. Bultje wrote:
>>>>>>>>> On Mon, Aug 16, 2010 at 6:40 PM, Jason Garrett-Glaser
>>>>>>>>> <darkshikari at gmail.com> wrote:
>>>>>>>>>> On Mon, Aug 16, 2010 at 3:35 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 11, 2010 at 5:32 PM, Jason Garrett-Glaser
>>>>>>>>>>> <darkshikari at gmail.com> wrote:
>>>>>>>>>>>> 13. Use MPEG-2 MC for chroma MC, since we know that MVs are
>>>>>>>>>>>> fullpel-only.  Simplify edge emulation stuff accordingly too.
>>>>>>>>>>>
>>>>>>>>>>> Does h264 chroma subpel actually use a memcpy shortcut if it's
>>>>>>>>>>> fullpel? I don't remember exactly, but I don't think it has such a
>>>>>>>>>>> shortcut for chroma, only for luma.
>>>>>>>>>>
>>>>>>>>>> It doesn't.  It should at least have a shortcut for the 0,0 motion
>>>>>>>>>> vector because it has very high probability (relative to other fullpel
>>>>>>>>>> motion vectors that result in no chroma interpolation).  For other
>>>>>>>>>> cases, it might or might not be worthwhile to add a branch in the asm
>>>>>>>>>> to the 1D-only case.
>>>>>>>>>
>>>>>>>>> Attached sets up framework for that. The [0] functions can be copied
>>>>>>>>> straight from VP8 (they are pixel_copy functions, with very fast
>>>>>>>>> aligned implementations for all relevant archs) and others, and should
>>>>>>>>> make VC-1, RV3/4, h264, H264/MPEG etc. significantly faster for the
>>>>>>>>> MVxy==0 case. The [1]/[2] functions are probably going to be faster as
>>>>>>>>> well but that would need some testing to see how big the effect is.
>>>>>>>>> [3] is the function as-is now, which should obviously stay the way it
>>>>>>>>> is.
>>>>>>>>>
>>>>>>>>> Michael, OK to apply this? It's mostly just changing all kinds of files
>>>>>>>>
>>>>>>>> if it's not slower ...
>>>>>>>
>>>>>>> Same speed. Attached is an updated version that fixes a bug in one of
>>>>>>> the fate samples where mx gets changed and thus we called the wrong
>>>>>>> version.
>>>>>>>
>>>>>>> I've tested this version with a semi-finished patch that splits up the
>>>>>>> h264 chroma MC functions (particularly the mc8 ones) into smaller
>>>>>>> ones, thus having cleaner (and unbranched) handling of mx==0/my==0.
>>>>>>> This will remove most (if not all) of the branching, which might give
>>>>>>> a minor speedup, and also removes a little duplicate code (in the
>>>>>>> binary, not source), e.g. the fullpel handling between
>>>>>>> mmx/3dnow/mmx2/ssse3 rv40/h264/vc1 mc8 is identical (it's all
>>>>>>> put_pixels8_mmx) and only needs a single function. I'm only doing this
>>>>>>> for the C and x86 ones because I can't test any of the others.
>>>>>>>
>>>>>>> After that's done, I plan to do a third patch which will add fullpel
>>>>>>> or 1D-filter versions for mc4/mc2 as well, which should actually
>>>>>>> provide a speedup for code on our desktops, as we saw for Jason's
>>>>>>> hackpatch.
>>>>>>>
>>>>>>> Ronald
>>>>>>
>>>>>>>  arm/dsputil_init_neon.c |   32 ++++++++++---
>>>>>>>  cavs.c                  |   13 ++---
>>>>>>>  dsputil.c               |   40 +++++++++++++---
>>>>>>>  dsputil.h               |   12 ++--
>>>>>>>  h264.c                  |   24 +++++----
>>>>>>>  mpegvideo.c             |   28 ++++++-----
>>>>>>>  ppc/h264_altivec.c      |   20 ++++++--
>>>>>>>  rv34.c                  |    9 ++-
>>>>>>>  rv40dsp.c               |   20 ++++++--
>>>>>>>  sh4/dsputil_align.c     |   30 +++++++++---
>>>>>>>  vc1dec.c                |   33 +++++++------
>>>>>>>  vp6.c                   |    6 +-
>>>>>>>  x86/dsputil_mmx.c       |  118 +++++++++++++++++++++++++++++++++++++-----------
>>>>>>>  13 files changed, 272 insertions(+), 113 deletions(-)
>>>>>>> 183027123a1213b2e037504a01d87c9c0678c1db  h264-chroma-mvzero-shortcut.patch
>>>>>>
>>>>>> no objections
>>>>>
>>>>> Attached are the follow-up patches, C-only for now (still working on the asm).
>>>>>
>>>>> Patch #1 splits the H264 macro function creation macros into two, and
>>>>> makes vc1_no_rnd use this macro instead of re-doing its own version of
>>>>> it. Patch somehow thinks I changed mc2 into mc8, mc4 into mc2 and mc8
>>>>> into mc4, rather than seeing I moved mc8 up from below, but the patch
>>>>> should be readable nevertheless.
>>>>>
>>>>> Patch #2 then splits the C functions into 3: one each for x=0 or y=0,
>>>>> and the remaining one for 2D bilinear filtering. It also adds one for
>>>>> the case where x=0 AND y=0 (direct copy). Make fate has no objections.
>>>>> There is no speed change for 1D/2D. The direct copy would be expected
>>>>> to be faster but I didn't test because the C code isn't that relevant.
>>>>> I can test if you prefer, but I'd rather focus on the asm functions
>>>>> and make sure every change there is speed-tested. If you want, I can
>>>>> move the adding of the direct copy functions to a separate patch, but
>>>>> I didn't think that was necessary.
>>>>>
>>>>> I will do similar splits to the asm code
>>>> [..]
>>>>
>>>> And these can be found in the attachment. I've checked make fate for MMX,
>>>> MMX2 and SSSE3 and all is identical. I will do some basic performance
>>>> checks to make sure I didn't screw up anything, but speed should be
>>>> identical except maybe for MMX avg_mc8 for x=0&&y=0, which is added by
>>>> this patch (it was pretty much a one-liner). This is generally not
>>>> used since MMX2/3DNOW versions are available also. If wanted, I can
>>>> separate this or remove it.
>>>>
>>>> Next step is to actually implement new functions for 1D/no-filter
>>>> mc4/mc2 which leads to the actually wanted speedup.
>>>
>>> Example of such an optimization attached, so we can start applying
>>> this whole thing (now that I'm showing an actual improvement in
>>> performance :-) ).
>>>
>>> START/STOP_TIMER around chroma_op[]() in h264.c, measuring only the
>>> case where mx=0, my=0 and chroma_function_index=1 (local hack). CPU is
>>> Intel Core i7 (Macbook Pro, OSX 10.6.4). GCC:
>>> i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664).
>>> Sample: /Users/ronaldbultje/Movies/fate-suite/h264-conformance/MR3_TANDBERG_B.264
>>>
>>> after:
>>> 1925 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
>>> 2075 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
>>> 2445 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
>>> 1903 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
>>> 1792 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
>>> 1609 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips
>>>
>>> before (here it would use the 2D filter ssse3 code):
>>> 2990 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
>>> 2850 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
>>> 2917 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
>>> 2623 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
>>> 2505 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
>>> 2518 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips
>>>
>>> C-only (the version after my patches applied, so the 32-bit direct
>>> read/write loop):
>>> 5230 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
>>> 5215 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
>>> 5755 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
>>> 4255 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
>>> 3819 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
>>> 3772 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips
>>
>> By popular request, here's one that adds the new code to
>> dsputil_yasm.asm instead of dsputil_mmx.c. Now I can actually read my
>> own code, too. make fate-h264 didn't complain about this change.
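
For readers following the quoted thread: the split described above (a
direct copy for mx == 0 && my == 0, a 1D filter when only one of mx/my is
non-zero, and the full 2D bilinear filter otherwise) can be sketched in
plain C roughly as below. This is an illustration, not the attached patch.
The function names, the fixed 4-pixel width and the index mapping
(mx != 0) | ((my != 0) << 1) are assumptions; the thread only says that [0]
is the copy, [1]/[2] are the 1D cases and [3] is the existing 2D function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature; mx/my are 1/8-pel chroma fractions in 0..7. */
typedef void (*chroma_mc_func)(uint8_t *dst, const uint8_t *src,
                               int stride, int h, int mx, int my);

/* [0]: mx == 0 && my == 0, plain copy of a 4-pixel-wide block. */
static void put_chroma4_copy(uint8_t *dst, const uint8_t *src,
                             int stride, int h, int mx, int my)
{
    (void)mx; (void)my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        memcpy(dst, src, 4);
}

/* [1] (assumed): my == 0, horizontal-only 1D filter. */
static void put_chroma4_h(uint8_t *dst, const uint8_t *src,
                          int stride, int h, int mx, int my)
{
    const int A = 8 - mx, B = mx;
    (void)my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < 4; x++)
            dst[x] = (uint8_t)((A * src[x] + B * src[x + 1] + 4) >> 3);
}

/* [2] (assumed): mx == 0, vertical-only 1D filter. */
static void put_chroma4_v(uint8_t *dst, const uint8_t *src,
                          int stride, int h, int mx, int my)
{
    const int A = 8 - my, B = my;
    (void)mx;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < 4; x++)
            dst[x] = (uint8_t)((A * src[x] + B * src[x + stride] + 4) >> 3);
}

/* [3]: general 2D bilinear filter (H.264-style weights, rounding +32, >> 6). */
static void put_chroma4_2d(uint8_t *dst, const uint8_t *src,
                           int stride, int h, int mx, int my)
{
    const int A = (8 - mx) * (8 - my), B = mx * (8 - my);
    const int C = (8 - mx) * my,       D = mx * my;
    for (int y = 0; y < h; y++, dst += stride, src += stride)
        for (int x = 0; x < 4; x++)
            dst[x] = (uint8_t)((A * src[x] + B * src[x + 1] +
                                C * src[x + stride] +
                                D * src[x + stride + 1] + 32) >> 6);
}

/* Assumed mapping: bit 0 set if mx != 0, bit 1 set if my != 0. */
static int chroma_mc_index(int mx, int my)
{
    return (mx != 0) | ((my != 0) << 1);
}

static const chroma_mc_func put_chroma4_tab[4] = {
    put_chroma4_copy, put_chroma4_h, put_chroma4_v, put_chroma4_2d
};

/* Picks the cheapest variant once, so the filters themselves stay branch-free. */
static void chroma_mc4(uint8_t *dst, const uint8_t *src,
                       int stride, int h, int mx, int my)
{
    assert(mx >= 0 && mx < 8 && my >= 0 && my < 8);
    put_chroma4_tab[chroma_mc_index(mx, my)](dst, src, stride, h, mx, my);
}
```

The 1D variants give the same result as the 2D filter with the other
fraction set to zero, because (8*k + 32) >> 6 equals (k + 4) >> 3, so the
split should only affect speed, not output.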

Keep in mind that some compilers (like open64) might support inline asm
but cannot do link-time optimization on yasm-generated binaries...
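
A minimal sketch of the trade-off, assuming a GCC-compatible compiler on
x86 (with a plain C fallback elsewhere). add_u32 and sum4 are hypothetical
helpers, not FFmpeg code. The point is only that an inline-asm helper is
visible at the call site and can be inlined into the loop, whereas a
routine assembled separately by yasm is reached through an ordinary
external call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: inline asm the compiler can inline into callers. */
static inline uint32_t add_u32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* AT&T syntax: a += b; "cc" marks the flags register as clobbered. */
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b; /* portable fallback for other targets */
#endif
}

/* Caller: the helper body can be folded into this loop by the compiler. */
static uint32_t sum4(const uint32_t *v)
{
    uint32_t s = 0;
    assert(v);
    for (int i = 0; i < 4; i++)
        s = add_u32(s, v[i]);
    return s;
}
```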

lu

-- 

Luca Barbato
Gentoo/linux
http://dev.gentoo.org/~lu_zero



