[FFmpeg-devel] Parallelized h264 proof-of-concept

Wed Jun 6 12:03:58 CEST 2007

Hi

Here's a version rewritten from scratch.

Michael Niedermayer wrote:
> Hi
> 
> On Fri, May 18, 2007 at 11:00:57PM +0200, Andreas Öman wrote:
> [...]
>> The issues left to fix are:
>>
>> o The error resilience data structures are not protected (but
>>   still shared). This usually manifests itself into:
>>
>> [h264 @ 0xb7c64208]concealing 0 DC, 0 AC, 0 MV errors
>>
>>   because the s->error_count decrement races between
>>   cpus. This is pretty easy to fix if the avcodec thread
>>   implementations would expose a locking primitive.
> 
> you dont need any locking, just n seperate error_counts, one for each
> thread, and then sum them at the end

Fixed
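
For readers following along, here is a minimal sketch of the "n separate error_counts" idea. It is not the code from the patch: SliceWorker, decode_slice() and count_errors_parallel() are hypothetical names, plain pthreads stand in for the avcodec thread implementation, and the counter counts up over a made-up per-macroblock "damaged" flag rather than decrementing s->error_count.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16

/* Hypothetical per-thread state; NOT the real H264Context layout. */
typedef struct SliceWorker {
    int error_count;                 /* written only by the owning thread */
    int first_mb;                    /* first macroblock of this worker */
    int mb_count;                    /* number of macroblocks to process */
    const unsigned char *mb_damaged; /* 1 = macroblock failed to decode */
} SliceWorker;

static void *decode_slice(void *arg)
{
    SliceWorker *w = arg;
    for (int i = 0; i < w->mb_count; i++)
        if (w->mb_damaged[w->first_mb + i])
            w->error_count++;        /* no lock: the field is private */
    return NULL;
}

/* Split mb_total macroblocks over n threads and sum the private counters. */
static int count_errors_parallel(const unsigned char *mb_damaged,
                                 int mb_total, int n)
{
    pthread_t tid[MAX_THREADS];
    SliceWorker w[MAX_THREADS];
    int total = 0;

    assert(n >= 1 && n <= MAX_THREADS);
    for (int t = 0; t < n; t++) {
        w[t].error_count = 0;
        w[t].first_mb    = (int)((long)mb_total * t / n);
        w[t].mb_count    = (int)((long)mb_total * (t + 1) / n) - w[t].first_mb;
        w[t].mb_damaged  = mb_damaged;
        pthread_create(&tid[t], NULL, decode_slice, &w[t]);
    }
    for (int t = 0; t < n; t++) {
        pthread_join(tid[t], NULL);  /* counter is final after the join */
        total += w[t].error_count;
    }
    return total;
}
```

The only synchronization point is the join, so the single-threaded path pays nothing extra.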

>> o deblocking doesn't work correctly. When deblocking is enabled
>>   the md5 sum output from my test program changes for every run.
>>   I quite sure this is caused by the fact that deblocking is done
>>   over the entire frame, not locally per slice, and thus, if
>>   slices complete out-of-order, there will be errors.
>>   I don't see any visual artifacts, but something is fishy for
>>   sure. I'll need to nail the exact reason before i can be
>>   more specific about problems / solutions here.
> 
> this is serious, md5 must match ...

Fixed.
Note that this patch does not enable multi-threading if
deblocking type == 1.
I'm going to look into whether it's worth postponing type 1 deblocking
until after the frame is decoded when running multi-threaded.
I'll also check whether it is possible to parallelize deblocking
itself (by doing it in diagonal strokes or something; I don't know yet).
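
One possible reading of "diagonal strokes", as a rough sketch only: group macroblocks into waves so that every macroblock in a wave can be filtered independently. The sketch assumes a macroblock must wait for its left, top and top-right neighbours (filtering the left and top edges also touches pixels in the neighbouring macroblocks), which gives waves of the form mb_x + 2*mb_y == d. wavefront_count() and wavefront_mbs() are hypothetical helpers and nothing here is taken from the patch.

```c
#include <assert.h>

/* Number of waves in an mb_width x mb_height grid when
 * wave index d = mb_x + 2 * mb_y (assumed dependency pattern). */
static int wavefront_count(int mb_width, int mb_height)
{
    return mb_width + 2 * (mb_height - 1);
}

/* Store the macroblocks of wave d in xs[]/ys[] and return how many
 * there are. All of them could be handed to different threads. */
static int wavefront_mbs(int d, int mb_width, int mb_height,
                         int *xs, int *ys)
{
    int n = 0;
    for (int y = 0; y < mb_height; y++) {
        int x = d - 2 * y;
        if (x >= 0 && x < mb_width) {
            xs[n] = x;
            ys[n] = y;
            n++;
        }
    }
    return n;
}
```

Waves would be processed in increasing d with a barrier between them. Whether the barrier cost eats the gain is exactly what needs measuring.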

> 
> 
>> o The SVQ3 decoding has not yet been adapted. (one need to configure
>>   with --disable-decoder=svq3 to compile at all now)
> 
> that too is serious, nothing may break ... though theres no need to make
> SVQ3 multithreaded too ...

Fixed

> 
> 
> [...]
>> Okay, a few words about the changes.
>>
>> A new structure H264Thread (name suggestions very welcome) is
>> passed around to almost all functions. This structure is
>> local for every slice (perhaps H264Slice would be a better
>> name) and contains all members from H264Context that
>> changed during slice decode. I also moved a few things
>> (most notably mb_[xy]) from MpegEncContext here.
> 
> what about copying the MpegEncContect & H264Context for each thread
> and using them, this should significantly reduce the changes
> needed (note i didnt look at your patch at all ...)

Yep, that's how it's done now.

There is some ugliness after MPV_common_init(), since the
thread contexts are allocated with sizeof(MpegEncContext).

A simple av_realloc() doesn't work, since it does not align
the memory correctly with CONFIG_MEMALIGN.

I see a few options here:

* Pass a second argument to MPV_common_init()
* Let MPV_common_init() look at some pre-initialized field in
  's' (s->super_context_size or something)
* "Fix" av_realloc() to align correctly (by using free + memalign +
  memcpy)
* Any other ideas?
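
To make the third option concrete, here is a minimal sketch of an alignment-preserving grow, assuming a POSIX system. aligned_grow() is a hypothetical helper, not a libavutil function, and it uses posix_memalign() in place of memalign(). Unlike realloc() it needs the old size, because there is no portable way to query it.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate new_size bytes aligned to `align`
 * (a power of two and a multiple of sizeof(void *)), copy the old
 * contents over and release the old block. `old` must be NULL or a
 * pointer that may be passed to free(). Returns NULL on failure and
 * leaves the old block untouched in that case. */
static void *aligned_grow(void *old, size_t old_size,
                          size_t new_size, size_t align)
{
    void *p = NULL;

    if (posix_memalign(&p, align, new_size) != 0)
        return NULL;
    if (old) {
        memcpy(p, old, old_size < new_size ? old_size : new_size);
        free(old);
    }
    return p;
}
```

The copy makes this slower than a realloc() that can extend in place, but it would only run once per thread context at init time.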

> 
> also look at how slice level multithreading is implemented for
> mpeg2/mpeg4 ...

>> Anyway,
>> If this is something that ffmpeg is willing to integrate
>> I'd like to get a few pointers, hints and answers on the
>> topics above before I continue with the stuff that's left.
> 
> iam not against slice level threading support, though the
> implementation must be clean, simple and there must be no
> speedloss for the single threaded case (>1% is completely
> unacceptable)
> 

This version is much cleaner. There are some "unrelated"
changes (border backup + copy stuff) that might be beneficial
to commit anyway (but the deblocking-type-2 conditional in xchg must
be there for deblocking to work correctly when run in parallel).

I've done a couple of tests with two streams.

There is no longer any slowdown compared to an unmodified version
of ffmpeg from head.

Each test was 1000 frames from the two streams. 10 tests were run,
and the 6 best average times from av_decode_video() were
averaged into the 'Time' column.
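
The averaging step can be sketched as follows, assuming "best" means lowest and that each of the 10 tests yields one average time. best_of_average() is a hypothetical name and is not part of the test program.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Average of the `keep` smallest values among times[0..n-1].
 * Sorts times[] in place. */
static double best_of_average(double *times, int n, int keep)
{
    double sum = 0.0;

    assert(keep >= 1 && keep <= n);
    qsort(times, (size_t)n, sizeof(*times), cmp_double);
    for (int i = 0; i < keep; i++)
        sum += times[i];
    return sum / keep;
}
```

Dropping the 4 slowest runs filters out runs disturbed by other activity on the machine.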

File A, CABAC, 6 slices, deblocking type 2
File B, CAVLC, 8 slices, deblocking type 0 + 1

Content  ffmpeg       CPU                     Concurrency       Time
---------------------------------------------------------------------
File A   unmodified   3GHz Xeon HT            n/a               16211
File A   patched      3GHz Xeon HT            1                 16113
File A   patched      3GHz Xeon HT            2                 15594
File A   patched      2.66GHz 4way Xeon HT    8                 4401
File A   unmodified   1.73GHz Pentium-M       n/a               15609
File A   patched      1.73GHz Pentium-M       1                 15538
File A   unmodified   2.13GHz Core2 duo       n/a               11148
File A   patched      2.13GHz Core2 duo       1                 11019
File A   patched      2.13GHz Core2 duo       2                 7168

File B   unmodified   3GHz Xeon HT            n/a               30286
File B   patched      3GHz Xeon HT            1                 29993
File B   patched      3GHz Xeon HT            2                 25913
File B   patched      2.66GHz 4way Xeon HT    8                 5129
File B   unmodified   1.73GHz Pentium-M       n/a               26892
File B   patched      1.73GHz Pentium-M       1                 26777
File B   unmodified   2.13GHz Core2 duo       n/a               19938
File B   patched      2.13GHz Core2 duo       1                 19681
File B   patched      2.13GHz Core2 duo       2                 11458

MD5 sums match in all tests. (If anyone wants, I can post the
test output with MD5 sums as well.)

I've also run some long-running tests on the 8-way system to make
sure there are no race conditions around.

Comments are of course welcome...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264-parallel-take2.diff
Type: text/x-patch
Size: 30364 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070606/057bbc86/attachment.bin>