[FFmpeg-devel] h264 threading fate tests

Mon Oct 1 11:50:19 CEST 2012

On Fri, Sep 28, 2012 at 11:07:54AM +0200, Clément Bœsch wrote:
> On Mon, Apr 16, 2012 at 04:43:42PM +0200, Clément Bœsch wrote:
> > Hi,
> > 
> > I recently setup a few fate instances to test the threading (2, 8, 16 and
> > auto), and regularly one of the h264 conformance test fails; look at the
> > yellow entries here for instance:
> > http://fate.ffmpeg.org/history.cgi?slot=x86_64-archlinux-gcc-threads-8
> > 
> > The other day, I tried to run an automated git bisect run, but
> > unfortunately testing the potential regression would requires to run
> > something like "while true; do make fate-h264 -j20 THREADS=8; done" for
> > around 15 minutes at least each time, and it might not even be reliable.
> > 
> > I'm not familiar at all with AVC decoding or threading in FFmpeg, but
> > maybe someone has an idea of what could cause this?
> > 
> > Maybe I should just open an issue in the trac?
> 
> OK so here is some more information about that race. Since the failure
> seems to be always the same (failing frame CRCs are identical), I dumped
> both outputs and made a diff. Here it is:
> 
>     http://lucy.pkh.me/race.html
> 
> Basically, we can notice some ±1 byte of difference at times. The
> generated outputs can be downloaded from that page.
> 
> Also, I was able the other day (purely by luck) to have a valgrind
> backtrace triggering the problem. The output was flooded, but here is a
> sample: http://pastie.org/4602183. BTW, it seems the test is using a frame
> based threading.
> 
> BTW, we observe only two failing tests: h264-conformance-cama2_vtc_b
> mostly, but at times h264-conformance-mr1_bt_a fails as well. The
> information I gave above are related to h264-conformance-cama2_vtc_b only.
> 
> Still anyone to look into this? :(
> 

More digging info: it seems the problem (or at least part of it) is around
H264Context->ref_count[2] (frame-based threading), especially when copying it
from the priv_data in lavc/h264.c:decode_update_thread_context():

    libavcodec/h264.c:1248:    copy_fields(h, h1, ref_count, list_count);

Here is a helgrind output triggering that problem:

    ==22439== Lock at 0xD76CCD0 was first observed
    ==22439==    at 0x4C2CF82: pthread_mutex_init (hg_intercepts.c:429)
    ==22439==    by 0x936AE0: ff_thread_init (pthread.c:838)
    ==22439==    by 0x9DD43E: avcodec_open2 (utils.c:933)
    ==22439==    by 0x47016C: transcode_init (ffmpeg.c:1892)
    ==22439==    by 0x45E39F: main (ffmpeg.c:2959)
    ==22439== 
    ==22439== Lock at 0xD768628 was first observed
    ==22439==    at 0x4C2CF82: pthread_mutex_init (hg_intercepts.c:429)
    ==22439==    by 0x936AE0: ff_thread_init (pthread.c:838)
    ==22439==    by 0x9DD43E: avcodec_open2 (utils.c:933)
    ==22439==    by 0x47016C: transcode_init (ffmpeg.c:1892)
    ==22439==    by 0x45E39F: main (ffmpeg.c:2959)
    ==22439== 
    ==22439== Possible data race during read of size 4 at 0xD734C7C by thread #1
    ==22439== Locks held: 1, at address 0xD76CCD0
    ==22439==    at 0x7B6F61: decode_update_thread_context (h264.c:1248)
    ==22439==    by 0x9356BC: ff_thread_decode_frame (pthread.c:458)
    ==22439==    by 0x9DE9B0: avcodec_decode_video2 (utils.c:1568)
    ==22439==    by 0x46D610: decode_video (ffmpeg.c:1570)
    ==22439==    by 0x45FA77: main (ffmpeg.c:1775)
    ==22439== 
    ==22439== This conflicts with a previous write of size 4 by thread #2
    ==22439== Locks held: 1, at address 0xD768628
    ==22439==    at 0x7C2A2C: ff_h264_decode_mb_cabac (h264_cabac.c:2011)
    ==22439==    by 0x7B547F: decode_slice (h264.c:3547)
    ==22439==    by 0x7B5EF1: execute_decode_slices (h264.c:3703)
    ==22439==    by 0x7BBEAE: decode_nal_units (h264.c:4006)
    ==22439==    by 0x7BCC02: decode_frame (h264.c:4118)
    ==22439==    by 0x934C4D: frame_worker_thread (pthread.c:388)
    ==22439==    by 0x4C2CD8D: mythread_wrapper (hg_intercepts.c:219)
    ==22439==    by 0x817DB4F: start_thread (pthread_create.c:304)
    ==22439== 
    ==22439== Address 0xD734C7C is 210556 bytes inside a block of size 421600 alloc'd
    ==22439==    at 0x4C27E16: memalign (vg_replace_malloc.c:727)
    ==22439==    by 0x4C27F07: posix_memalign (vg_replace_malloc.c:876)
    ==22439==    by 0xC1A18F: av_mallocz (mem.c:95)
    ==22439==    by 0x9DD30B: avcodec_open2 (utils.c:847)
    ==22439==    by 0x47016C: transcode_init (ffmpeg.c:1892)
    ==22439==    by 0x45E39F: main (ffmpeg.c:2959)
    ==22439== 

To reproduce this trace, I'm using the following script:

    #!/bin/sh
    runc=0
    while true; do
        echo "Run $runc"
        runc=$((runc+1))
        valgrind \
            --tool=helgrind \
            --suppressions=thread.suppr \
            --db-attach=yes \
            \
            ./ffmpeg_g \
                -d -nostdin -nostats -cpuflags all -vsync drop \
                -threads 2 \
                -thread_type frame+slice -i ~/fate-samples/h264-conformance/cama2_vtc_b.avc \
                -f rawvideo -y out.raw
    done

With the following thread.suppr suppression file, to suppress the race
triggered all the time (race or not):

    {
       common_frame_race
       Helgrind:Race
       fun:frame_thread_free
       fun:avcodec_close
       fun:main
    }

After around 10-50 runs (playing with the CPU seems to help somehow), I'm able
to trigger the gdb attachment for debugging.

Again, I'm not familiar at all with what's going on here, any help is welcome.

-- 
Clément B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 490 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20121001/c1bf11c4/attachment.asc>