[FFmpeg-devel] [PATCH 1/6] Frame-based multithreading framework using pthreads

Alexander Strange astrange
Fri Jan 21 11:29:40 CET 2011


On Mon, Dec 27, 2010 at 10:12 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Mon, Nov 15, 2010 at 08:37:01AM -0500, Alexander Strange wrote:
>> See doc/multithreading.txt for details on use in codecs.
>> ---
>>  doc/APIchanges         |    5 +
>>  doc/multithreading.txt |   63 +++++
>>  libavcodec/avcodec.h   |   80 ++++++-
>>  libavcodec/options.c   |    4 +
>>  libavcodec/pthread.c   |  656 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  libavcodec/thread.h    |   98 +++++++
>>  libavcodec/utils.c     |   64 +++++-
>>  libavcodec/w32thread.c |    6 +
>>  libavformat/utils.c    |    5 +
>>  libavutil/internal.h   |   11 +
>>  10 files changed, 979 insertions(+), 13 deletions(-)
>>  create mode 100644 doc/multithreading.txt
>>  create mode 100644 libavcodec/thread.h
>> 
>> diff --git a/doc/APIchanges b/doc/APIchanges
>> index b6806f8..c8892d9 100644
>> --- a/doc/APIchanges
>> +++ b/doc/APIchanges
>> @@ -13,6 +13,11 @@ libavutil:   2009-03-08
>> 
>>  API changes, most recent first:
>> 
>> +2010-11-XX - rX - lavc 52.xx.0 - threading API
>> +  Add CODEC_CAP_FRAME_THREADS with new restrictions on get_buffer()/
>> +  release_buffer()/draw_horiz_band() callbacks for appropriate codecs.
>> +  Add thread_type and active_thread_type fields to AVCodecContext.
>> +
>>  2010-11-13 - r25745 - lavc 52.95.0 - AVCodecContext
>>    Add AVCodecContext.subtitle_header and AVCodecContext.subtitle_header_size
>>    fields.
>> diff --git a/doc/multithreading.txt b/doc/multithreading.txt
>> new file mode 100644
>> index 0000000..65bdfb1
>> --- /dev/null
>> +++ b/doc/multithreading.txt
>> @@ -0,0 +1,63 @@
>> +FFmpeg multithreading methods
>> +==============================================
>> +
>> +FFmpeg provides two methods for multithreading codecs.
>> +
>> +Slice threading decodes multiple parts of a frame at the same time, using
>> +AVCodecContext execute() and execute2().
>> +
>> +Frame threading decodes multiple frames at the same time.
>> +It accepts N future frames and delays decoded pictures by N-1 frames.
>> +The later frames are decoded in separate threads while the user is
>> +displaying the current one.
>> +
>> +Restrictions on clients
>> +==============================================
>> +
>> +Slice threading -
>> +* The client's draw_horiz_band() must be thread-safe according to the comment
>> +  in avcodec.h.
>> +
>> +Frame threading -
>> +* Restrictions with slice threading also apply.
>> +* The client's get_buffer() and release_buffer() must be thread-safe as well.
>> +* There is one frame of delay added for every thread beyond the first one.
>> +  Clients using dts must account for the delay; pts sent through reordered_opaque
>> +  will work as usual.
>> +
>> +Restrictions on codec implementations
>> +==============================================
>> +
>> +Slice threading -
>> + None except that there must be something worth executing in parallel.
>> +
>> +Frame threading -
>> +* Codecs can only accept entire pictures per packet.
>> +* Codecs similar to ffv1, whose streams don't reset across frames,
>> +  will not work because their bitstreams cannot be decoded in parallel.
>> +
> 
>> +* The contents of buffers must not be read before ff_thread_await_progress()
>> +  has been called on them. reget_buffer() and buffer age optimizations no longer work.
> 
> why does the buffer age optimization not work anymore?
> if a MB is skipped in 100 frames and you have 10 frame decoding threads then
> the optimization should still allow some skipping. And actually this should be
> easy. And it does happen in reality (the black bars on top and bottom of films)

The problem is that each decoding thread has its own copy of the AVCodecContext, and get_buffer() uses that copy. So each thread ends up with its own internal_buffer, and the ages stop making sense once the AVFrames travel across threads.

This can be fixed by sharing the same internal_buffer between threads. But I tried it and everything broke; I spent too much time looking at it so I've punted it into a branch:

http://gitorious.org/~astrange/ffmpeg/ffmpeg-mt/commit/e5622e99c9608e612e9f386c1711d76ce08268e1

The nice thing is that mt adds so much more speed (clock time) than skipping does that I found it didn't really matter on the mpeg4 files I tried, so I think this can be put off. VP3 and H264 don't use it at all.

The new patch works with mplayer dr + custom get_buffer() out of the box, because it calls get_buffer() on the client's thread by default.
Setting thread_safe_callbacks restores the old behavior; ffplay uses this.
!thread_safe_callbacks is slower, and dr is no faster than !dr for me, so mplayer could probably gain a little by disabling dr.

> [...]
>> @@ -1203,7 +1219,8 @@ typedef struct AVCodecContext {
>>       * If non NULL, 'draw_horiz_band' is called by the libavcodec
>>       * decoder to draw a horizontal band. It improves cache usage. Not
>>       * all codecs can do that. You must check the codec capabilities
>> -     * beforehand.
>> +     * beforehand. May be called by different threads at the same time,
>> +     * so implementations must be reentrant.
> 
> thread safe or reentrant ?
> for many video filters, slice N has to be done before N+1 starts, and they have
> to be called in order even if you mix calls from several frames.

Unfortunately the patch doesn't enforce this. I looked at it and I believe this was already not enforced (just decode mpeg2 with threads on and it will be called out of order), but nothing ever seemed to use draw_horiz_band() with 4:2:0 video, so nobody noticed.

I wrote a longer comment about what you can expect (which is that anything at all can happen).

Note that playing 4:2:2 Theora video with this patch series will break in mplayer because of this, but IMO it's better to turn it off in mplayer than to fix it here.

>> 
>>       * The function is also used by hardware acceleration APIs.
>>       * It is called at least once during frame decoding to pass
>>       * the data needed for hardware render.
>> @@ -1457,6 +1474,9 @@ typedef struct AVCodecContext {
>>       * if CODEC_CAP_DR1 is not set then get_buffer() must call
>>       * avcodec_default_get_buffer() instead of providing buffers allocated by
>>       * some other means.
>> +     * May be called from a different thread if thread_type==FF_THREAD_FRAME
>> +     * is set, but not by more than one thread at once, so does not need to be
>> +     * reentrant.
> 
> isn't it thread_type&FF_THREAD_FRAME ?

Yes.

> and it can be called from more than one thread at once I think

There's a mutex around it (buffer_mutex) so it can't be.

> [...]
>> @@ -2811,6 +2863,26 @@ typedef struct AVCodec {
>>      const int64_t *channel_layouts;         ///< array of support channel layouts, or NULL if unknown. array is terminated by 0
>>      uint8_t max_lowres;                     ///< maximum value for lowres supported by the decoder
>>      AVClass *priv_class;                    ///< AVClass for the private context
>> +
>> +    /**
>> +     * @defgroup framethreading Frame-level threading support functions.
>> +     * @{
>> +     */
>> +    /**
>> +     * If defined, called on thread contexts when they are created.
>> +     * If the codec allocates writable tables in init(), re-allocate them here.
>> +     * priv_data will be set to a copy of the original.
>> +     */
>> +    int (*init_thread_copy)(AVCodecContext *);
> 
>> +    /**
>> +     * Copy necessary context variables from a previous thread context to the current one.
>> +     * If not defined, the next thread will start automatically; otherwise, the codec
>> +     * must call ff_thread_finish_setup().
>> +     *
>> +     * dst and src will (rarely) point to the same context, in which case memcpy should be skipped.
>> +     */
>> +    int (*update_thread_context)(AVCodecContext *dst, AVCodecContext *src);
> 
> src could be const

Done.

> 
> 
> [...]
>> +int ff_thread_decode_frame(AVCodecContext *avctx,
>> +                           AVFrame *picture, int *got_picture_ptr,
>> +                           AVPacket *avpkt)
>> +{
>> +    FrameThreadContext *fctx = avctx->thread_opaque;
>> +    int finished = fctx->next_finished;
>> +    PerThreadContext *p;
>> +    int err;
>> +
>> +    /*
>> +     * Submit a frame to the next decoding thread.
>> +     */
>> +
>> +    p = &fctx->threads[fctx->next_decoding];
>> +    update_context_from_user(p->avctx, avctx);
>> +    err = submit_frame(p, avpkt);
>> +    if (err) return err;
>> +
>> +    fctx->next_decoding++;
>> +
>> +    /*
>> +     * If we're still receiving the initial frames, don't return a picture.
>> +     */
>> +
>> +    if (fctx->delaying && avpkt->size) {
>> +        if (fctx->next_decoding >= (avctx->thread_count-1)) fctx->delaying = 0;
>> +
>> +        *got_picture_ptr=0;
>> +        return 0;
>> +    }
>> +
>> +    /*
>> +     * Return the next available picture from the oldest thread.
>> +     * If we're at the end of the stream, then we have to skip threads that
>> +     * didn't output a picture, because we don't want to accidentally signal
>> +     * EOF (avpkt->size == 0 && *got_picture_ptr == 0).
>> +     */
>> +
>> +    do {
>> +        p = &fctx->threads[finished++];
>> +
>> +        if (p->state != STATE_INPUT_READY) {
>> +            pthread_mutex_lock(&p->progress_mutex);
>> +            while (p->state != STATE_INPUT_READY)
>> +                pthread_cond_wait(&p->output_cond, &p->progress_mutex);
>> +            pthread_mutex_unlock(&p->progress_mutex);
>> +        }
>> +
>> +        *picture = p->picture;
>> +        *got_picture_ptr = p->got_picture;
>> +
>> +        avcodec_get_frame_defaults(&p->picture);
>> +        p->got_picture = 0;
>> +
>> +        if (finished >= avctx->thread_count) finished = 0;
>> +    } while (!avpkt->size && !*got_picture_ptr && finished != fctx->next_finished);
>> +
>> +    update_context_from_thread(avctx, p->avctx, 1);
>> +
>> +    if (fctx->next_decoding >= avctx->thread_count) fctx->next_decoding = 0;
>> +
>> +    fctx->next_finished = finished;
>> +
>> +    return p->result;
>> +}
> 
> I think this design has some issues.
> 1. (minor) I don't see why you don't return frames when they are ready at the
>   beginning
> 2. Frames can have variable duration if the picture doesn't change. A timespan
>   of 2 seconds without changes is quite realistic; 32 frames make one minute of
>   delay, which will kill A/V sync with some players and could also underflow
>   video buffers if they are time based (I am thinking of RT*P here).
>  Thus I think frames should be returned when they are ready



> 3. This code blocks when no frame is available to be returned. That's ok, but
>   we should try to also support a non-blocking, EAGAIN-returning form;
>   it's useful for players that also try to do other things in the decoding
>   thread and don't want to be blocked for half a second in unlucky cases




> [...]
>> +    /*
>> +     * Buffer age is difficult to keep track of between
>> +     * multiple threads, and the optimizations it allows
>> +     * are not worth the effort. It is disabled for now.
>> +     */
>> +    f->age = INT_MAX;
> 
> I see nothing hard in relation to age and threads.
> Please elaborate on what is difficult

Commented up there.

> [...]
>> +/**
>> + * Set the threading algorithms used.
>> + *
>> + * Threading requires more than one thread.
>> + * Frame threading requires entire frames to be passed to the codec,
>> + * and is incompatible with low_delay.
>> + *
>> + * @param avctx The context.
>> + */
>> +static void validate_thread_parameters(AVCodecContext *avctx)
>> +{
>> +    int frame_threading_supported = (avctx->codec->capabilities & CODEC_CAP_FRAME_THREADS)
>> +                                && !(avctx->flags & CODEC_FLAG_TRUNCATED)
>> +                                && !(avctx->flags & CODEC_FLAG_LOW_DELAY)
>> +                                && !(avctx->flags2 & CODEC_FLAG2_CHUNKS);
>> +    if (avctx->thread_count == 1)
>> +        avctx->active_thread_type = 0;
>> +    else if (frame_threading_supported && (avctx->thread_type & FF_THREAD_FRAME))
>> +        avctx->active_thread_type = FF_THREAD_FRAME;
>> +    else
>> +        avctx->active_thread_type = FF_THREAD_SLICE;
> 
> style nitpick if{}else if{}else

Done.


> [...]
>> +/**
>> + * Waits for decoding threads to finish and resets internal
>> + * state. Called by avcodec_flush_buffers().
>> + *
>> + * @param avctx The context.
>> + */
>> +void ff_thread_flush(AVCodecContext *avctx);
>> +
>> +/**
>> + * Submits a new frame to a decoding thread.
>> + * Returns the next available frame in picture. *got_picture_ptr
>> + * will be 0 if none is available.
>> + *
>> + * Parameters are the same as avcodec_decode_video2().
>> + */
>> +int ff_thread_decode_frame(AVCodecContext *avctx, AVFrame *picture,
>> +                           int *got_picture_ptr, AVPacket *avpkt);
>> +
>> +/**
>> + * Codecs which define update_thread_context should call this
>> + * when they are ready for the next thread to start decoding
>> + * the next frame. After calling it, do not change any variables
>> + * read by the update_thread_context method.
>> + *
>> + * @param avctx The context.
>> + */
>> +void ff_thread_finish_setup(AVCodecContext *avctx);
>> +
> 
> 
> 
>> +/**
>> + * Call this when some part of the picture is finished decoding.
>> + * Later calls with lower values of progress have no effect.
>> + *
>> + * @param f The picture being decoded.
>> + * @param progress Value, in arbitrary units, of how much of the picture has decoded.
>> + * @param field The field being decoded, for field-picture codecs.
>> + * 0 for top field or frame pictures, 1 for bottom field.
>> + */
>> +void ff_thread_report_progress(AVFrame *f, int progress, int field);
>> +
>> +/**
>> + * Call this before accessing some part of a picture.
>> + *
>> + * @param f The picture being referenced.
>> + * @param progress Value, in arbitrary units, to wait for.
>> + * @param field The field being referenced, for field-picture codecs.
>> + * 0 for top field or frame pictures, 1 for bottom field.
>> + */
>> +void ff_thread_await_progress(AVFrame *f, int progress, int field);
>> +
>> +/**
>> + * Call this function instead of avctx->get_buffer(f).
>> + *
>> + * @param avctx The current context.
>> + * @param f The frame to write into.
>> + */
>> +int ff_thread_get_buffer(AVCodecContext *avctx, AVFrame *f);
>> +
>> +/**
>> + * Call this function instead of avctx->release_buffer(f).
>> + *
>> + * @param avctx The current context.
>> + * @param f The picture being released.
>> + */
>> +void ff_thread_release_buffer(AVCodecContext *avctx, AVFrame *f);
> 
> You should document what the functions "do", not just when to call them.
> Well, you should document when to call them too, but what they do is more
> important IMHO. And I mean from a high-level view, not low-level implementation
> dox of course

Wrote more comments.

>> 
>> +
>> +#endif /* AVCODEC_THREAD_H */
>> diff --git a/libavcodec/utils.c b/libavcodec/utils.c
>> index 3875140..c931484 100644
>> --- a/libavcodec/utils.c
>> +++ b/libavcodec/utils.c
>> @@ -36,6 +36,7 @@
>>  #include "dsputil.h"
>>  #include "libavutil/opt.h"
>>  #include "imgconvert.h"
>> +#include "thread.h"
>>  #include "audioconvert.h"
>>  #include "internal.h"
>>  #include <stdlib.h>
>> @@ -256,6 +257,11 @@ int avcodec_default_get_buffer(AVCodecContext *s, AVFrame *pic){
>>      (*picture_number)++;
>> 
>>      if(buf->base[0] && (buf->width != w || buf->height != h || buf->pix_fmt != s->pix_fmt)){
>> +        if(s->active_thread_type&FF_THREAD_FRAME) {
>> +            av_log_missing_feature(s, "Width/height changing with frame threads is", 0);
>> +            return -1;
>> +        }
>> +
>>          for(i=0; i<4; i++){
>>              av_freep(&buf->base[i]);
>>              buf->data[i]= NULL;
>> @@ -517,18 +523,29 @@ int attribute_align_arg avcodec_open(AVCodecContext *avctx, AVCodec *codec)
>>          goto free_and_end;
>>      }
>>      avctx->frame_number = 0;
>> +
>> +    if (HAVE_THREADS && !avctx->thread_opaque) {
>> +        ret = avcodec_thread_init(avctx, avctx->thread_count);
>> +        if (ret < 0) {
>> +            goto free_and_end;
>> +        }
>> +    }
>> +
>>      if (avctx->codec->max_lowres < avctx->lowres) {
>>          av_log(avctx, AV_LOG_ERROR, "The maximum value for lowres supported by the decoder is %d\n",
>>                 avctx->codec->max_lowres);
>>          goto free_and_end;
>>      }
>> 
>> -    if(avctx->codec->init){
>> -        ret = avctx->codec->init(avctx);
>> -        if (ret < 0) {
>> -            goto free_and_end;
>> +    if(avctx->codec->init && !(avctx->active_thread_type&FF_THREAD_FRAME)){
>> +        if(avctx->codec->init){
>> +            ret = avctx->codec->init(avctx);
>> +            if (ret < 0) {
>> +                goto free_and_end;
>> +            }
>>          }
> 
>>      }
>> +
>>      ret=0;
> 
> cosmetic

Removed. (there was a funny merge glitch there too)

> 
> 
> [...]
>> @@ -737,7 +757,7 @@ av_cold int avcodec_close(AVCodecContext *avctx)
>> 
>>      if (HAVE_THREADS && avctx->thread_opaque)
>>          avcodec_thread_free(avctx);
>> -    if (avctx->codec && avctx->codec->close)
>> +    if (avctx->codec && avctx->codec->close && !(avctx->active_thread_type&FF_THREAD_FRAME))
>>          avctx->codec->close(avctx);
>>      avcodec_default_free_buffers(avctx);
>>      avctx->coded_frame = NULL;
> 
> Moving init/close elsewhere is a bit ugly, is this unavoidable?

close: avoidable, removed.
init: not avoidable, I just rediscovered why it was and then forgot because it's 5 AM.

If it is avoidable, the change would involve more than one line.


>> 
>> @@ -745,6 +765,7 @@ av_cold int avcodec_close(AVCodecContext *avctx)
>>      if(avctx->codec && avctx->codec->encode)
>>          av_freep(&avctx->extradata);
>>      avctx->codec = NULL;
>> +    avctx->active_thread_type = 0;
>>      entangled_thread_counter--;
>> 
>>      /* Release any user-supplied mutex. */
>> @@ -985,6 +1006,8 @@ void avcodec_init(void)
>> 
>>  void avcodec_flush_buffers(AVCodecContext *avctx)
>>  {
>> +    if(HAVE_PTHREADS && avctx->active_thread_type&FF_THREAD_FRAME)
>> +        ff_thread_flush(avctx);
>>      if(avctx->codec->flush)
>>          avctx->codec->flush(avctx);
>>  }
>> @@ -1185,3 +1208,30 @@ unsigned int ff_toupper4(unsigned int x)
>>              + (toupper((x>>16)&0xFF)<<16)
>>              + (toupper((x>>24)&0xFF)<<24);
>>  }
>> +
> [...]
> 
>> diff --git a/libavformat/utils.c b/libavformat/utils.c
>> index 8fcbe96..0b5d89c 100644
>> --- a/libavformat/utils.c
>> +++ b/libavformat/utils.c
>> @@ -913,6 +913,11 @@ static void compute_pkt_fields(AVFormatContext *s, AVStream *st,
>>      /* do we have a video B-frame ? */
>>      delay= st->codec->has_b_frames;
>>      presentation_delayed = 0;
>> +
>> +    // ignore delay caused by frame threading
>> +    if (delay && st->codec->active_thread_type&FF_THREAD_FRAME)
>> +        delay -= st->codec->thread_count-1;
>> +
>>      /* XXX: need has_b_frame, but cannot get it if the codec is
>>          not initialized */
>>      if (delay &&
> 
> This looks wrong
> Please elaborate on what you are trying to do here

I removed the change and everything still appears to work.
I think it may have been fixed by using pkt_dts??

The problem before was that threading increased has_b_frames, which caused it to print the "invalid dts/pts combination" warning on codecs like vp3 that have dts==pts all the time. But I don't see it now.

Attached another patch that just adopts pkt_dts in some places so a/v sync works better.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Adopt-pkt_pts-pkt_dts-in-lavc-clients.patch
Type: application/octet-stream
Size: 2651 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/0156578f/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Frame-based-multithreading-framework-using-pthreads.patch
Type: application/octet-stream
Size: 46907 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/0156578f/attachment-0001.obj>
More information about the ffmpeg-devel mailing list