[FFmpeg-devel] [PATCH] WIP: subtitles in AVFrame

Sun Nov 20 14:52:02 EET 2016

On Fri, Nov 11, 2016 at 04:46:51PM +0100, wm4 wrote:
[...]
> > +// TODO: delete this compatibility code when all subtitles encoders moved
> > +// to send_frame
> > +static int encode_subtitle_frame(AVCodecContext *avctx,
> > +                                 AVPacket *avpkt,
> > +                                 const AVFrame *frame,
> > +                                 int *got_packet_ptr)
> > +{
> > +    int i, ret;
> > +
> > +FF_DISABLE_DEPRECATION_WARNINGS
> > +    AVPacket tmp_pkt;
> > +    AVSubtitle subtitle;
> > +
> > +    get_subtitle_defaults(&subtitle);
> > +
> > +    /* Allocate (once) a temporary large output subtitle buffer */
> > +    if (!avctx->internal->byte_buffer) {
> > +        const int subtitle_out_max_size = 1024 * 1024;
> > +        uint8_t *subtitle_out = av_malloc(subtitle_out_max_size);
> > +        if (!subtitle_out)
> > +            return AVERROR(ENOMEM);
> > +
> > +        avctx->internal->byte_buffer      = subtitle_out;
> > +        avctx->internal->byte_buffer_size = subtitle_out_max_size;
> > +    }
> 
> This is used for encoding below... does the old API really not have a
> better way for this?
> 

Yeah, apparently. The subtitle API encoding never got updated; there is no
number at the end of "avcodec_encode_subtitle", which gives you an idea
about its era.

> I would have thought 1MB is a bit small, but then again I looked at
> ffmpeg.c, and it's using the same fixed size buffer.
> 

That's indeed where I took the value. Note that this function exists for
compatibility. Later on, we will likely make something smarter, but for
now I'm just making a wrapper: the priority is to make it clean and
transparent for the user (no AVSubtitle visible), then we can take the
time to cleanup our mess internally (making codecs handle AVFrame directly
instead of AVSubtitle)

[...]

(allocation issues raised fixed locally)

[...]
> > +static int decode_subtitle_frame(AVCodecContext *avctx,
> > +                                 AVFrame *frame,
> > +                                 int *got_frame_ptr,
> > +                                 const AVPacket *avpkt)
> > +{
> > +    int i, ret = 0;
> > +
> > +    *got_frame_ptr = 0;
> > +
> > +    if (!avctx->codec)
> > +        return AVERROR(EINVAL);
> > +
> > +    if (!avpkt->data && avpkt->size) {
> > +        av_log(avctx, AV_LOG_ERROR, "invalid packet: NULL data, size != 0\n");
> > +        return AVERROR(EINVAL);
> > +    }
> 
> didn't vobsubs have 0-sized packets at some point? Also, I think the
> caller should check this (avcodec_send_packet).
> 

That's the first check in avcodec_decode_subtitle2() so I suppose that's
valid. It's also present in avcodec_decode_audio4().

As you guessed later, that function is mostly a copy/paste from
avcodec_decode_subtitle2(). The difference being that there is
additionally a convert from AVSubtitle to AVFrame inside.

> > +
> > +    av_assert0(avctx->codec->type == AVMEDIA_TYPE_SUBTITLE);
> > +
> > +    av_frame_unref(frame);
> 
> It's already guaranteed to be unreffed.
> 

OK, dropped

> > +
> > +    if ((avctx->codec->capabilities & AV_CODEC_CAP_DELAY) || avpkt->size) {
> > +        AVPacket pkt_recoded;
> > +        AVPacket tmp = *avpkt;
> > +        int did_split = av_packet_split_side_data(&tmp);
> > +
> > +        if (did_split) {
> > +            /* FFMIN() prevents overflow in case the packet wasn't allocated with
> > +             * proper padding.
> > +             * If the side data is smaller than the buffer padding size, the
> > +             * remaining bytes should have already been filled with zeros by the
> > +             * original packet allocation anyway. */
> > +            memset(tmp.data + tmp.size, 0,
> > +                   FFMIN(avpkt->size - tmp.size, AV_INPUT_BUFFER_PADDING_SIZE));
> > +        }
> > +        pkt_recoded = tmp;
> > +
> > +        ret = recode_subtitle(avctx, &pkt_recoded, &tmp);
> > +        if (ret >= 0) {
> 
> IMO early exit would be nicer...
> 
> This is mostly the old code copy&pasted, I assume.
> 

Yes, I'd like to limit the differences with the original function since
this is kind of a large work with many risks of getting things wrong.

But after reading your review below, I realized I could just call
avcodec_decode_subtitle2() in that function and do the convert afterwards,
which is what I just did locally, addressing as a result a bunch of your
comments later on.

[...]
> > +
> > +            frame->sub_start_display = sub.start_display_time ? av_rescale_q(sub.start_display_time, av_make_q(1, 1000), AV_TIME_BASE_Q)
> > +                                                              : AV_NOPTS_VALUE;
> > +            frame->sub_end_display   = sub.end_display_time   ? av_rescale_q(sub.end_display_time,   av_make_q(1, 1000), AV_TIME_BASE_Q)
> > +                                                              : AV_NOPTS_VALUE;
> 
> end_display_time can be UINT32_MAX for "infinite". I think we should
> clean this up here, and differentiate between "0", "unknown", and
> "until next event".
> 

I don't even know what's the convention today, so this part is currently
mostly ignored from this first prototype.

What would be the difference between "infinite" and "until next
(potentially clear) event"?

[...]
> > +static const AVOption sbuffer_options[] = {
> > +    { "time_base",      NULL, OFFSET(time_base),  AV_OPT_TYPE_RATIONAL, { .dbl = 0 }, 0, INT_MAX, S },
> > +    { "format",         NULL, OFFSET(sub_format), AV_OPT_TYPE_INT,      { .i64 = -1 }, -1, 1, S, "format" },
> > +        { "unspecified", NULL, 0, AV_OPT_TYPE_CONST, {.i64=-1}, INT_MIN, INT_MAX, S, "format" },
> > +        { "bitmap",      NULL, 0, AV_OPT_TYPE_CONST, {.i64=0},  INT_MIN, INT_MAX, S, "format" },
> > +        { "text",        NULL, 0, AV_OPT_TYPE_CONST, {.i64=1},  INT_MIN, INT_MAX, S, "format" },
> 
> Implies no mixed subs. I'm fine with that, but might be restrictive.
> 

Yeah, so far we don't have such things AFAIK. But we could replace this
with a mask later on. It was simpler for the initial format negotiation
that way.

[...]
> > +    } else if (type == AVMEDIA_TYPE_SUBTITLE) {
> > +        if (ff_add_format(&ret, 0 /* bitmap */) < 0 ||
> > +            ff_add_format(&ret, 1 /* text */) < 0)
> > +            return NULL;
> 
> Questionable, is probably preliminary code.
> 

Yes, preliminary code. It involves making a decision about what you just
raised earlier so I make a quick experiment with something as basic as
what we have today.

[...]
> >      while (av_read_frame(fmt, &pkt) >= 0) {
> > -        int i, got_subtitle;
> > -        AVSubtitle sub = {0};
> > -
> >          if (pkt.stream_index == sid) {
> > -            ret = avcodec_decode_subtitle2(dec_ctx, &sub, &got_subtitle, &pkt);
> > -            if (ret < 0) {
> > -                av_log(ctx, AV_LOG_WARNING, "Error decoding: %s (ignored)\n",
> > -                       av_err2str(ret));
> > -            } else if (got_subtitle) {
> > -                const int64_t start_time = av_rescale_q(sub.pts, AV_TIME_BASE_Q, av_make_q(1, 1000));
> > -                const int64_t duration   = sub.end_display_time;
> > -                for (i = 0; i < sub.num_rects; i++) {
> > -                    char *ass_line = sub.rects[i]->ass;
> > +            int i;
> > +
> > +            ret = avcodec_send_packet(dec_ctx, &pkt);
> > +            if (ret < 0)
> > +                break;
> > +            // XXX probably not correct api usage
> > +            ret = avcodec_receive_frame(dec_ctx, frame);
> > +            if (ret >= 0) {
> 
> I've found that converting code which used to old API to the new API is
> very messy. I think that's mostly because the data flow is rather
> different.
> 
> Correct API usage would be something like:
> 
>   while (1) {
>     if (av_codec_send_packet() == AVERROR_EAGAIN)
>       add_packet_back_to_demuxer_queue()
>     while (1) {
>       ret = avcodec_receive_frame
>       if (ret == AVERROR_EAGAIN)
>          break; // new packet needed
>       /* process frame */
>     }
>   }
> 
> Plus EOF and error handling.
> 
> avcodec_send_packet() returning EAGAIN in this specific situation can
> probably be reasonably treated as error instead. Unless the decoder
> violates the declared API rules.
> 

I might actually not clean that code but mark it as deprecated instead,
and make the filter take a stream directly (instead of a filename). It was
an initial attempt to make some tests without updating ffmpeg.c,
integrating into lavfi, etc. Now I can do it properly, so that code will
go away.

[...]
> > diff --git a/libavformat/assenc.c b/libavformat/assenc.c
> > index d50f18f..97a8de8 100644
> > --- a/libavformat/assenc.c
> > +++ b/libavformat/assenc.c
> > @@ -164,6 +164,9 @@ static int write_packet(AVFormatContext *s, AVPacket *pkt)
> >      int hh2, mm2, ss2, ms2;
> >      DialogueLine *dialogue = av_mallocz(sizeof(*dialogue));
> >  
> > +    if (pkt->duration < 0)
> > +        end = INT64_MAX;
> 
> Wait what.
> 

This needs to be in a separate patch with a longer explanation. IIRC I
didn't that because the last SAMI subtitle has no duration so it needs to
be muxed as "til the end".

[...]
> > +static int get_subtitle_buffer(AVFrame *frame)
> > +{
> > +    int i, ret;
> > +
> > +    ret = get_data_buffer(frame, frame->sub_nb_rects, sizeof(AVFrameSubtitleRectangle));
> > +    if (ret < 0)
> > +        return ret;
> > +    for (i = 0; i < frame->sub_nb_rects; i++)
> > +        memset(frame->extended_data[i], 0, sizeof(AVFrameSubtitleRectangle));
> > +    return 0;
> 
> I'm noticing it doesn't allocate the actual subtitle bitmap or text
> data, which is a difference from audio/video.
> 

Yes, exactly. This is tricky. It's really tricky because contrary to audio
and video, we can not know the size required for every (not yet allocated)
rectangle from the raw format information alone.

For video, we have the pixel format, and dimensions. For audio we have the
sample format and number of samples. But for subtitles, we can at most
have the number of rectangles.

To workaround this problem, I just made the get_buffer callback allocate
the rectangles holder. The decoder will then allocate every rectangle data
(of different sizes) manually, be it text or actual bitmap rectangle.

But indeed this means that every ref counting will mean a new allocation
and copy of the rectangles. As subtitles are meant to be small and sparse,
I'm assuming it's not exactly a problem.

Do you have a better suggestion?

Maybe we could have the user (decoders) allocate the rectangles holder and
fill all the sizes instead, and then call the get buffer, but this might
be annoying as it will require for most of them to do it in two-pass
instead of one as of today.

> >  }
> >  
> >  int av_frame_get_buffer(AVFrame *frame, int align)
> >  {
> > +    if (frame->sub_nb_rects)
> > +        return get_subtitle_buffer(frame);
> 
> No format set is valid for subtitles?
> 

Yes, no format set means text. If the format is set, it defines the format
used by the rectangles.

[...]
> > +    /* image data for bitmap subtitles, in AVFrame.format (AVPixelFormat) */
> > +    uint8_t *data[AV_NUM_DATA_POINTERS];
> > +    int linesize[AV_NUM_DATA_POINTERS];
> 
> Does it define the format somewhere?
> 

in AVFrame.format

> > +
> > +    /* decoded text for text subtitles, in ASS */
> > +    char *text;
> 
> I'm fine with this - other would probably argue this should be in
> data[0] or so.
> 
> > +
> > +    int flags;
> 
> Which flags (missing doc.)?
> 

That's the current flags, I need to move/copy them to lavu.

> > +} AVFrameSubtitleRectangle;
> 
> The struct should probably be extendable (so its size is not part of
> the ABI).
> 

In the current design, there is no need to ever allocate them: you set the
number of rectangles in your AVFrame, then call av_frame_get_buffer()
which will allocate them.

> > +
> >  /**
> >   * This structure describes decoded (raw) audio or video data.
> >   *
> > @@ -182,7 +201,6 @@ typedef struct AVFrameSideData {
> >   * for AVFrame can be obtained from avcodec_get_frame_class()
> >   */
> >  typedef struct AVFrame {
> > -#define AV_NUM_DATA_POINTERS 8
> >      /**
> >       * pointer to the picture/channel planes.
> >       * This might be different from the first allocated byte
> > @@ -224,9 +242,12 @@ typedef struct AVFrame {
> >       * For packed audio, there is just one data pointer, and linesize[0]
> >       * contains the total size of the buffer for all channels.
> >       *
> > +     * For subtitles, each pointer corresponds to an AVFrameSubtitleRectangle.
> > +     *
> 
> It should also mention how many data pointers there are.
> 
> Btw. I'm very not much of a fan to have to deal with extended_data. And
> the duplication with data. (Does it even document what data[] contains
> for subtitles?)
> 

It's exactly like audio: data[] points on the firsts extended_data[]

[...]

I'll try to address every other comment and send a new and more complete
version later.

Thanks a lot of the review.

-- 
Clément B.