[FFmpeg-devel] Internal handling of subtitles in ffmpeg

Fri Jan 2 17:01:39 CET 2009

On Fri, Jan 02, 2009 at 02:32:18PM +0100, Michael Niedermayer wrote:
> what about?
> 
> Index: libavcodec/avcodec.h
> ===================================================================
> --- libavcodec/avcodec.h	(revision 16398)
> +++ libavcodec/avcodec.h	(working copy)
> @@ -2375,15 +2375,34 @@
>  
>  } AVPaletteControl attribute_deprecated;
>  
> +enum AVSubtitleType {
> +    BITMAP_SUBTITLE,                ///< A bitmap, pict will be set
> +    TEXT_SUBTITLE,                  ///< Plain text, the text and ass fields will be set

IMO: Plain text, the text field must be set and is authoritative. ass
and pict fields may contain approximations.

> +    ASS_SUBTITLE,                   ///< Text+formatting, the text and ass fields will be set

IMO: Formatted text, the ass field must be set and is authoritative. pict
and text fields may contain approximations.

Reason: why require generation of alternative formats when they are not
wanted?
Authoritative means that this format will be used when the input is in any
way processed by libav*; all others will be ignored.
(Actually, it is probably better to make authoritative mean that this
format is preferred if it is supported. I guess your reason for allowing
multiple ones in one AVSubtitleRect is to allow caching of conversion
results, right?)
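
A minimal sketch of the "preferred if it is supported" reading, to make it
concrete. Everything here is a hypothetical stand-in (SubType, SubRect,
sub_has, sub_pick, and the "supported" bitmask are made up for illustration
and are not part of libavcodec or of the patch):

```c
#include <stddef.h>

/* Hypothetical stand-ins that only mirror the patch; not libavcodec types. */
enum SubType { SUB_BITMAP, SUB_TEXT, SUB_ASS, SUB_NONE };

struct SubRect {
    enum SubType type;  /* the authoritative format of this rect */
    const void *pict;   /* non-NULL if a bitmap is present (may be an approximation) */
    const char *text;   /* non-NULL if plain text is present */
    const char *ass;    /* non-NULL if an ASS event line is present */
};

/* Returns 1 if the rect carries data for format t, 0 otherwise. */
static int sub_has(const struct SubRect *r, enum SubType t)
{
    switch (t) {
    case SUB_BITMAP: return r->pict != NULL;
    case SUB_TEXT:   return r->text != NULL;
    case SUB_ASS:    return r->ass  != NULL;
    default:         return 0;
    }
}

/*
 * supported: bit (1u << type) is set for every format the consumer handles.
 * The authoritative format wins if the consumer supports it; otherwise any
 * other format that is both present and supported is used as an
 * approximation.
 */
static enum SubType sub_pick(const struct SubRect *r, unsigned supported)
{
    if ((supported & (1u << r->type)) && sub_has(r, r->type))
        return r->type;
    for (int t = SUB_BITMAP; t <= SUB_ASS; t++)
        if ((supported & (1u << t)) && sub_has(r, (enum SubType)t))
            return (enum SubType)t;
    return SUB_NONE;
}
```

With this, a text-only consumer handed an ASS rect that also carries a
cached plain-text approximation would get the text, while an ASS-capable
consumer would always get the ASS line.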

>  typedef struct AVSubtitleRect {
>      uint16_t x;
>      uint16_t y;

///< BITMAP_SUBTITLE: position of the top left corner of the AVPicture
     TEXT_SUBTITLE: optional recommended center position. 0 means N/A
     ASS_SUBTITLE: no meaning, must be 0.

I do not like the special meaning of 0 for TEXT_SUBTITLE.
It would also mean that only one of pict, text and ass can be set.

>      uint16_t nb_colors;

Does that one serve any purpose? 
Also, it might make sense to also allow a colour for plain text.

> -    int linesize;
> -    uint32_t *rgba_palette;
> -    uint8_t *bitmap;
> +
> +    /**
> +     * data+linesize for the bitmap of this subtitle.
> +     * can be set for text/ass as well once they were rendered
> +     */
> +    AVPicture pict;
> +    enum AVSubtitleType type;
> +
> +    char *text;                     ///< 0 terminated plain UTF-8 text
> +
> +    /**
> +     * 0 terminated ASS/SSA compatible event line.
> +     * The presentation of this is unaffected by the other values in this
> +     * struct.
> +     */
> +    char *ass;

I think I'd tend towards
union {
    AVSubtitleBitmap pict;
    AVSubtitleText text;
    char *ass;
};

with
typedef struct AVSubtitleBitmap {
    uint16_t x, y, w, h;
    AVPicture pict;
} AVSubtitleBitmap;

typedef struct AVSubtitleText {
    int x, y;        // center position in percent (0 - 100), -1 not specified
    uint32_t color;  // RGBA colour. Full transparency means unspecified.
    char *text;
} AVSubtitleText;

Colour is somewhat questionable, I admit: why not also add bold, font size,
font name etc. until you have reimplemented ASS? I don't really know the
answer to that.
I suspect you have good reasons for not wanting a union; still, some
grouping would IMO be good.
Also I think AVSubtitle should specify which types might be used in the
AVSubtitleRects so you can do something simple in the spirit of
if (subs.types != TEXT_SUBTITLE)
  av_convert_subs(subs, TEXT_SUBTITLE);
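
To make the "types" idea a bit more concrete, here is a minimal sketch,
assuming the types are stored as bit flags so that one field can describe a
mix of rects. All names (SketchSubType, SketchSubtitle, sketch_collect_types,
sketch_needs_conversion) are made up for illustration, and av_convert_subs
above is likewise only a proposed name, not an existing function:

```c
#include <stddef.h>

/* Hypothetical bit-flag variant of the subtitle type enum. */
enum SketchSubType {
    SK_BITMAP = 1 << 0,
    SK_TEXT   = 1 << 1,
    SK_ASS    = 1 << 2
};

struct SketchRect {
    enum SketchSubType type;
};

struct SketchSubtitle {
    unsigned types;                  /* OR of the types of all rects */
    size_t nb_rects;
    const struct SketchRect *rects;
};

/* Compute the mask a decoder would store in SketchSubtitle.types. */
static unsigned sketch_collect_types(const struct SketchRect *rects, size_t n)
{
    unsigned mask = 0;
    for (size_t i = 0; i < n; i++)
        mask |= (unsigned)rects[i].type;
    return mask;
}

/*
 * Returns 1 if any rect has a type other than the wanted one, i.e. the
 * caller would have to convert before using the subtitle. An empty
 * subtitle (types == 0) needs no conversion.
 */
static int sketch_needs_conversion(const struct SketchSubtitle *s,
                                   enum SketchSubType wanted)
{
    return (s->types & ~(unsigned)wanted) != 0;
}
```

The point is only that the caller can decide with one test on the
AVSubtitle, without walking all the AVSubtitleRects itself.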

> The problems i have with your solution is
> 1. You only vaguely describe it, and it changes from argument to argument,
>    This makes it impossible to implement or compare properly against other
>    suggestions.

I don't have a solution to offer. Your suggestion sounded like something
that, if someone tried to implement it without constantly asking for
feedback, would end up too complex for me to use, and like something I
wouldn't find the motivation to reimplement movsub_bsf.c in, so I
complained.
From there I fear the discussion went into pointlessness real fast,
sorry, I still have to learn to notice that faster.

> 2. Using bitstream filters is not simple, even less so with the current
>    requirement of manual addition of them, but then i dont even know if you
>    still suggest that we should use them instead of decoder/encoder

I don't think I meant them as a proper solution but as "if someone is in
a hurry, text format subtitle conversion is already done in bitstream
filters and is known to work. Use it if you want a working solution now."
I guess I got lost in other arguments too much to make it understood
like that.

> if now considering 3 we allow several ass blobs per AVSubtitle and considering
> 2 we dont use bitstream filters, then your suggestion seems identical to mine.
> If not id like you to elaborate on what the remaining difference is.

Nothing significant. I'd tend to not allow more than one type per
AVSubtitleRect, make it easy to avoid AVSubtitles with different types
in AVSubtitleRect, use and provide only (center!) position but not
width/height for text.
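
A rough sketch of why a center position alone is enough for text: the
renderer only knows the size of the text box after laying it out, and can
then derive the top left corner itself. The names (PixelPos, text_top_left),
the fallback position for "not specified" and the clamping are my
assumptions for illustration, not part of any proposal:

```c
struct PixelPos {
    int x, y;
};

/* Keep v within [lo, hi]; if hi < lo (box larger than frame), return lo. */
static int clamp_int(int v, int lo, int hi)
{
    if (hi < lo)
        return lo;
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * cx_pct, cy_pct: recommended center in percent (0 - 100), -1 = not specified.
 * frame_w, frame_h: video frame size in pixels.
 * text_w, text_h: size of the rendered text box, known only to the renderer.
 * Returns the top left corner at which to draw the text box.
 */
static struct PixelPos text_top_left(int cx_pct, int cy_pct,
                                     int frame_w, int frame_h,
                                     int text_w, int text_h)
{
    struct PixelPos p;

    /* Assumed fallback: horizontally centered, near the bottom. */
    if (cx_pct < 0)
        cx_pct = 50;
    if (cy_pct < 0)
        cy_pct = 90;

    p.x = frame_w * cx_pct / 100 - text_w / 2;
    p.y = frame_h * cy_pct / 100 - text_h / 2;

    /* Keep the whole box inside the frame. */
    p.x = clamp_int(p.x, 0, frame_w - text_w);
    p.y = clamp_int(p.y, 0, frame_h - text_h);
    return p;
}
```

A width/height supplied by the decoder would have to be a guess about the
font the renderer ends up using, which is why I would rather not export it.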

> > > I was arguing to export values through a struct instead of a char*
> > > using a complex encoding.
> > 
> > And I say: If the problem is difficult enough, your struct becomes just
> > yet another complex encoding, i.e. you win a minor simplification by doubling
> > the number of representations.
> 
> Let me be rude, did you ever read the ASS/SSA spec? Or do you just assume
> its a magic black box that has hundreds of obscene fields in convoluted
> interrelations?

No, I did not read the ASS spec; maybe I should have added the disclaimer
that so far I did not consider its specifics relevant to the discussion.
I was disputing the assertion that splitting it into a struct always
makes things significantly less complex.
I mostly started mentioning this because I think that the lower
flexibility of a binary (struct) representation compared to a text one
makes it harder to accurately convert _every_ (current and future)
subtitle format to it, which I understood (maybe misunderstood) to be the
"non-negotiable" goal of the "one subtitle representation".

[...]
> > 2) Next, should the AVSubtitleRects be in logical order or rendering order?
> > Your examples for the simplest text rendering assumed logical order.
> > But you might have something like
> > > small multiline subtitle text
> > > REALLY LARGE TEXT
> > to be read in that order, but rendered overlapping, with the large text
> > of course below (it would not be readable the other way round).
> > So if you want to actually render them on screen that must be done in
> > the opposite order from how you would read them.
> 
> I do not understand your example

Because you assumed it to be specific to, or to apply to, a subtitle format
in existence. I guess instead of "convoluted examples" I should have said
"completely made-up, theoretical examples of things that might happen
with some future mis-designed subtitle format" :-(
But it probably does not make sense anyway; it was based on the
assumption that it would be too inefficient to put each item on its own
layer (when converting from a hypothetical format that does not use
collision detection and does not use layers, but draws in the
order in which the subtitle is stored and also specifies a "reading
order").

Greetings,
Reimar Döffinger