[FFmpeg-devel] [RFC] AVSubtitles rework

Clément Bœsch ubitux at gmail.com
Thu Sep 20 23:53:49 CEST 2012


On Thu, Sep 20, 2012 at 07:25:11PM +0200, Nicolas George wrote:
> I believe you summarise the situation correctly, at least as far as I am
> aware.
> 
> 
> L'octidi 18 fructidor, an CCXX, Clément Bœsch a écrit :
> > Mmh OK. Well then should we introduce an experimental AVSubtitle2 directly
> > into libavutil to ease the integration with libavfilter later on?
> > 
> > If we are to start a new structure, we should consider designing it the
> > proper way at first, so a subtitle structure being able to store two types
> > of subtitles as we already discussed:
> > 
> >  == bitmap subtitles ==
> > 
> > For the bitmap stuff I don't have many opinions on how it should be done.
> > IIRC, we agreed that the current AVSubtitle structure was mostly fine
> > (since AVSubtitle was designed for that kind of subtitle in the first
> > place) except that it is missing the pixel format information, and we
> > were wondering where to put that info (in each AVSubtitle2->rects or at
> > the root of the AVSubtitle2 structure).
> 
> Nothing to add to that on the basic question.
> 
> On the detail question you raised in the last parentheses, I would suggest
> both, with the enforced guarantee that the global pixel format is set if and
> only if all rectangles have the same pixel format.
> 
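For illustration, that invariant could be enforced with a small helper along
these lines (simplified structs, hypothetical names, not actual lavu/lavc API):

```c
/* Simplified, hypothetical stand-ins for the structures under discussion:
 * each rectangle carries its own pixel format, and the global one is
 * derived from them. */
typedef struct {
    int pix_fmt;                /* per-rectangle pixel format */
} SubRect;

typedef struct {
    int pix_fmt;                /* -1 (unset) unless all rects agree */
    int nb_rects;
    SubRect *rects;
} Sub2;

/* Enforce: the global pix_fmt is set if and only if every rectangle
 * has the same pixel format. */
static void update_global_pix_fmt(Sub2 *sub)
{
    int i, fmt = sub->nb_rects ? sub->rects[0].pix_fmt : -1;

    for (i = 1; i < sub->nb_rects; i++)
        if (sub->rects[i].pix_fmt != fmt)
            fmt = -1;
    sub->pix_fmt = fmt;
}
```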
> > 
> >  == styled events for text based subtitles ==
> > 
> > For the styled text events, each AVSubtitle2 would have, instead of the
> > AVSubtitle->rects[N]->ass strings, N exploitable AVSubtitleEvent entries
> > (or maybe only one?).
> 
> If "event" refers to a line of ASS script or a paragraph of SRT file, with
> their start and end timestamps, then I believe that "only one" is the
> correct choice.
> 
> Now, do we allow several stanzas per event, each with its own styled text,
> like bitmaps allow several rectangles? I am not sure. None of the subtitle
> formats I know requires it, but that may change in the future. On the other
> hand, "several stanzas" is just an additional level in an abstract tree.
> 

I wasn't thinking about the stanzas, but maybe they could have some use in
formats like SAMI where you have a "talker" field. That field could also
carry some markup, and is kind of separate from the text. But IIRC, the
timing is kind of special and it could be a real pain to support. Anyway,
I'm fine with only one event, and in that case a separate thing from the
rects to make it pretty clear.
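To make the "one event, separate from the rects" idea concrete, the struct
could roughly look like this (purely illustrative, all names are placeholders):

```c
/* Purely illustrative sketch, not a proposal of the final API: a
 * subtitle carries either bitmap rectangles or a single styled text
 * event, kept in separate fields rather than mixed together. */
typedef struct AVSubtitleRectSketch  AVSubtitleRectSketch;   /* bitmap rect */
typedef struct AVSubtitleEventSketch AVSubtitleEventSketch;  /* styled text */

typedef struct {
    /* bitmap path */
    int nb_rects;
    AVSubtitleRectSketch **rects;

    /* text path: one event, kept apart from the rects */
    AVSubtitleEventSketch *event;
} AVSubtitle2Sketch;
```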

> >	 This is what the subtitle decoders would output (in a decode2
> > callback for example, depending on how we keep compat with AVSubtitle) and
> > what users would exploit (by reading that AST to use it in their
> > rendering engine/converter/etc., or simply passing it along to our
> > encoders and muxers). Additionally, we may want to provide a "TEXT"
> > encoder producing a raw text version (stripping all markup) for simple
> > rendering engines.
> > 
> > So, here is a suggestion of the classic workflow:
> > 
> >                                                      /* common transmuxing/coding path */
> > DEMUXER -> [AVPacket] -> DECODER -> [AVSubtitle2] -> ENCODER -> [AVPacket] -> MUXER
> >                                           |
> >                                           |
> >                         /* lavfi/hardsub or video player path */
> >                                           |
> >                                          / \
> >                                         /   \
> >        custom rendering                /     \
> >        engine using the  <--------- text?  bitmap?
> >       AVSubtitle2->events            /         \
> >            structure                /           \
> >                             libass to render?   bitmap overlay
> >                                  /     \
> >                            yes  /       \ no
> >                                /         \
> >                      ENCODER:assenc   ENCODER:textenc          (<== both lavc encoders)
> >                              /             \
> >    AVPacket->data is an ASS /               \
> >    payload (no timing)     /                 \ AVPacket->data is raw text
> >  (need to mux for timings)/                   \
> >                          /                     \
> >                  libass:parse&render    freetype/mplayer-osd/etc
                                            ^^^^^^^^^^^^^^^^^^^^^^^^
                                      edit: I forgot to add printf here! ;)
> 
> That looks mostly right. I wonder if we should really require using an
> encoder to produce text or ASS packets, or directly provide an API to do so:
> av_subtitle_to_ass(AVSubtitle2 *sub, char **ass);
> av_subtitle_to_text(AVSubtitle2 *sub, char **text);
> The code would be the same, it would only be a different entry point, so it
> can be discussed later.
> 

Ah yeah, I remember we discussed this a while ago; I wasn't sure how/where
you would provide it. And sure, we could have that kind of helper on top
of this design as long as that workflow remains. Though, without any
encoder context, I wonder how you would achieve that cleanly.
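For instance, a context-free "to text" helper boils down to dropping the style
information; the toy function below does it by re-parsing ASS markup, though
the real helper would of course walk the AVSubtitle2 event structure instead:

```c
#include <stdlib.h>
#include <string.h>

/* Toy illustration only: strip {...} override blocks from an ASS text
 * chunk and turn \N (or \n) into a newline. Caller frees the result. */
static char *sub_to_text(const char *ass)
{
    char *out = malloc(strlen(ass) + 1), *p = out;

    if (!out)
        return NULL;
    while (*ass) {
        if (*ass == '{') {                      /* skip {\i1}-style overrides */
            const char *end = strchr(ass, '}');
            if (!end)
                break;
            ass = end + 1;
        } else if (ass[0] == '\\' && (ass[1] == 'N' || ass[1] == 'n')) {
            *p++ = '\n';                        /* line break markup */
            ass += 2;
        } else {
            *p++ = *ass++;
        }
    }
    *p = 0;
    return out;
}
```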

> > At least, that's how I would see the usage from a user perspective.
> > 
> > Now if we agree on such a model, we need to focus on how to store the
> > events & styles. Basically, each AVSubtitle2 must make the following
> > available as an AST:
> 
> Note that it does not need to be a tree. It could be just a single big UTF-8
> string with a list of (start, end, style) spans.
> 

And you would use a special binary escape code to refer to the span
information? Why not; maybe a linked list of spans would be more
appropriate, I don't know.
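Either way, the flat representation could be as simple as this (hypothetical
names), with renderers querying which spans cover a given position:

```c
#include <stddef.h>

/* Hypothetical flat layout: the whole event text as one UTF-8 buffer,
 * plus an array of byte-range spans pointing at style definitions. */
typedef struct {
    size_t start, end;      /* byte offsets into the text buffer */
    int    style_id;        /* index into a global/event style table */
} TextSpan;

typedef struct {
    char     *text;         /* full event text, UTF-8, markup-free */
    int       nb_spans;
    TextSpan *spans;        /* spans may nest/overlap by offsets */
} FlatEvent;

/* Collect the styles active at byte position pos (what a renderer
 * would do while walking the text). Returns the number of ids filled. */
static int active_styles(const FlatEvent *ev, size_t pos, int *ids, int max)
{
    int i, n = 0;

    for (i = 0; i < ev->nb_spans && n < max; i++)
        if (pos >= ev->spans[i].start && pos < ev->spans[i].end)
            ids[n++] = ev->spans[i].style_id;
    return n;
}
```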

> >  - an accessible header with all the global styles (such as an external
> >    .css for WebVTT, the event styles in the ASS header, palettes with some
> >    formats, etc.); maybe that one would belong in the AVCodecContext
> 
> It must be in AVCodecContext because the encoders will need it at init stage
> to create the extradata.
> 

Right, ok.

> >  - one (or more?) events with links to style structures: either in the
> >    global header, or associated with that specific event. BTW, this
> >    "style" info must be able to contain various things such as
> >    karaoke or ruby annotations (WebVTT supports that,
> >    https://en.wikipedia.org/wiki/Ruby_character)
> 
> Seems right.
> 
> > We still need to agree on how to store that (Nicolas already proposed
> > something related), but I'd like to check first whether everyone agrees
> > with such a model. And then we can move on to the API for text styling.
> 
> The issue is with "flattening" the styles. Currently, if I convert ASS to
> SRT, I get this all over the place:
> 
> <font face="DejaVu Serif" size="22">I know that.</font>
> 
> just because the Default style defines the font to be DejaVu Serif at size
> 12. We want some kind of "relevance" field that would allow discarding
> uninteresting styles when the format cannot express them efficiently. And
> of course, it needs to be user-settable. I believe this is the hardest part
> of the design.
> 

In that particular case, the SRT encoder could just ignore any style that
matches the global header settings, but I'm not really sure what the
problem is here: the input wants to change the default (which is "Arial"
IIRC in the case of ASS), so I think it's important to honor this font
(the author of the style clearly wants a particular look).

OTOH, if that font face were set to "Arial", it might indeed be wise for
the SRT encoder to ignore that style. I think a simple workaround for this
problem could be to have the decoder mark the event customization span
(which is stored in the global header or in the event) as "default".

Here is a sample prototype of such model:

    typedef enum {
        AVSUBTITLE_SETTING_TYPE_RAW_TEXT      = MKBETAG('t','e','x','t'),
        AVSUBTITLE_SETTING_TYPE_COMMENT       = MKBETAG('c','o','m',' '),
        AVSUBTITLE_SETTING_TYPE_TIMING        = MKBETAG('t','i','m','e'),

        AVSUBTITLE_SETTING_TYPE_FONTNAME      = MKBETAG('f','o','n','t'),
        AVSUBTITLE_SETTING_TYPE_FONTSIZE      = MKBETAG('f','s','i','z'),

        AVSUBTITLE_SETTING_TYPE_COLOR         = MKBETAG('c','l','r','1'),
        AVSUBTITLE_SETTING_TYPE_COLOR_2       = MKBETAG('c','l','r','2'),
        AVSUBTITLE_SETTING_TYPE_COLOR_OUTLINE = MKBETAG('c','l','r','O'),
        AVSUBTITLE_SETTING_TYPE_COLOR_BACK    = MKBETAG('c','l','r','B'),

        AVSUBTITLE_SETTING_TYPE_BOLD          = MKBETAG('b','o','l','d'),
        AVSUBTITLE_SETTING_TYPE_ITALIC        = MKBETAG('i','t','a','l'),
        AVSUBTITLE_SETTING_TYPE_BORDER_STYLE  = MKBETAG('b','d','e','r'),
        AVSUBTITLE_SETTING_TYPE_OUTLINE       = MKBETAG('o','u','t','l'),
        AVSUBTITLE_SETTING_TYPE_SHADOW        = MKBETAG('s','h','a','d'),
        AVSUBTITLE_SETTING_TYPE_ALIGNMENT     = MKBETAG('a','l','g','n'),
        AVSUBTITLE_SETTING_TYPE_MARGIN_L      = MKBETAG('m','a','r','L'),
        AVSUBTITLE_SETTING_TYPE_MARGIN_R      = MKBETAG('m','a','r','R'),
        AVSUBTITLE_SETTING_TYPE_MARGIN_V      = MKBETAG('m','a','r','V'),
        AVSUBTITLE_SETTING_TYPE_ALPHA_LEVEL   = MKBETAG('a','l','p','h'),
        AVSUBTITLE_SETTING_TYPE_ENCODING      = MKBETAG('e','n','c',' '),
    } AVSubtitleXSettingType;

    typedef struct {
        int type; ///< one of the AVSUBTITLE_SETTING_TYPE_*
        int is_default; ///< "default" flag ("default" itself is a C keyword)
        union {
            char *s;
            double d;
            int i;
            int64_t i64;
            uint32_t u32;
        };
    } AVSubtitleXSetting;

    typedef struct {
        char *name;
        int nb;
        AVSubtitleXSetting *v;
    } AVSubtitleXSettings;

    typedef struct {
        AVSubtitleXSettings *g_settings;
        AVSubtitleXSettings *chunks;
    } AVSubtitleXEvent;

Basically, every AVSubtitleXEvent has a list of chunk styles, and shares
with the other events a pointer to the global settings, which hold the
other subtitle setting definitions. Now note the default flag in
AVSubtitleXSetting:

 - If the SRT encoder sees a font setting with the default flag set, it
   doesn't print anything.
 - If the ASS encoder sees a font setting with the default flag set, it
   prints its own default (if required) in the style definition/event; in
   its case, "Arial".

The decoder could even store the actual default value taken from the
source markup, but I'm not sure that would be useful. Nothing prevents it
from doing so anyway.
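In encoder code, the skip-if-default rule would look roughly like this (toy
structures rather than the prototype above, just to show the idea):

```c
#include <stdio.h>

/* Toy stand-in for a single "font name" setting with the default flag. */
typedef struct {
    int         is_default; /* set when the value matches the format default */
    const char *font;
} FontSetting;

/* SRT-style encoder: emit <font> markup only for non-default settings. */
static int emit_font(char *dst, size_t size, const FontSetting *s,
                     const char *text)
{
    if (s->is_default)
        return snprintf(dst, size, "%s", text);
    return snprintf(dst, size, "<font face=\"%s\">%s</font>", s->font, text);
}
```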

Maybe I misunderstood something in the problem you described, though.

Thanks a lot for your comment!

-- 
Clément B.