[FFmpeg-devel] [RFC] AVSubtitles rework
nicolas.george at normalesup.org
Thu Sep 20 19:25:11 CEST 2012
I believe you summarise the situation correctly, at least as far as I am
concerned.

L'octidi 18 fructidor, an CCXX, Clément Bœsch a écrit :
> Mmh OK. Well then should we introduce an experimental AVSubtitle2 directly
> into libavutil to ease the integration with libavfilter later on?
> If we are to start a new structure, we should consider designing it the
> proper way at first, so a subtitle structure being able to store two types
> of subtitles as we already discussed:
> == bitmap subtitles ==
> For the bitmap stuff I don't have many opinions on how it should be done.
> IIRC, we agreed that the current AVSubtitle structure was mostly fine
> (since AVSubtitle was originally designed for that kind of subtitle),
> except that it is missing the pixel format information, and we were
> wondering where to put that info (in each AVSubtitle2->rects or at the
> root of the AVSubtitle2 structure).
Nothing to add to that on the basic question.
On the detail question you raised in the last parentheses, I would suggest
both, with the enforced guarantee that the global pixel format is set if and
only if all rectangles have the same pixel format.
> == styled events for text based subtitles ==
> For the styled text events, each AVSubtitle2 would have, instead of a
> AVSubtitle->rects[N]->ass an exploitable N AVSubtitleEvent (or maybe only
> one?).
If "event" refers to a line of ASS script or a paragraph of an SRT file,
with their start and end timestamps, then I believe that "only one" is the
right answer.
Now, do we allow several stanzas per event, each with its own styled text,
like bitmaps allow several rectangles? I am not sure. None of the subtitles
formats I know require it, but it may change in the future. On the other
hand, "several stanzas" is just an additional level in an abstract tree.
> This is what the subtitles decoders would output (in a decode2
> callback for example, depending on how we keep compat with AVSubtitle) and
> what the users would exploit (by reading that AST to use it in their
> rendering engine/converter/etc, or simply pass it along to our encoders
> and muxers). Additionally, we may want to provide a "TEXT" encoder to
> provide a raw text version (stripping all markup) for simple renderers.
> So, here is a suggestion of the classic workflow:
> /* common transmuxing/coding path */
> DEMUXER -> [AVPacket] -> DECODER -> [AVSubtitle2] -> ENCODER -> [AVPacket] -> MUXER
> /* lavfi/hardsub or video player path */
> / \
> / \
> custom rendering / \
> engine using the <--------- text? bitmap?
> AVSubtitle2->events / \
> structure / \
> libass to render? bitmap overlay
> / \
> yes / \ no
> / \
> ENCODER:assenc ENCODER:textenc (<== both lavc encoders)
> / \
> AVPacket->data is an ASS / \
> payload (no timing) / \ AVPacket->data is raw text
> (need to mux for timings)/ \
> / \
> libass:parse&render freetype/mplayer-osd/etc
That looks mostly right. I wonder if we should really require using an
encoder to produce text or ASS packets, or directly provide an API to do so:
av_subtitle_to_ass(AVSubtitle2 *sub, char **ass);
av_subtitle_to_text(AVSubtitle2 *sub, char **text);
The code would be the same, it would only be a different entry point, so it
can be discussed later.
> At least, that's how I would see the usage from a user perspective.
> Now if we agree with such model, we need to focus on how to store the
> events & styles. Basically, each AVSubtitle2 must make available as AST
> the following:
Note that it does not need to be a tree. It could be just a single big UTF-8
string with a list of (start, end, style) spans.
> - an accessible header with all the global styles (such as an external
> .css for WebVTT, the event styles in the ASS header, palettes with some
> formats, etc.); maybe that one would belong in the AVCodecContext
It must be in AVCodecContext because the encoders will need it at init stage
to create the extradata.
> - one (or more?) events with links to styles structure: either in the
> global header, or associated with that specific event. BTW, these
> "styles" info must be able to contain various information such as
> karaoke or ruby stuff (WebVTT supports that, for instance).
> We still need to agree on how to store that (Nicolas already proposed
> something related), but I'd like to check if everyone would agree
> with such a model at first. And then we might engage in the API for text
> subtitles.
The issue is with "flattening" the styles. Currently, if I convert ASS to
SRT, I get this all over the place:
<font face="DejaVu Serif" size="22">I know that.</font>
just because the Default style defines the font to be DejaVu Serif at size
12. We want some kind of "relevance" field that would allow discarding
uninteresting styles when the format cannot express them efficiently. And
of course, it needs to be user-settable. I believe this is the hardest part
of the design.