[FFmpeg-devel] Format of decoded subtitles (was: matroska: Identify S_TEXT/UTF-8 tracks as SRT and not TEXT.)

Clément Bœsch ubitux at gmail.com
Thu May 24 16:15:41 CEST 2012

On Thu, May 24, 2012 at 02:19:02PM +0200, Nicolas George wrote:
> On quartidi 4 Prairial, year CCXX, Clément Bœsch wrote:
> > Most players use ASS rendering for all subtitles (assuming a conversion
> > of the original subtitle markup into ASS), which is BTW what we do in our
> > text subtitle decoders (SubRip, MicroDVD, and JACOsub). ASS rendering is
> > expected by most people even for these formats.
> > 
> > ASS also handles almost every "useful" markup (of course I have a bunch
> > of exceptions in mind) at the moment. If a new subtitle format is meant to
> > replace ASS, it will likely keep some kind of backward compatibility with it
> > (otherwise it will be a pain for almost all current decoders/players),
> > and so moving our internal formats to this new one should not be much of a
> > problem.
> > 
> > I'm not sure about what you mean by handling the markup syntax the same
> > way we handle pixel/sample formats.
> What I meant was this: in AVFrame, the decoded video is in arrays of
> integers, but there is a pix_fmt field that says if these arrays are YUV420P
> or RGBA. If we have one and want the other, there is libswscale to do the
> conversion; sometimes it is lossless, sometimes it is not.
> For decoded text subtitles, there would be a markup_syntax field with values
> like SUB_MARKUP_ASS or SUB_MARKUP_HTML. And an API to convert, losslessly or
> not, from one markup to another.
> Of course, if we have a perfect round-trip MARKUP_X -> MARKUP_Y -> MARKUP_X
> (this can happen even if Y has features that X does not have, as they will
> not be used in a Y converted from X; OTOH, if Y is case-sensitive and X is
> not, we may lose the case information, which may be considered acceptable),
> then MARKUP_X is useless and we can always convert to and from Y.
> (This is not true for video: we cannot convert everything to 32 bits per
> component because of performance issues.)

So if I understand correctly, you are proposing a model where a
libsubconvert does any kind of markup conversion, instead of the current
model where the decoder "encodes" the event as ASS, bitmap or text?

I don't think we really need to change this; I'm not sure I see the
direct benefit.
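
To make sure we are talking about the same thing, here is a minimal
sketch of that model as I understand it. Everything in it is hypothetical
(SubMarkup, convert_markup and the converter table are names made up for
illustration, not an existing API), and it only handles the trivial
<i>/<b>/<u> tags and their ASS override equivalents:

```python
import re
from enum import Enum


class SubMarkup(Enum):
    """Hypothetical markup tag, playing the role pix_fmt plays for video."""
    ASS = "ass"
    HTML = "html"  # SubRip-style <i>/<b>/<u> tags


def html_to_ass(text):
    """Turn <i>, <b>, <u> (any case) into ASS override tags such as {\\i1}."""
    def repl(m):
        closing, tag = m.group(1), m.group(2).lower()
        return "{\\%s%d}" % (tag, 0 if closing else 1)
    return re.sub(r"<(/?)([ibu])>", repl, text, flags=re.IGNORECASE)


def ass_to_html(text):
    """Turn {\\i1}/{\\i0} style override tags back into lowercase HTML tags."""
    def repl(m):
        tag, on = m.group(1), m.group(2) == "1"
        return "<%s%s>" % ("" if on else "/", tag)
    return re.sub(r"\{\\([ibu])([01])\}", repl, text)


# Assumed converter registry, analogous in spirit to what libswscale
# does for pixel formats.
_CONVERTERS = {
    (SubMarkup.HTML, SubMarkup.ASS): html_to_ass,
    (SubMarkup.ASS, SubMarkup.HTML): ass_to_html,
}


def convert_markup(text, src, dst):
    """Convert text from one markup syntax to another (possibly lossy)."""
    if src is dst:
        return text
    return _CONVERTERS[(src, dst)](text)
```

With this, "<I>Hi</I>" converted to ASS and back comes out as "<i>Hi</i>":
the round trip keeps the styling but loses the tag case, which is the
kind of acceptable loss you describe.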

> If, as you say, the ASS markup can express all the features of any other
> known markup, then we can adopt ASS as a universal markup syntax, and
> expect all subtitle codecs to encode/decode the markup.

It should, for text-based subtitles. At least for the "useful" markup. But
I admit ASS has some annoying limitations, especially with some particular
subtitle features:

 - The first one I have in mind is that there is no text representation
   for the "lasts until the next subtitle" feature. Example: MicroDVD (and
   SAMI, which I'm working on ATM) have features like this:

   {500}{600}this is printed starting at frame 500 and last until frame 600
   {1234}{}this starts being displayed at frame 1234...
   {1400}{}...and will be "replaced" by this text until the end.

   We can express this in the AVPacket (pkt.duration = -1 for example),
   but when encoding the ASS event, it's not possible to write 00:01:02:03
   -1:-1:-1:-1 for instance. So we need to work around this.

 - One random limitation with respect to SAMI: this insane HTML-based
   format (actually not HTML at all, but fully CSS2 compliant...) has two
   subtitle placeholders. Basically it's two subtitles in one (one to
   print the speaker's name, and one for what's being said), relying on
   various presentation markup expectations which ASS can't honor (I don't
   want to try converting <table> into ASS markup, for example).

 - Another crazy feature, but of limited usefulness: the <img> tag in SAMI
   (yes...) or even in JACOsub.

 - The last one is the precision limitation we already talked about (time
   base of 1/100 for ASS, and 1/1000 for formats like SRT).
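
To illustrate the first and last points, here is a rough sketch of one
possible workaround, not what any decoder currently does. It assumes the
converter can look ahead at the next line (a packet-by-packet decoder
cannot), a constant integer frame rate of 25, and an arbitrary 5 second
fallback for the final open-ended event. The function names are made up
for the example:

```python
import re
from fractions import Fraction

# MicroDVD line: {start_frame}{end_frame or empty}text
_LINE = re.compile(r"\{(\d+)\}\{(\d*)\}(.*)")


def ass_time(ms):
    """Format milliseconds as H:MM:SS.cc; the last digit of ms is dropped."""
    cs = ms // 10  # ASS only stores centiseconds
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return "%d:%02d:%02d.%02d" % (h, m, s, cs)


def microdvd_to_events(lines, fps=25, fallback_ms=5000):
    """Return (start, end, text) tuples with ASS-style timestamps.

    An empty end frame is resolved to the start of the next event, or to
    start + fallback_ms for the last event (an assumption of this sketch).
    """
    parsed = []
    for line in lines:
        m = _LINE.match(line)
        if m:
            start, end, text = m.groups()
            parsed.append((int(start), int(end) if end else None, text))

    def to_ms(frame):
        return int(Fraction(frame * 1000) / fps)

    events = []
    for i, (start, end, text) in enumerate(parsed):
        start_ms = to_ms(start)
        if end is not None:
            end_ms = to_ms(end)
        elif i + 1 < len(parsed):
            end_ms = to_ms(parsed[i + 1][0])  # lasts until the next subtitle
        else:
            end_ms = start_ms + fallback_ms   # nothing follows: arbitrary cap
        events.append((ass_time(start_ms), ass_time(end_ms), text))
    return events
```

Fed the three lines above, the {1234}{} event gets 0:00:49.36 to
0:00:56.00, i.e. it ends where the {1400}{} event starts. The precision
issue shows up in ass_time() itself: 1234 ms (SRT's 00:00:01,234) becomes
0:00:01.23.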

> > BTW, I had in mind something about subtitles: I think the decode subtitles
> > API should do the ASS rendering if possible; calling
> > avcodec_decode_subtitles() with a "render_ass" flag to decode
> > ASS-compliant subtitles (aka the decoder returns ASS packets) into a
> > bitmap layout ready-to-blit by the player/transcoder (ffplay can already
> > do that kind of subtitles bitmap rendering). It might avoid some pain with
> > lavfi (except hardsubbing, does anyone see any more potentially useful
> > subtitles filtering for lavfi?).
> > 
> > Last time I looked for this solution I expected quite a few problems
> > (which I can't remember now I admit), but maybe it's worth looking at this
> > again.
> There are a lot of issues with that:
> First, an application may want to alter the subtitles before rendering them
> (stupid example: use an automated translation system), so we at least need
> an entry point for that. That is not much of a problem.

For this particular use case, isn't it possible to just alter the demuxed
subtitle packet (ASS, text or bitmap field) before feeding it to the
decoder doing the rendering?

> Second, there is the issue Reimar raised when I implemented multi-rectangle
> rendering in mplayer a few weeks back: subtitles often occupy a small
> proportion of the whole video, but the closest-fit rectangle may be huge.
> Performance-wise, this is not very good.
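
A quick back-of-the-envelope sketch of that cost, with made-up rectangle
coordinates for a 1280x720 frame (one line at the top, one at the
bottom); the helper names are only for illustration:

```python
def bounding_rect(rects):
    """Smallest (x, y, w, h) rectangle containing all the given rectangles."""
    x0 = min(x for x, y, w, h in rects)
    y0 = min(y for x, y, w, h in rects)
    x1 = max(x + w for x, y, w, h in rects)
    y1 = max(y + h for x, y, w, h in rects)
    return (x0, y0, x1 - x0, y1 - y0)


def area(rect):
    """Pixel count of an (x, y, w, h) rectangle."""
    return rect[2] * rect[3]


# Hypothetical layout: a sign translation at the top, dialogue at the bottom.
rects = [(340, 20, 600, 40), (240, 650, 800, 50)]
single = bounding_rect(rects)           # one closest-fit rectangle
separate = sum(area(r) for r in rects)  # pixels touched with two rectangles
```

Here the single rectangle is (240, 20, 800, 680), so 544000 pixels to
blend against 64000 for the two separate rectangles.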

> Third, rendering vectorial contents requires the target resolution, and that
> depends on where exactly in the filter sequence the overlay is applied.

Ah yes, this was one of the issues. But don't we have width/height
information in the ASS header?
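
What I have in mind is the PlayResX/PlayResY pair in the [Script Info]
section. A minimal sketch of reading it back, assuming the header is
available as a plain string (ass_play_res is a made-up helper, and the
sample values are only an example):

```python
def ass_play_res(header):
    """Return (PlayResX, PlayResY) from an ASS header; None if missing."""
    values = {}
    in_script_info = False
    for raw in header.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Section names such as [Script Info] or [V4+ Styles]
            in_script_info = line.lower() == "[script info]"
            continue
        if in_script_info and ":" in line and not line.startswith(";"):
            key, _, value = line.partition(":")
            values[key.strip().lower()] = value.strip()

    def as_int(key):
        try:
            return int(values[key])
        except (KeyError, ValueError):
            return None

    return as_int("playresx"), as_int("playresy")


sample_header = """[Script Info]
ScriptType: v4.00+
PlayResX: 384
PlayResY: 288

[V4+ Styles]
"""
```

ass_play_res(sample_header) gives (384, 288); a header without these
fields gives (None, None), in which case the renderer still has to get
the resolution from somewhere else.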

This makes me realize it is a problem if we have a subtitle format which
says "use the input video resolution to do <whatever>" (and doesn't have it
stored in its header like ASS) and we need to generate an ASS extradata
header with the input video resolution we don't have access to (or do

> Fourth, it needs to handle overlapping subtitles. Even with seeking.

Wouldn't using lavfi solve the problem?

> Fifth, we can not have ffmpeg depend on an external library like libass for
> one of its core features. Even worse: for correct regression testing, we
> would need internal handling of fonts.

Well, rendering can be done conditionally, even in the core. And anyway, I
don't expect to use anything other than libass for doing it.

> Hum, it looks like I am bashing your suggestion; it is not my purpose. Your
> suggestion has a lot of merits.

I had long wondered why this method wasn't preferred. This kind of
criticism is useful :)

BTW, I don't think encoders for (text) subtitles are much needed:
most people just want to render SubRip, MicroDVD and ASS, and possibly
convert an old, deprecated, not widely used format to a more modern one
with markup, like ASS. This is why I have some doubts about a
"libsubconvert" meant to do all kinds of crazy conversions.

Clément B.