[FFmpeg-devel] Format of decoded subtitles (was: matroska: Identify S_TEXT/UTF-8 tracks as SRT and not TEXT.)

Thu May 31 17:07:33 CEST 2012

On sextidi 6 Prairial, year CCXX, Clément Bœsch wrote:
> So if I understand well, you would propose a model with libsubconvert
> doing any kind of markup conversion instead of the current model where the
> decoder is "encoding" the event in ASS, bitmap or text?

Well, it does not need to be a separate library per se, but I really think
we need some kind of:

	ctx = avsub_markup_convert_init(ASS, HTML);
	avsub_markup_convert(ctx, sub_ass, sub_html);

or something.

> It should, for text-based subtitles. At least for the "useful" markup. But
> I admit ASS has some annoying limitations, especially with some particular
> subtitles features:
> 
>  - the first one I have in mind is that there is no text representation
>    for the "last up to the next subtitles" feature. Example: MicroDVD (and
>    SAMI which I'm working on ATM) have features like this:
> 
>    {500}{600}this is printed starting at frame 500 and last until frame 600
>    {1234}{}this starts being displayed at frame 1234...
>    {1400}{}...and will be "replaced" by this text until the end.
> 
>    We can express this in the AVPacket (pkt.duration = -1 for example),
>    but to encode the ASS event, it's not possible to have 00:01:02:03
>    -1:-1:-1:-1 for instance. So we need to workaround this.

I am not sure I follow you: this is not markup, this is timing, and IMHO,
timing belongs in the demuxer and should be decoded by it. For this example,
the demuxer should output packets like this:

  { .pts =  500, .duration = 100, .data = "this is printed starting..." },
  { .pts = 1234, .duration = 166, .data = "this starts being displayed..." },
  { .pts = 1400, .duration = PTS_MAX - 1400, .data = "...and will..." },
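
To make the idea concrete, here is a rough sketch of how a demuxer could
resolve the open-ended "{1234}{}" events before emitting packets. SubEvent,
SUB_OPEN_END, SUB_PTS_MAX and resolve_open_durations() are hypothetical names
made up for this illustration, not existing lavf API, and the events are
assumed to be sorted by start time already:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not FFmpeg API: SubEvent mimics the pts/duration
 * pair of a packet, SUB_PTS_MAX plays the role of PTS_MAX in the example. */
#define SUB_PTS_MAX  INT64_MAX
#define SUB_OPEN_END (-1)       /* marks "{1234}{}": no end frame given */

typedef struct SubEvent {
    int64_t pts;                /* start, in frames */
    int64_t duration;           /* length in frames, or SUB_OPEN_END */
    const char *text;
} SubEvent;

/* Resolve open-ended events in place. An open event lasts until the next
 * event starts; the last one lasts until SUB_PTS_MAX. */
void resolve_open_durations(SubEvent *ev, size_t n)
{
    assert(ev != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ev[i].duration != SUB_OPEN_END)
            continue;
        if (i + 1 < n)
            ev[i].duration = ev[i + 1].pts - ev[i].pts;
        else
            ev[i].duration = SUB_PTS_MAX - ev[i].pts;
    }
}
```

Run on the three MicroDVD lines quoted above, this should give the same
durations as the packets listed: 100, 166 and SUB_PTS_MAX - 1400. The point
is only that none of this needs to be visible in the markup.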

>  - One random limitation against SAMI: this insane HTML-based format
>    (actually not HTML at all, but full CSS2 compliant...), has two
>    subtitles place holders. Basically it's two subtitles in one (one to
>    print the talker name, and one for what's being said), relying on
>    various presentation markup expectation which ASS can't honor (I don't
>    want to try converting <table> into ASS markup for example).
> 
>  - Other crazy, but of limited usefulness: <img> tag in SAMI (yes...) or
>    even in JACOSub.

Even if they are crazy and we will never support them for rendering, we need
to support them for encoding and decoding and stream copy. Therefore, I do
not believe we can use ASS as a universal markup.

The pseudo-HTML of SRT, OTOH, can pretty well be converted into ASS and
back.
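
As an illustration of the SRT -> ASS direction, here is a minimal sketch that
maps the <i>, <b> and <u> tags to the corresponding ASS override tags and
copies everything else verbatim. srt_to_ass_markup() is a hypothetical
helper written for this mail, not an existing function; it ignores <font>,
line breaks and malformed nesting, which a real converter would need to
handle:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert <i>/<b>/<u> and their closing tags to ASS
 * override tags. Returns 0 on success, -1 if dst is too small. */
int srt_to_ass_markup(const char *src, char *dst, size_t dst_size)
{
    static const struct { const char *srt, *ass; } map[] = {
        { "<i>", "{\\i1}" }, { "</i>", "{\\i0}" },
        { "<b>", "{\\b1}" }, { "</b>", "{\\b0}" },
        { "<u>", "{\\u1}" }, { "</u>", "{\\u0}" },
    };
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    while (*src) {
        const char *rep = NULL;
        size_t in_len = 1, rep_len = 1;   /* default: copy one byte */

        for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
            size_t len = strlen(map[i].srt);
            if (!strncmp(src, map[i].srt, len)) {
                rep     = map[i].ass;
                in_len  = len;
                rep_len = strlen(rep);
                break;
            }
        }
        if (out + rep_len >= dst_size)    /* keep room for the final NUL */
            return -1;
        memcpy(dst + out, rep ? rep : src, rep_len);
        out += rep_len;
        src += in_len;
    }
    if (out >= dst_size)
        return -1;
    dst[out] = '\0';
    return 0;
}
```

The reverse direction would be a similar table lookup on "{\i1}" and
friends, which is why this pair of formats converts fairly cleanly.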

But considering ASS, I am quite unsure about what part of the line should go
into the decoded text. IMHO, "Start" and "End" should not (they are timing,
not markup), but the other fields affect the markup.

>  - Last one is the precision limitation we already talked about (tb 1/100
>    for ASS, and 1/1000 for ones like SRT).

Again, timing, not markup.
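
For what it is worth, the precision issue is plain time base rescaling. A
minimal sketch, with a hypothetical rescale_ts() that converts a
non-negative timestamp between time bases 1/src_den and 1/dst_den, rounding
to nearest (libavutil's av_rescale_q() does this kind of job properly,
including overflow handling, which this toy version does not attempt):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rescale ts from time base 1/src_den to 1/dst_den,
 * rounding to nearest. No overflow protection. */
int64_t rescale_ts(int64_t ts, int src_den, int dst_den)
{
    assert(ts >= 0 && src_den > 0 && dst_den > 0);
    return (ts * dst_den + src_den / 2) / src_den;
}
```

Going from SRT's 1/1000 to ASS's 1/100 loses the last digit: a timestamp of
1234 ms becomes 123 centiseconds, and converting back gives 1230 ms. That is
a property of the timestamps, whatever markup the text carries.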

> For this particular use-case, isn't it possible to just alter the demuxed
> subtitles packet (ASS, text or bitmap field) before feeding it to the
> decoder doing the rendering?

It is possible, clearly, but I consider it a hack.

> Ah yes this was one of the issue: but don't we have width/height
> information in the ASS header?

No. The width/height fields in the ASS header define a coordinate system for
the various positioning information present, but do not presume the
actual resolution of the video. At best, they indicate its aspect
ratio.

> This makes me realize it is a problem if we have a subtitles format which
> says "use the input video resolution to do <whatever>" (and don't have it
> stored in its header like ASS) and we need to generate a ASS extradata
> header with the input video resolution we don't have access to (or do
> we?).

If we convert something to ASS, we need to generate some header extradata,
but not for the resolution: for the styles. And the resolution is just an
element of style: font size 32 at 1024×576 is totally equivalent to font
size 16 at 512×288.
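
The equivalence is just a proportional scaling against the script
resolution. A small sketch, assuming the script height is what ASS calls
PlayResY and using a hypothetical scale_font_size() helper:

```c
#include <assert.h>

/* Hypothetical helper: rescale a font size from a script with height
 * src_play_res_y to one with height dst_play_res_y, so that the text keeps
 * the same size relative to the frame. */
double scale_font_size(double size, int src_play_res_y, int dst_play_res_y)
{
    assert(src_play_res_y > 0 && dst_play_res_y > 0);
    return size * dst_play_res_y / src_play_res_y;
}
```

With size 32 and heights 576 and 288 this gives 16, matching the numbers
above, so the generated header can pick any convenient coordinate system as
long as the styles are expressed consistently with it.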

> Using lavfi would solve the problem?

I am not sure. lavfi does not seem well suited for seeking.

> Well rendering can be done conditionally, even in the core. And anyway, I
> don't expect to use anything else that libass for doing it.

Making core features optional is not very user-friendly.

> BTW, I don't think the encoders for (text) subtitles are much needed:
> most people just want to render SubRip, MicroDVD and ASS, and eventually
> convert an old deprecated not widely used format to a more modern one with
> markup like ASS. This is why I emit some doubt about a "libsubconvert"
> meant to do all kind of crazy convert.

There is no need to make anything crazy. We probably want a converter for
anything -> ASS, and whenever convenient, for ASS -> anything. And if formats
A and B are similar to each other and very different from ASS, A -> B and
B -> A.

Regards,

-- 
  Nicolas George