[FFmpeg-devel] Internal handling of subtitles in ffmpeg

Michael Niedermayer michaelni
Fri Jan 2 14:32:18 CET 2009

On Fri, Jan 02, 2009 at 09:18:10AM +0100, Reimar Döffinger wrote:
> On Fri, Jan 02, 2009 at 02:20:43AM +0100, Michael Niedermayer wrote:
> > On Fri, Jan 02, 2009 at 12:08:00AM +0100, Reimar Döffinger wrote:
> > > On Thu, Jan 01, 2009 at 10:36:36PM +0100, Michael Niedermayer wrote:
> Most relevant things first:
> > If you want to convince me to drop ASS in AVSubtitleRect and rather use
> > UTF8, for now, i dont think you would have much difficulty convincing me.
> > but that would mean no formating at all for now ...
> I think that's close to what I was thinking about, but note that I am
> considering AVSubtitle as the "public" face of FFmpeg subtitles
> and the big cases I can see are:
> 1) they have a ASS renderer. They want just one ASS string to pass to
> it.
> 2) they have something that can do text. They want text. Possibly in
> logical order, possibly with coordinates (movie-absolute will do, even
> if badly).
> 3) they can display or blend bitmaps. They want bitmaps as currently
> implemented.

What about this?

Index: libavcodec/avcodec.h
--- libavcodec/avcodec.h	(revision 16398)
+++ libavcodec/avcodec.h	(working copy)
@@ -2375,15 +2375,34 @@
 } AVPaletteControl attribute_deprecated;
+enum AVSubtitleType {
+    BITMAP_SUBTITLE,                ///< A bitmap, pict will be set
+    TEXT_SUBTITLE,                  ///< Plain text, the text and ass fields will be set
+    ASS_SUBTITLE,                   ///< Text+formatting, the text and ass fields will be set
+};
+
 typedef struct AVSubtitleRect {
     uint16_t x;
     uint16_t y;
     uint16_t w;
     uint16_t h;
     uint16_t nb_colors;
-    int linesize;
-    uint32_t *rgba_palette;
-    uint8_t *bitmap;
+    /**
+     * data+linesize for the bitmap of this subtitle.
+     * Can also be set for text/ass once they have been rendered.
+     */
+    AVPicture pict;
+    enum AVSubtitleType type;
+    char *text;                     ///< 0 terminated plain UTF-8 text
+    /**
+     * 0 terminated ASS/SSA compatible event line.
+     * The presentation of this is unaffected by the other values in this
+     * struct.
+     */
+    char *ass;
 } AVSubtitleRect;
 typedef struct AVSubtitle {
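
To illustrate how the three cases above would map onto this, here is a minimal, self-contained sketch of a consumer. Only the type/text/ass fields from the patch are mirrored (the real struct also has x/y/w/h, nb_colors and pict), and the helper function is hypothetical, not part of FFmpeg:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal mirror of the proposed API, for illustration only. */
enum AVSubtitleType {
    BITMAP_SUBTITLE,                /* pict would be set              */
    TEXT_SUBTITLE,                  /* text and ass would be set      */
    ASS_SUBTITLE,                   /* text and ass would be set      */
};

typedef struct AVSubtitleRect {
    enum AVSubtitleType type;
    char *text;                     /* 0 terminated plain UTF-8 text   */
    char *ass;                      /* 0 terminated ASS/SSA event line */
} AVSubtitleRect;

/* Case 2 above: a player that can only show plain text ignores the
 * formatting and takes the UTF-8 field; an ASS renderer (case 1) would
 * take rect->ass instead.  Returns NULL for bitmap subtitles, which the
 * caller has to blend from pict (case 3). */
static const char *pick_display_text(const AVSubtitleRect *rect)
{
    switch (rect->type) {
    case TEXT_SUBTITLE:
    case ASS_SUBTITLE:
        return rect->text;
    case BITMAP_SUBTITLE:
    default:
        return NULL;
    }
}
```

The point of the sketch: a simple consumer only has to look at one enum and one field, and can ignore everything it does not understand.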

> I'd prefer those to be available directly without having 50 more fields
> in the struct I have to read up on first to know if they are relevant
> for me and 10 conversion functions I have to call at the right place.
> > > Well, you know, I am trying to convince you to say: hell, let's do the
> > > simple stuff simple and proper and leave the rest to a complicated
> > > extension.
> > 
> > Iam still waiting for you to explain your simple & proper solution. It seems
> > what you suggest has changed somewhat so iam not entirely sure if you still
> > argue in favor of replacing decoder->encoder by bitstream filters or what
> > the intermediate format is supposed to be, originally you suggested ASS but it
> > seems you dont suggest this anymore?
> I am arguing that "the one subtitle representation" should not exist.
> Unfortunately that is not possible if conversion between all imaginable
> formats should work.

> That is why I came up with the "one ASS string blob (possibly in AVSubtitle)"
> because that at least hides it well (putting it in AVSubtitleRect does
> not hide it well because that will affect the meaning of x,y - actually
> already just allowing text in there will though).

There is a problem with putting a single ASS blob in AVSubtitle, and this is
the main reason why I am advocating putting it in AVSubtitleRect:
what if 2 such blobs start at the same time? They would end up in the same
packet, and thus the decoder would receive 2 ASS blobs.
If you volunteer to fix the issues caused by having multiple packets with
equal timestamps, we surely can put it in AVSubtitle; whether the result would
qualify as simple is of course another question ...
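
For illustration, two hypothetical ASS events that start at the same time and would therefore reach the decoder in one packet:

```
Dialogue: 0,0:00:05.00,0:00:08.00,Default,,0,0,0,,First line
Dialogue: 0,0:00:05.00,0:00:07.00,Default,,0,0,0,,Second line
```

With one ASS blob per AVSubtitleRect the decoder simply emits two rects; with a single blob per AVSubtitle it would have to merge or split them.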

> As to why I am arguing against it? I think that "the one subtitle
> representation" is most likely to lead to an API where writing your own
> decoder is less hassle than learning to use the API correctly.
> I ended the discussion because that is of course an impossible argument
> to make against a non-existent API, no matter what a convoluted example
> one makes someone else can say "well, but there is special function XY
> just for that!".
> Also I did not really feel that anyone gained any real insight by
> the "convoluted examples" I made (if you find them useful, I have two
> more at the end).
> I guess it comes down to design philosophy:
> > simple & proper solution
> I'm advocating a simple solution, and someone else may add as many hacks outside my view
> to handle the "proper" (supporting everything) stuff.
> I'd say you want a solution designed to be simple and proper, though to
> me it seems likely that anyone involved in the long discussion about it
> will still be able to judge if it is simple.

The problems I have with your solution are:
1. You only vaguely describe it, and it changes from argument to argument.
   This makes it impossible to implement it or to compare it properly against
   other solutions.

2. Using bitstream filters is not simple, even less so with the current
   requirement of adding them manually; but then I don't even know if you
   still suggest that we should use them instead of decoders/encoders.

3. Putting a single ASS blob in AVSubtitle, like you suggest, is only simple
   as long as one doesn't consider that a decoder (currently) may have
   more than one as input.

If, considering 3, we now allow several ASS blobs per AVSubtitle, and,
considering 2, we don't use bitstream filters, then your suggestion seems
identical to mine. If not, I'd like you to elaborate on what the remaining
difference is.

Besides this, I would be very happy if you would provide some more concrete
suggestions for how to implement things. I would really like to move forward,
and for this I need something I can implement, not some vague description.

> > > > The advantage is the same that there is for using AVCodecContext instead of
> > > > using a char* of an mpeg4 header to represent the related info.
> > > > it would very well be possible to make our mpeg2 decoder convert width/height
> > > > and so on into a mpeg4 bitstream and export that ...
> > > > Its just that working with int, float, ... is easier than parsing bitstreams
> > > > or strings
> > > 
> > > But that is exactly the point! Width and height for video are always
> > > simple ints, but once they could be arbitrary formulas wouldn't all you
> > > do just be inventing yet another encoding for the formulas?
> > 
> > i dont understand what you try to say.
> > I was arguing to export values through a struct instead of a char* using a
> > using a complex encoding.
> And I say: If the problem is difficult enough, your struct becomes just
> yet another complex encoding, i.e. you win a minor simplification by doubling
> the number of representations.

Let me be rude: did you ever read the ASS/SSA spec? Or do you just assume
it is a magic black box that has hundreds of obscure fields in convoluted
encodings?

> > > > Besides if some information from mpeg2 has no place in mpeg4, its a lot easier
> > > > to add the extra field or value to a struct than to find some way to squeeze
> > > > it in a string or bitstream.
> > > 
> > > What if MPEGn used XML structs with user defined elements that only very
> > > few people need? Would it still be the best way to export it that way
> > > when it muddles the API instead of just letting the people who want the
> > > really difficult things bear a bit more pain?
> > 
> > if there where xml in mpeg, i would see no problem exporting this in a
> > new and seperate field. Users wanting it could get it from there, others
> > could ignore it.
> That is what I want your "general purpose subtitle representation" to
> be: something that was primarily designed so it can be ignored, or
> alternatively you fail hard - but not something that most likely will be
> used in a way that is half working and half broken.

I want that as well.

> Greetings,
> Reimar Döffinger

> The two more "convoluted examples":
> 1) If there is some text and the same text again representing its
> shadow, should they be in the same AVSubtitleRect? If they're not that
> makes it harder to remove the "shadow" if you output it non-graphically.
> You might also render the glyphs more often than necessary if the shadow
> has the same size and shape.
> If they are, they may be far apart and the coordinates x/y are wrong for
> at least one.

The shadow will be in a separate AVSubtitleRect if and only if it is encoded
separately.
For ASS it will be in the same AVSubtitleRect when the ASS contains a single
event with a shadow; ASS supports this, and it is how I suspect people encode
shadows in ASS.
If some moron stored the shadow as a separate event line in ASS, or did the
equivalent in a non-ASS format, then independent of the API your decoder will
output 2 separate objects. Unless you merge them, which I do not really believe
you are advocating here.
x and y should not be any more correct or wrong due to effects like shadows
being applied.
In practice, of course, x and y may just be 0,0 after the decoder and only be
set once the text is rendered to a bitmap.
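
For illustration, a single hypothetical event that carries its own shadow; the \shad override tag (shadow depth) is part of ASS, and styles also have a Shadow field:

```
Dialogue: 0,0:00:01.00,0:00:04.00,Default,,0,0,0,,{\shad3}Hello world
```

This decodes to one object, shadow included, so no merging question arises.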

So really I don't see how your example could be related to the API at all,
assuming you don't argue to split a "Hello world"+flag_shadow into 2 objects,
and don't argue to compare all events in a file in order to merge things that
appear to be shadows.
That is, no matter what the API is, a decoder will give you 2 objects when
there are 2 objects stored and 1 object when there is 1 object stored.

> 2) Next, should the AVSubtitleRects be in logical order or rendering order?
> Your examples for the simplest text rendering assumed logical order.
> But you might have something like
> > small multiline subtitle text
> to be read in that order, but rendered overlapping, with the large text
> of course below (it would not be readable the other way round).
> So if you want to actually render them on screen that must be done in
> the opposite order from how you would read them.

I do not understand your example.

The spec says (for event lines):

Field 1:    Layer (any integer)

            Subtitles having different layer number will be ignored
            during the collusion detection.

            Higher numbered layers will be drawn over the lower layers.
Thus the rendering order is more or less set by the ASS event lines,
and I don't see how the order of the AVSubtitleRects would affect this much.
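
For illustration, two hypothetical events where the Layer field, not the order of the lines (or of the AVSubtitleRects), decides what is drawn on top:

```
Dialogue: 1,0:00:01.00,0:00:05.00,Default,,0,0,0,,small overlay text
Dialogue: 0,0:00:01.00,0:00:05.00,Default,,0,0,0,,large background text
```

The layer-0 text is drawn first and the layer-1 text over it, even though the layer-1 event comes first in the file.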


Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Answering whether a program halts or runs forever is:
on a Turing machine, in general impossible (Turing's halting problem);
on any real computer, always possible, as a real computer has a finite number
of states N and will either halt in fewer than N cycles or never halt.
