[FFmpeg-devel] Internal handling of subtitles in ffmpeg

Michael Niedermayer michaelni
Thu Jan 1 18:56:45 CET 2009

On Thu, Jan 01, 2009 at 06:16:07PM +0100, Reimar Döffinger wrote:
> On Thu, Jan 01, 2009 at 04:19:49PM +0100, Michael Niedermayer wrote:
> > Let me summarize what i remember from your standpoint, please correct
> > me if i misremember something
> > 1. decoders should output bitmaps 
> > 2. bitstream filters should convert between X->ASS and ASS->X
> Actually, I think bitstream filters, at least how they are done
> currently are horrible for usability.

I am well aware of this, but I am not aware of any suggestions, let alone
patches, to improve the situation. All I see is that occasionally people, me
included, refer to some mysterious automatic addition of bsfs that should
be implemented ...
So if you have any suggestions or patches for this, they are surely welcome.

> I was just thinking in terms of "quick solution that looks like a
> sensible template for a future good solution"
> > My suggestion
> > 1. decoders output vector based AVSubtitleRects containing ASS or bitmaps
> > 1b. encoders take vector based AVSubtitleRects containing ASS or bitmaps
> > 2.  A renderer can convert ASS AVSubtitleRects to bitmaps
> > 
> > You say "this feels like a horribly complex way to pass around of strings
> > without much of an advantage", can you please elaborate on this?
> The "horribly complex way" passing around AVSubtitleRects with text and
> coordinates.
> I think I'd already be mostly okay if the char * argument was in
> AVSubtitle and not AVSubtitleRects (because I do not even remotely see
> a rectangular position as an inherent property of some text - though
> actually that would be true also of any non-trivial bitmap subtitle
> format if such a thing existed).

> > The concrete problems i see with your design are
> > 1. The current architecture is demuxer->decoder->encoder->muxer
> >    considering that your decoders return bitmaps its no longer possible to
> >    encode these to text, thus breaking the "demuxer->decoder->encoder->muxer"
> This assumes that you want to treat subtitles "exactly" like video/audio
> which is somewhat questionable (lossless vs. lossy etc.).
> Also I can not see it work well with your approach either because
> an encoder after once agreeing on a format (pixfmt, size) will deal with
> all inputs, most subtitle encoder will handle only text or only bitmap,
> and you seem to not want to distinguish between text and bitmap "a
> priori".

I do want to distinguish them, and I did mention this; it seems it was lost
in the heat of the thread ...
What I suggested was 2 flags in AVCodec.capabilities indicating whether a
specific encoder can handle text and/or bitmaps.
This is just a rough suggestion; some other fields in AVCodec could be used
instead.
The whole thing would be in line with AVCodec.pix_fmts, which lists the pixel
formats supported by video encoders, and actually, the very same
AVCodec.pix_fmts field could be used to describe which bitmap pixfmts a
subtitle encoder supports.
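
A minimal sketch of the two-flag idea, for illustration only. The flag
names, the SketchCodec struct and sketch_encoder_accepts() are hypothetical
stand-ins and not part of libavcodec; only the notion of a capabilities
bitmask comes from the suggestion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability flags; no such names exist in libavcodec. */
#define SKETCH_CAP_SUB_TEXT   0x0001  /* encoder accepts text (ASS) rects */
#define SKETCH_CAP_SUB_BITMAP 0x0002  /* encoder accepts bitmap rects     */

enum SketchRectKind { SKETCH_RECT_TEXT, SKETCH_RECT_BITMAP };

/* Stand-in for the relevant part of AVCodec. */
typedef struct SketchCodec {
    const char *name;
    int capabilities;   /* bitmask of SKETCH_CAP_* */
} SketchCodec;

/* Returns 1 if the encoder can take this kind of rect directly,
 * 0 if a conversion step (e.g. rendering text to a bitmap) is needed. */
static int sketch_encoder_accepts(const SketchCodec *codec,
                                  enum SketchRectKind kind)
{
    int needed;

    assert(codec != NULL);
    needed = (kind == SKETCH_RECT_TEXT) ? SKETCH_CAP_SUB_TEXT
                                        : SKETCH_CAP_SUB_BITMAP;
    return (codec->capabilities & needed) != 0;
}
```

An encoder for a mixed format would simply set both flags, so the caller
can decide up front whether a renderer has to be inserted.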

> So what you would have to do would be decoder -> (possibly text<->bitmap
> transformer) -> encoder.
> Of course that would be comparable if you consider that "transformer"
> analogous to swscale, but then text-only and bitmap-only subtitle
> formats are as much "the same" as a RGB32 and a YV12 frame (with the
> difference of supporting mixed formats).
> > 2. How should mixed bitmap and text formats be represented?
> >    Your suggestion requires a bitstream filter to convert to ASS and then from
> >    ASS, but does ASS support bitmaps in every pixel format we would need, 
> >    besides how to put this in the char * ?
> I think ASS does not support bitmaps at all, only the next version with
> some other name IIRC. But I'd expect it would also support rotating
> bitmaps with some explicitly specified scaling algorithm and position
> relative to the border of the screen and crazy stuff like that, which
> leads to my original question for text subtitles "how to put this in
> AVSubtitleRects".

Adding parameters for affine transformations to AVSubtitleRect is not hard.
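
As a rough illustration of what such parameters could look like, here is a
sketch that attaches a 2x3 affine matrix to a rect-like struct. The types
SketchAffine and SketchSubtitleRect and both helper functions are
assumptions made up for this example; they are not the real AVSubtitleRect
layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2x3 affine matrix:
 *   x' = m[0][0]*x + m[0][1]*y + m[0][2]
 *   y' = m[1][0]*x + m[1][1]*y + m[1][2]
 * This covers translation, scaling, rotation and shear. */
typedef struct SketchAffine {
    double m[2][3];
} SketchAffine;

/* Hypothetical rect with an attached transform. */
typedef struct SketchSubtitleRect {
    int x, y, w, h;
    SketchAffine transform;
} SketchSubtitleRect;

/* Resets the transform so that points map to themselves. */
static void sketch_affine_identity(SketchAffine *a)
{
    assert(a != NULL);
    a->m[0][0] = 1.0; a->m[0][1] = 0.0; a->m[0][2] = 0.0;
    a->m[1][0] = 0.0; a->m[1][1] = 1.0; a->m[1][2] = 0.0;
}

/* Maps the point (x, y) through the transform into (*ox, *oy). */
static void sketch_affine_apply(const SketchAffine *a, double x, double y,
                                double *ox, double *oy)
{
    assert(a != NULL && ox != NULL && oy != NULL);
    *ox = a->m[0][0] * x + a->m[0][1] * y + a->m[0][2];
    *oy = a->m[1][0] * x + a->m[1][1] * y + a->m[1][2];
}
```

An encoder whose format has no notion of rotation or shear could check for
the identity matrix and otherwise fall back to a rendered bitmap.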

> And my answer is: AVSubtitleRects is fundamentally designed to only work
> for trivial subtitle formats due to assuming you split the subtitle in
> rectangular areas in a way that makes sense.

I see AVSubtitleRect as a generic primitive for a vector based
representation of subtitles ...
Maybe renaming AVSubtitleRect to AVSubtitleObject or AVSubtitlePrimitive
would make sense?

> > 3. Does ASS support every way text can be positioned by other formats?
> >    I mean if we convert from X to Y the text should stay at the same
> >    spot on the screen given Y can represent it.
> No idea, I was only claiming that AVSubtitleRects is orders of magnitude
> worse.

Could you elaborate on this?

> > Also in the light of "horribly complex", does it not feel horribly complex
> > to require every ASS->X bitstream filter to be able to extract things like
> > position, i mean in my suggestion these would be stored in an easily accessible
> > struct doing the extraction just at one spot.
> And they would be wrong for any "non-trivial" text subtitle.

I think you misunderstand what I am suggesting.
I do not suggest converting "left margin 5, top middle" to (512,50)-(600,100),
but rather storing an exact semantic equivalent of
"left margin 5, top middle" in AVSubtitleRect.

If some encoder then cannot represent the true position, it can call some
common code to convert it to pixel based or 0,0-1,1 screen relative
coordinates.
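
A sketch of what such common conversion code might look like, under the
assumption that the semantic position is stored as alignment plus margins.
SketchPos, its fields and both functions are hypothetical names invented
for this example, and the exact meaning given to the margins here is just
one possible choice.

```c
#include <assert.h>
#include <stddef.h>

enum SketchHAlign { SKETCH_LEFT, SKETCH_CENTER, SKETCH_RIGHT };
enum SketchVAlign { SKETCH_TOP, SKETCH_MIDDLE, SKETCH_BOTTOM };

/* Hypothetical semantic position: alignment plus margins in pixels. */
typedef struct SketchPos {
    enum SketchHAlign halign;
    enum SketchVAlign valign;
    int margin_l, margin_r, margin_v;
} SketchPos;

/* Resolves the top-left pixel of a box_w x box_h box on the screen. */
static void sketch_resolve_pixels(const SketchPos *p,
                                  int screen_w, int screen_h,
                                  int box_w, int box_h, int *x, int *y)
{
    assert(p != NULL && x != NULL && y != NULL);
    switch (p->halign) {
    case SKETCH_LEFT:  *x = p->margin_l;                      break;
    case SKETCH_RIGHT: *x = screen_w - p->margin_r - box_w;   break;
    default: /* center inside the area left between the margins */
        *x = p->margin_l
           + (screen_w - p->margin_l - p->margin_r - box_w) / 2;
        break;
    }
    switch (p->valign) {
    case SKETCH_TOP:    *y = p->margin_v;                     break;
    case SKETCH_BOTTOM: *y = screen_h - p->margin_v - box_h;  break;
    default:            *y = (screen_h - box_h) / 2;          break;
    }
}

/* Same position expressed in 0,0-1,1 screen relative coordinates. */
static void sketch_resolve_relative(const SketchPos *p,
                                    int screen_w, int screen_h,
                                    int box_w, int box_h,
                                    double *rx, double *ry)
{
    int x, y;

    assert(rx != NULL && ry != NULL && screen_w > 0 && screen_h > 0);
    sketch_resolve_pixels(p, screen_w, screen_h, box_w, box_h, &x, &y);
    *rx = (double)x / screen_w;
    *ry = (double)y / screen_h;
}
```

The semantic form stays untouched in the struct; only an encoder that
cannot express it asks for one of the derived forms.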

> > and general case here means
> > text -> text while not losing effects when the destination supports the
> >     effects
> > text -> bitmaps (not a single 95% transparent screen sized bitmap)
> > bitmaps -> display (with bitmaps not being colorspace converted twice)
> > text+bitmaps -> text+bitmaps
> Well, I just think you'd have to extend this to have at least those
> "basic" subtitle types:
> "DATA blob" (ASS with bitmap support extensions?, not possible to correctly
> represent as AVSubtitleRects, thus not using them - alternatively
> giving up on a common representation format for anything so advanced)
> "trivial" bitmap only (using AVSubtitleRects)
> "trivial" text only (using AVSubtitleRects)
> "trivial" bitmap+text (using AVSubtitleRects)

Please elaborate on what you consider trivial and non-trivial; I have
difficulty understanding this.

To me, any way to specify a position unambiguously is equivalent,
no matter whether the text is positioned with pixel based margins, a
rectangle, left/right justification flags, screen or display relative
coordinates with some rotation/shear/... (aka an affine transformation), or
something else.
I simply fail to see the distinction, or how one of these would be more
trivial than the others. Subtitle formats likely will not support every
way to specify a position, and thus some conversion will often be needed ...

Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Old school: Use the lowest level language in which you can solve the problem
New school: Use the highest level language in which the latest supercomputer
            can solve the problem without the user falling asleep waiting.