[FFmpeg-devel] [PATCH] [WIP] avformat/assdec: UTF-16 support

Clément Bœsch u at pkh.me
Sun Mar 30 11:25:31 CEST 2014


On Sun, Mar 30, 2014 at 11:22:07AM +0200, wm4 wrote:
> On Sun, 30 Mar 2014 10:29:35 +0200
> Clément Bœsch <u at pkh.me> wrote:
> 
> > On Fri, Mar 28, 2014 at 07:33:25PM +0100, wm4 wrote:
> > > This attempts to add UTF-16 subtitle support in the most simple way
> > > possible. It does so by replacing avio_r8() with ff_text_r8(), which
> > > converts UTF-16 on the fly to UTF-8. If the source is not UTF-16,
> > > it practically wraps avio_r8() without change.
> > > 
> > > This uses the BOM to recognize UTF-16 files. In practice, all UTF-16
> > > text files have a BOM. (I planned to use a somewhat more robust method
> > > to ddtect UTF-16, similar to MPlayer's subreader, but libavformat's
> > > architecture doesn't allow this easily.)
> > > 
> > > This also takes care of skipping the BOM properly in the UTF-8 case.
> > > Skipping the BOM is somewhat hard, because AVIOContext does not allow
> > > any readahead. Since ff_text_r8() includes its own read buffer (in case
> > > bytes were read that don't belong to a BOM), this becomes trivial.
> > > 
> > > The functionality added with this patch could be used to extend other
> > > subtitle formats with UTF-16 support.
> > > 
> > > It might be possible to implement the functionality provided by
> > > FFTextReader as custom AVIOContext, but I refrained from that
> > > because it's not easily possible to to return the correct stream
> > > position with this, and it also seemed too roundabout.
> > > ---
> > >  libavformat/assdec.c    | 21 ++++++++++++------
> > >  libavformat/subtitles.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >  libavformat/subtitles.h | 13 +++++++++++
> > >  3 files changed, 84 insertions(+), 7 deletions(-)
> > > 
> > 
> > This sounds good to me mostly. Now the thing is, ff_subtitles_read_chunk()
> > (used in mpsub, srt, and webvtt) should be adjusted to use that API. Same
> > for ff_smil_extract_next_chunk() which is used for SAMI and realtext (we
> > have utf16 sami samples). Finally, the most common one, ff_get_line()
> > should be adjusted as well. The problem with this last one is that it
> > doesn't use an AVBPrint buffer, so it might be a bit more tricky.
> 
> I'm not sure about ff_get_line(). It works on an AVIOContext. So I see
> the following possibilities (pick your favorite one):
> 
> a) Don't use ff_get_line() in subtitle parsers, and use a new function
>    based on ff_subtitles_next_line. (Also, ff_get_line() doesn't handle
>    Mac or DOS line breaks.)

Sorry, that's what I was assuming by "adjusting ff_get_line".

> b) Actually replace FFTextReader by a "converting" AVIOContext (either
>    by implementing FFTextReader as a custom AVIOContext backend, or
>    changing AVIOContext directly).
> 
> The simplest would be a), while b) sounds a bit involved. b) also makes
> it harder to get the original file position for AVPacket.pos.
> 
> > I'm not asking for those to be changed for the patch to be accepted, but
> > that's how I'd see a complete UTF-16 support for subtitles.
> > 
> > This code you introduce is definitely an improvement, thanks.
> > 
> > Your patch states a work in progress. What's your plan for the next
> > iteration?
> 
> Extending it to the other subtitle parsers. And maybe adding doxygen
> and splitting the commit.
> 

OK

> By the way, for which formats do you think UTF-16 is needed? I've
> spotted only ASS and SAMI UTF-16 files in the wild. Somehow I'd also
> expect to find such SRT and MicroDVD files, but I think I haven't seen
> such yet. For all other formats, it's probably a waste of time.

Same, ASS+SAMI, I think that's all I saw so far, and pretty sure there are
SRT as well.

-- 
Clément B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 490 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20140330/19f53bcf/attachment.asc>


More information about the ffmpeg-devel mailing list