[FFmpeg-devel] Discussion: Feature: Subtitle charenc detection

Thu Oct 23 16:09:27 CEST 2014

On Thu, 23 Oct 2014 13:56:54 +0200
Nicolas George <george at nsup.org> wrote:

> On duodi 2 Brumaire, year CCXXIII, Rodger Combs wrote:
> > As mentioned in https://trac.ffmpeg.org/ticket/4054#comment:1
> 
> Let me quote for completeness:
> 
> 11rcombs:
> >>> Sometimes, especially when ffmpeg is being called programmatically, it is
> >>> difficult or impossible for the caller (or user) to know the character
> >>> encoding of a subtitle file. It'd be useful for libavformat to provide a
> >>> mechanism to detect the encoding if an option is set, using some
> >>> combination of universalchardet, enca, or libguess.
> 
> gjdfgh:
> >> Some things to note:
> >> * no subtitle charset detector is good/sufficient, and you will always
> >>   have the situation in which you have multiple guesses, and you want the
> >>   user to select which guess, etc.
> >> * I think it's wrong to add detection directly to (or below) the subtitle
> >>   demuxers - instead, maybe there should be a function to guess subtitle
> >>   codec from a list of packets (you could provide a convenience function
> >>   which does that using the libavformat internal packet queue)
> >> * the actual subtitle conversion should be somewhere else too, and maybe
> >>   work on the packets (or you could set it as sub charset option in
> >>   libavcodec, forgot the option name) 
> >> Also, this should probably be discussed on the mailing list. The bug
> >> tracker sucks for this purpose.
> 
> 
> > There are a lot of nuances to this: it'll require linking at least one
> > (and possibly 3 or more) new dependencies, and it'll probably require at
> > least some changes to existing subtitle decoders.
> 
> AFAIK, the problem only happens with stand-alone text subtitle files.
> Formats that support muxed text subtitles usually specify the character
> encoding.
> 
> For stand-alone text files, the best approach IMHO is to have an API to just
> read text files, taking care of all annoying details (such as encoding, but
> not only: line endings, BOM, etc.), and the symmetric API for writing.
> 
> The subtitle demuxers would only need to use that API, which is not very
> difficult as they already all use common code to read entire files.
> 
> The API would also benefit other places in the code, like for the textfile
> option for the drawtext filter.
> 
> I had a proposal some time ago, but it did not have all the promised bells
> and whistles yet and drew so much bikeshedding that I had to put it
> on hold indefinitely.
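
For reference, a minimal sketch of what such a text-reading helper
could look like. It is in Python for brevity, and the names
(read_text, fallback_encoding) are made up for illustration; none of
this is existing FFmpeg API. It handles only the details listed above:
BOM, encoding, line endings.

```python
import codecs

# Hypothetical helper, not an FFmpeg API: maps a BOM to a codec name.
# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM begins
# with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_text(data: bytes, fallback_encoding: str = "utf-8") -> str:
    """Decode raw file bytes into text with LF-only line endings.

    A BOM, if present, decides the encoding and is stripped;
    otherwise the caller-supplied fallback is used.
    """
    encoding = fallback_encoding
    for bom, name in _BOMS:
        if data.startswith(bom):
            data = data[len(bom):]
            encoding = name
            break
    # Undecodable bytes become U+FFFD instead of raising.
    text = data.decode(encoding, errors="replace")
    # Normalise CRLF and lone CR to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Note that the decoded string is all the caller gets back; the raw
bytes are gone once this returns.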

I kind of disagree. With your approach, all data gets (potentially)
trashed on opening, and it's hard to change the encoding afterwards.
You'd have to reopen the demuxer, and then either somehow cache all
input data (but only when opening subs) or accept re-reading all of
it.
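
One way around that, as a rough sketch: keep the packet payload as raw
bytes and decode only on demand. This assumes the file syntax can be
parsed without knowing the encoding, which holds for ASCII-compatible
encodings but not for UTF-16. RawSubtitleEvent is a hypothetical name,
written in Python for brevity, and not anything in libavformat.

```python
class RawSubtitleEvent:
    """Hypothetical container: undecoded payload plus its timing."""

    def __init__(self, start_ms: int, duration_ms: int, payload: bytes):
        self.start_ms = start_ms
        self.duration_ms = duration_ms
        self.payload = payload  # raw bytes exactly as read from the file

    def text(self, encoding: str) -> str:
        # Decoding happens on demand and never modifies the payload,
        # so a wrong guess is fixed by asking again with another
        # encoding; nothing has to be reopened or re-read.
        return self.payload.decode(encoding, errors="replace")
```

With this, switching from a wrong guess to the right one is just a
second call to text() on the same object.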

> Concerning the specific issue of detecting the encoding, I believe a
> pluggable API is best: even if FFmpeg is built with only the basic internal
> heuristics, the application can provide support for
> libomniscientcharsetguess.
> 
> Regards,
>
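
A pluggable detection API along those lines could be sketched roughly
as follows. All names here (register_detector, guess_encodings,
detect_utf8) and the confidence values are assumptions for
illustration, in Python for brevity; this is not an existing or
proposed FFmpeg interface. It returns every guess rather than a single
answer, since there will often be more than one candidate.

```python
from typing import Callable, List, Optional, Tuple

# A detector takes raw bytes and returns (encoding, confidence in 0..1),
# or None when it has no opinion.
Detector = Callable[[bytes], Optional[Tuple[str, float]]]

_detectors: List[Detector] = []


def register_detector(detector: Detector) -> None:
    """Add a detector; both the library and the application can call this."""
    _detectors.append(detector)


def guess_encodings(data: bytes) -> List[Tuple[str, float]]:
    """Return all guesses, highest confidence first."""
    guesses = [g for g in (d(data) for d in _detectors) if g is not None]
    return sorted(guesses, key=lambda g: g[1], reverse=True)


def detect_utf8(data: bytes) -> Optional[Tuple[str, float]]:
    # Basic built-in heuristic: the data decodes as strict UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return ("utf-8", 0.9)


register_detector(detect_utf8)

# Application side: plug in an extra detector. This dummy always
# proposes cp1252 with low confidence; a real one might wrap enca,
# libguess or universalchardet.
register_detector(lambda data: ("cp1252", 0.1))
```

For input that is not valid UTF-8, only the application-provided guess
comes back, and the caller can still show the list to the user.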