[FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

Tue Dec 16 23:03:48 CET 2014

> On Dec 14, 2014, at 10:06, Nicolas George <george at nsup.org> wrote:
> 
> On tridi 23 Frimaire, year CCXXIII, Rodger Combs wrote:
>> I couldn't see a sensible way to do this in lavc, since the detector
>> libraries generally require more than one packet to work effectively.
>> Looking at that doxy again, I can see how the detection could be done in
>> lavf and the conversion in lavc, but I don't really see an advantage there
>> other than fewer API changes.
> 
> There is no benefit in doing the conversion in lavc for text files, but text
> files processed by lavf are not the only source of subtitles. The conversion
> in lavc must stay there for those cases, and the conversion in lavf must
> work gracefully with it.
> 

Hmm. Unless I'm missing something, there's no way to do the detection in lavf and then the conversion in lavc, so we'll need the conversion in both regardless. Still, it'd probably make sense for the actual conversion code to live in lavu, so that lavf and lavc can share it.

>> So, by default it'd just handle encoding, and then additional
>> normalization features could be enabled by the consumer? Sounds useful
>> indeed.
> 
> Something like that. You can have a look at the first draft for the API
> there:
> 
> http://ffmpeg.org/pipermail/ffmpeg-devel/2013-August/146979.html
> 
> Splitting lines and normalizing LF was enabled by a flag.
> 
> The API itself will probably need to be changed to allow pluggable detection
> modules without using more global state.
> 

Looks like a good point to work from.
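
To make the flag idea concrete, here is a minimal sketch of flag-controlled LF normalization. The flag and function names are hypothetical and are not taken from the linked draft; it only illustrates that the buffer is left untouched unless the consumer opts in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag; the linked draft may name or number it differently. */
#define TEXT_FLAG_NORMALIZE_LF 0x01u

/* Rewrite CRLF and lone CR as LF, in place, when the flag is set.
 * Returns the new length of the data in buf (never larger than size). */
static size_t text_normalize(char *buf, size_t size, unsigned flags)
{
    size_t r, w = 0;

    assert(buf || size == 0);
    if (!(flags & TEXT_FLAG_NORMALIZE_LF))
        return size;                 /* default: bytes pass through as-is */

    for (r = 0; r < size; r++) {
        if (buf[r] == '\r') {
            /* Swallow the LF of a CRLF pair so it is emitted only once. */
            if (r + 1 < size && buf[r + 1] == '\n')
                r++;
            buf[w++] = '\n';
        } else {
            buf[w++] = buf[r];
        }
    }
    return w;
}
```

With the flag set, the 7 bytes "a\r\nb\rc\n" become the 6 bytes "a\nb\nc\n"; with flags == 0 the call is a no-op.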

>> I like this model in general, but it brings up a few questions that I kind
>> of dodged in my patch. For instance, how should lavu determine which
>> module's output to prefer if there are conflicting charenc guesses? How
>> can the consumer choose between the given guesses?
> 
>> In my patch, preference is very simplistic and the order is hardcoded. In
>> a more modular system, it'd have to be a bit more complex; I can imagine
>> some form of scoring system, or even another type of module that ranks
>> possible guesses, but that could get very complex very fast. Any ideas for
>> this?
> 
> In this case, I believe that keeping things simple at the API level is the
> best approach: the detection state is held in a structure, and each detection
> module is called in turn with the same structure and updates it with its result.
> 
> Then, it is only a matter of specifying what an acceptable "update" is: only
> change a value if the new value is more sure than the previous one.
> 
> As for the exact fields that must be present in the structure, that depends
> on what useful information each relevant library can return.
> 

The trouble here is that some detection libraries don't provide a "certainty" parameter, or don't expose it.
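
One way the "shared structure, update only if more sure" rule could cope with that, as a rough sketch: give "no certainty available" a sentinel value that sorts below every real certainty. All names below are hypothetical (none of this is existing lavu API), and the policy that a certainty-less guess is only used when nothing else exists is my assumption, not something agreed in this thread. The two modules are toy stand-ins for real detector libraries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: a module that cannot report a certainty uses this value.
 * It is lower than any real certainty (0..100). */
#define CHARENC_CERTAINTY_UNKNOWN (-1)

typedef struct CharencGuess {
    const char *encoding;   /* NULL while no module has produced a guess */
    int certainty;          /* 0..100, or CHARENC_CERTAINTY_UNKNOWN */
} CharencGuess;

/* A detection module: returns 1 and fills *out if it has a guess. */
typedef int (*CharencDetectFn)(const unsigned char *buf, size_t size,
                               CharencGuess *out);

/* Only accept a candidate that is "more sure" than the current state.
 * Because UNKNOWN is -1, a certainty-less guess can fill an empty state
 * but never displaces an existing guess. */
static int charenc_update(CharencGuess *state, const CharencGuess *cand)
{
    assert(state && cand);
    if (!cand->encoding)
        return 0;
    if (!state->encoding || cand->certainty > state->certainty) {
        *state = *cand;
        return 1;
    }
    return 0;
}

/* Call every module in turn with the same state structure. */
static void charenc_detect_all(const CharencDetectFn *mods, size_t nb_mods,
                               const unsigned char *buf, size_t size,
                               CharencGuess *state)
{
    for (size_t i = 0; i < nb_mods; i++) {
        CharencGuess cand = { NULL, CHARENC_CERTAINTY_UNKNOWN };
        if (mods[i](buf, size, &cand))
            charenc_update(state, &cand);
    }
}

/* Toy module 1: a UTF-8 BOM is treated as a certain answer. */
static int detect_utf8_bom(const unsigned char *buf, size_t size,
                           CharencGuess *out)
{
    if (size >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        out->encoding  = "UTF-8";
        out->certainty = 100;
        return 1;
    }
    return 0;
}

/* Toy module 2: stands in for a library that exposes no certainty. */
static int detect_fallback_latin1(const unsigned char *buf, size_t size,
                                  CharencGuess *out)
{
    (void)buf; (void)size;
    out->encoding  = "ISO-8859-1";
    out->certainty = CHARENC_CERTAINTY_UNKNOWN;
    return 1;
}
```

The result does not depend on the order the modules run in, which is the property the hardcoded ordering in my patch lacks. The downside is that two certainty-less modules that disagree are still resolved by call order.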

>> In my patch, the consumer can override the choice of encoding by making
>> changes to the AVFormatContext between "header reading" and retrieving the
>> packet; it seems like the best way to do so in your system would be to
>> pass a callback.
> 
> Can you explain in what situation this kind of overriding would be
> necessary?
> 

For instance, if a player (or even ffmpeg.c) tries to play/transcode a subtitle file and finds itself with multiple guesses for its encoding, it may want to present the user with a UI to have them select (what they think is) the correct one from the list, or enter the actual value if all guesses were wrong.
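
As a sketch of what such a callback could look like (hypothetical names and signature, not an existing lavf interface): the library hands the consumer the list of guesses, and the consumer returns one of them, a name of its own, or NULL to accept the default. The example callback just returns whatever string it was given through opaque, standing in for a real UI prompt.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this is not existing lavf/lavu API. */
typedef struct CharencCandidate {
    const char *encoding;  /* e.g. "UTF-8" */
    int certainty;         /* 0..100, or -1 if the detector gave none */
} CharencCandidate;

/* Called once detection has finished. May return one of the candidates'
 * names, any other encoding name (e.g. typed in by the user), or NULL
 * to let the library decide. */
typedef const char *(*CharencChooseFn)(void *opaque,
                                       const CharencCandidate *cands,
                                       size_t nb_cands);

/* Default policy: highest certainty wins; earlier entries win ties. */
static const char *charenc_default_choice(const CharencCandidate *cands,
                                          size_t nb_cands)
{
    const CharencCandidate *best = NULL;

    for (size_t i = 0; i < nb_cands; i++)
        if (!best || cands[i].certainty > best->certainty)
            best = &cands[i];
    return best ? best->encoding : NULL;
}

/* Ask the consumer first, then fall back to the default policy. */
static const char *charenc_select(const CharencCandidate *cands,
                                  size_t nb_cands,
                                  CharencChooseFn choose, void *opaque)
{
    const char *picked = NULL;

    assert(cands || nb_cands == 0);
    if (choose)
        picked = choose(opaque, cands, nb_cands);
    return picked ? picked : charenc_default_choice(cands, nb_cands);
}

/* Example consumer callback: a real player would show the list to the
 * user; this one returns the answer stored in opaque (may be NULL). */
static const char *charenc_choose_fixed(void *opaque,
                                        const CharencCandidate *cands,
                                        size_t nb_cands)
{
    (void)cands; (void)nb_cands;
    return (const char *)opaque;
}
```

Compared with poking at the AVFormatContext between header reading and the first packet, this keeps the override at one well-defined point, and a consumer that passes no callback gets the automatic behavior.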

>> On a bit of a side-note: my system is designed to make every possible
>> effort to return a recoded packet, with multiple layers of fallback
>> behavior in case the first guess turns out to be incorrect or the source
>> file is outright invalid. I wouldn't expect that to be significantly more
>> difficult with your design, but I wonder what your opinions on the setup
>> are?
> 
> I believe this depends on the user. Some users want everything to work
> automagically; others want to be notified when even the smallest detail
> goes wrong. In the end, it should probably come down to an
> option:
> 
> ffmpeg -text_encoding certainty_threshold=80:allow_substitute=invalid
> 
> for example, to accept a guess only when it has at least 80% certainty and
> to allow replacing invalid input sequences with a mask character.
> 

Sounds generally sensible, except that the certainty parameter isn't returned by all detection libraries.
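
For illustration, here is a self-contained sketch of how such an option string could be parsed and applied, including the missing-certainty case. The option syntax is the one from your example; the struct, the function names, and in particular the rule that a guess without a certainty only passes when the threshold is 0 are assumptions of mine. A real implementation would presumably go through the existing option and dictionary helpers rather than hand-rolled parsing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure; field names mirror the example option keys. */
typedef struct TextEncodingOpts {
    int certainty_threshold;  /* 0..100; 0 accepts any guess */
    int substitute_invalid;   /* 1 if allow_substitute=invalid was given */
} TextEncodingOpts;

/* Compare a non-terminated token of length len against a C string. */
static int token_is(const char *tok, size_t len, const char *name)
{
    return strlen(name) == len && !memcmp(tok, name, len);
}

/* Parse a decimal value in 0..100; returns -1 on any error. */
static int parse_percent(const char *s, size_t len)
{
    int v = 0;

    if (len == 0 || len > 3)
        return -1;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < '0' || s[i] > '9')
            return -1;
        v = v * 10 + (s[i] - '0');
    }
    return v <= 100 ? v : -1;
}

/* Parse "key=value:key=value". Returns 0 on success, -1 on bad input. */
static int text_encoding_parse(const char *str, TextEncodingOpts *o)
{
    assert(str && o);
    o->certainty_threshold = 0;
    o->substitute_invalid  = 0;

    while (*str) {
        const char *end = strchr(str, ':');
        size_t len = end ? (size_t)(end - str) : strlen(str);
        const char *eq = memchr(str, '=', len);
        size_t klen, vlen;

        if (!eq)
            return -1;
        klen = (size_t)(eq - str);
        vlen = len - klen - 1;

        if (token_is(str, klen, "certainty_threshold")) {
            int v = parse_percent(eq + 1, vlen);
            if (v < 0)
                return -1;
            o->certainty_threshold = v;
        } else if (token_is(str, klen, "allow_substitute")) {
            if (!token_is(eq + 1, vlen, "invalid"))
                return -1;
            o->substitute_invalid = 1;
        } else {
            return -1;              /* unknown key */
        }
        str += len;
        if (*str == ':')
            str++;
    }
    return 0;
}

/* Assumption: a negative certainty means "detector gave none", and such
 * a guess is accepted only when no threshold was requested. */
static int text_encoding_accepts(const TextEncodingOpts *o, int certainty)
{
    assert(o);
    if (certainty < 0)
        return o->certainty_threshold == 0;
    return certainty >= o->certainty_threshold;
}
```

With certainty_threshold=80, a guess at 85 is accepted, one at 50 is rejected, and one from a library without a certainty value is rejected too. Whether that last case should instead be configurable is exactly the open question.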

>> So, the text-file-read API would buffer the entire input file and perform
>> charenc detection/conversion and/or other normalization, then FFTextReader
>> would read from the normalized buffer?
> 
> Something like that. Since FFTextReader is internal, there is room to choose
> the exact implementation.
> 
> Regards,
> 
> -- 
>  Nicolas George