[FFmpeg-devel] [PATCH] libavcodec: Do not return encoding errors when -sub_charenc_mode is do_nothing

Mon Sep 9 16:46:18 CEST 2013

On 8 sep. 2013, at 19:38, Nicolas George <nicolas.george at normalesup.org> wrote:
> Le nonidi 19 fructidor, an CCXXI, Eelco Lempsink a écrit :
>> Ah, but there is where it gets interesting.  The fact that there is no
>> perfect solution means there is an opportunity for design.  That is where
>> user feedback and naming of options and choosing the right defaults
>> matter.  And that is the kind of discussion I wish we could have.  I’ll
>> start.
> 
> Ok, good.
> 
>> I think FFmpeg should try to do character encoding detection when no
>> character encoding is specified using a library that uses a heuristic
>> based on statistics.  As with analyzing video streams, FFmpeg can analyze
>> a bit of the subtitles and report the most probable encoding.  When trying
>> to run FFmpeg on a subtitle stream whose encoding could not be detected
>> (with enough confidence) and without specifying an encoding by hand, it
>> will exit with an error.
> 
> There are several issues to discuss.
> 
> First of all, analyzing audio and video to find the resolution or number of
> channels requires only a few frames, possibly just one. OTOH, analyzing text
> encoding based on statistical methods requires a lot of text.

That depends.  If you’re lucky, you’ll find a Unicode BOM or other byte combinations that indicate a certain encoding.  I don’t know what kind of confidence would be achievable, but...

> Therefore, it can not work for embedded subtitles, since it would hit the
> probesize limit before having anything near enough text.
> 
> Fortunately, AFAIK, all formats that allow embedded text subtitles specify
> the character encoding, making probing unnecessary. (Broken files probably
> exist; requiring user intervention to handle broken files seems acceptable.)

Great.

> That leaves the plain text subtitles files.

If you allow the probesize limit for plain text subtitle files that would be more than enough to achieve pretty good confidence scores if not the best possible. (Our application looks at about the first 5kB, which seems to work pretty well).

> The second point I want to make is that when lavc/lavf probe the audio and
> video streams, they do so using the codecs in lavc, not external libraries
> (except for a few codecs), and the result is reliable. FFmpeg is not
> intended to host a complex library of encoding detection, nor should such a
> library be a mandatory dependency.

That sounds acceptable to me.

> Therefore, probing using a statistical library should be optional, and for
> consistency of behaviour, it should be optional both at build time and at
> run time.

Again, sounds fair.

> That brings us back to my original proposal: use by default a simple and
> naïve (but good enough for a lot of cases) algorithm, and leave the option
> of compiling FFmpeg with support for third-party libraries doing a smarter
> detection.

At this point I don’t agree.  I think having multiple encoding detection systems is confusing for the user.  Furthermore, while the naïve algorithm works well if all goes well, when it does not work it’s impossible to tell.  In other works, it makes the easy case a slightly better, but the hard a lot worse.  I would prefer that FFmpeg would just take a stand and require the input to be UTF-8 if there is no encoding specified (and no detection available).

That said, if the ‘simple’ algorithm is limited to only doing conversions it can do perfectly (e.g. UTF-16 to UTF-8), it would be fine, in my opinion.

>> Fully agreed.  
>> 
>> Talking about simplification, would it be useful to simplify the current
>> situation first before introducing new stuff?  I reviewed the code and I
>> fail to see the need for the ‘sub_charenc_mode’ option:
> 
> The reason for that is that sub_charenc_mode is necessary for parts of the
> code that are not there yet.

I understand, and it might be a matter of taste, but I try to make complexity ‘earn’ its way into a code base.  Having stuff around for the future is often only confusing now.

>> I would therefore argue that the -sub_charenc_mode option should be
>> removed (also preventing confusion over what the options mean exactly).
> 
> As I explained once before, sub_charenc_mode is there to allow the recoding
> to happen before the decoder, for example in the demuxer. Without it, either
> the demuxer sets sub_charenc, and the recoding is done twice, or it does not
> set it and loses information.
> 
> Do you have a better proposal?

If sub_charenc is set to UTF8 recoding should not be done/necessary.  I agree that the loss of information is not nice, though.  I’m not familiar enough with the FFmpeg internals to say what the best way is to pass this kind of internal state around, but keeping it close to the current solution, how about:

- Remove sub_charenc_mode as an option from the command line tool (I think my argument still stands here, as the do_nothing option is not something that is meant to be passed from the command line, or can you give me an example where it does make sense?)

- Rename sub_charenc_mode to sub_charenc_recoded with values 1 (already recoded) and 0 (untouched).

Regards,

Eelco Lempsink
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 204 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20130909/67664c03/attachment.asc>