[FFmpeg-devel] [PATCH] libavcodec: Do not return encoding errors when -sub_charenc_mode is do_nothing

Thu Aug 29 14:03:28 CEST 2013

Le duodi 12 fructidor, an CCXXI, Eelco Lempsink a écrit :
> First off, I’m really happy to see FFmpeg taking character encodings into
> consideration.  I honestly thought that ‘do_nothing’ meant something else,
> I didn’t mean to send a snark-patch.  (I only saw the previous discussion
> after I send the patch.)

No problem.

> That said, the current implementation is incomplete and does not belong in
> a release version, in my opinion.  When using FFmpeg in part of a larger
> workflow where character encodings are not known in advance, you simply
> need all the bits to properly guess the encoding.  Even when the recoding
> API is fully implemented there will be cases where FFmpeg will make the
> wrong guess (the alternative would be to wait until all the data is known
> and then make the guess, which would be a problem for streams).
> 
> I have some experience doing character encoding detection using libicu
> (http://site.icu-project.org) since that is what we currently use to
> process the output of FFmpeg, and I’d be happy to offer advice and help,
> if wanted.
> 
> How about this:
> - Rename the ‘do_nothing’ option to ‘assume_utf8’.
> - Add a ‘passthrough’ option.
> Would that be an acceptable patch?

This is not how things are supposed to work in this case. Let me explain the
plan once, because the whole thing is split in several mail discussions.

First of all, the output of avcodec_decode_subtitles2() is supposed to be
text, not binary data. Therefore, nowadays, it needs to be encoded in some
sort of Unicode encoding, and the only two reasonable solutions are uint8_t
as an UTF-8 string or uint32_t as a wide char string. The UTF-8 solution is
more reasonable nowadays.

Therefore, if the output of avcodec_decode_subtitles2() is not valid UTF-8,
that means there is something wrong going on.

Thus, removing/bypassing the test like you propose is similar to the classic
joke "Doctor, it hurts when I press here. -- Then don't do it.": it is
silencing the warning, not fixing the cause of the problem. The correct way
of handling it is to find what is causing binary data to leak where text is
expected.

The same goes for audio, for that matter: the samples produced by
avcodec_decode_audio4() are numbers, not binary data. Therefore, if an audio
decoder outputs samples in little endian on a big endian architecture, it is
bogus and needs to be fixed. If there was a way of detecting samples in
reversed endianness without rejecting any valid samples, it would be a good
idea to add it (but of course it does not exist).

The job of converting the subtitle into proper UTF-8 text can be done in one
of several places, depending on the situation.

* In the decoder itself. That is the most generic possibility, but it
  requires changing all decoders. It is necessary for decoders that mix
  binary data with the actual text. Otherwise, more generic solutions are
  more convenient.

* In the generic code just before the decoder. That applies for subtitles
  formats where the packets are text with markup (ASS, SRT, etc.). In this
  case, the generic code converts the encoding of the packet payload before
  submitting it to the decoder.

* In the demuxer: that is necessary for subtitles read from text files in
  UTF-16, because, for example, the SRT demuxer expects "-->", not
  "-\0-\0>\0".

The AVCodecContext.sub_charenc_mode field is used to distinguish between
these situations: PRE_DECODER activates the second one, while DO_NOTHING
means that the job is done either by the demuxer or the decoder.

But the DO_NOTHING mode does not mean that the output is not expected to be
UTF-8: the output is always expected to be UTF-8, and changing that would be
a very bad idea. The DO_NOTHING mode means that the conversion to UTF-8 is
done by some other part of the code. Unfortunately, these other parts do not
exist yet. Note that I submitted a month ago a patch implementing the
encoding conversion in the SRT demuxer, but it was not reviewed yet.

The final check, the one you propose to disable, is there to ensure that the
job was done at least once: it is there to catch situations where a bogus
code path can lead to binary data leaking to the output.

Therefore, using sub_charenc_mode to bypass the final check is a mistake. I
will not oppose adding another option to do that, as long as it is properly
documented "this option disables the check that decoded subtitles are in
proper UTF-8; use this option as your own risks, as it hides bugs and bad
uses of the API".

But I suggest you describe your use case before that, because I suspect
that, even right now, you can do what you need without this option.

Regards,

-- 
  Nicolas George
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20130829/c6bafcd1/attachment.asc>