[FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

Thu Sep 24 12:36:17 CEST 2015

Le duodi 2 vendémiaire, an CCXXIV, James Darnley a écrit :
> It is not supposed to replace any invalid bytes with a "random"
> character. That sounds like it will only make the problem worse with
> that lossy removal of data. This is trying to fix incorrect
> interpretation of bytes.
> 
> This feature is to transform bytes into other bytes which when
> interpreted and displayed the correct text is appears on screen.
> 
> I will detail my exact use case at the bottom.

Indeed, but a feature like that must be designed with all reasonable use
cases in mind, and replacing isolated invalid octets values in input by a
fixed replacement character is a common practice, totally acceptable when
dealing with large amounts of data. I am not saying that it must be enabled
by default, but it needs to be an option.

> What do you mean? You need at least two encodings to be given to perform
> a conversion.  Two things become a list.

Yes, but the code to handle more than two is much more complicated than it
would have been with just two elements, i.e. only one conversion.

> It might not be very good but it is (void*) and NULL if you don't use
> the feature.

Yes, I understood that, but this is fragile code.

> It shouldn't.  This function receives buf2_len as equal to BUF_LEN - 1
> which means that iconv can only advance buf2 to buf2 + BUF_LEN - 1 which
> will let us write 0 into the last byte.

In that case, it should be written very explicitly in the function
documentation, otherwise someone may change the code and break the
assumption.

Also, I notice another flaw in that code: it uses '\0' as string terminator.
Text in non-ASCII encodings can contain '\0'. The end of the text must be
handled differently, with an explicit size argument or maybe a pointer to
the end.

> I won't send another patch for a little while.  I will see how your API
> proposal plays out.
> 
> And now for my tale.
> 
> I wanted ffmpeg to turn the string at [1] into the string at [3].  [1],
> with the exact hex bytes at [2], is artist tag out of a Flac file.  Flac
> files have Vorbis Comment metadata tags.  They are UTF-8 only.  If a
> program puts incorrect data in there how will any other program know how
> to display it?  What's worse is when the data gets converted twice.

Indeed. All modern formats should specify UTF-8 for all strings, there is
absolutely no valid reason to do otherwise nowadays.

> This specific case was to convert CP1252 to UTF-8 to GBK -- that is to
> interpret the input bytes as the CP1252 encoding and convert them to
> UTF-8, then take those bytes and convert them to GBK.  I added the code
> needed to take an argument in the form
> > "CP1252,UTF-8,GBK"
> parse it into separate encodings, open two iconv contexts, and finally
> perform the conversion.

I can not reproduce your conversion, but there is something that bugs me in
your reasoning.

With any sane iconv implementation, converting from A to B then from B to C
will either fail if B is too poor to code the input, or succeed and then
give the exact same result as converting directly from A to C.

As for chaining a conversion from A to B then from C to D with B != C, this
is braindead, and if FFmpeg were to support it at all, it should only be
with a very explicit request from the user to perform a dangerously lossy
conversion.

I do not understand what you were trying to achieve with your
"CP1252,UTF-8,GBK" conversion. First, why finish with GBK instead of sane
UTF-8? And second, if the output is really wanted in GBK, then what is the
purpose of the intermediate UTF-8 step?

IMHO, the only sane way of dealing with text is to handle everything in
UTF-8 internally: demuxers and anything that deals with input should convert
as soon as the encoding is known, muxers and anything that deals with output
should convert as late as possible.

(We do not do that for audio and video because conversions are expensive,
but text conversions are cheap and negligible compared to video and audio
processing.)

Regards,

-- 
  Nicolas George
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20150924/8ec4748b/attachment.sig>