[FFmpeg-devel] [RFC][PATCH] ffmpeg: add option to transform metadata using iconv

James Darnley james.darnley at gmail.com
Thu Sep 24 15:15:06 CEST 2015

On 2015-09-24 12:36, Nicolas George wrote:
> On duodi, 2 Vendémiaire, year CCXXIV, James Darnley wrote:
>> It is not supposed to replace any invalid bytes with a "random"
>> character. That sounds like it will only make the problem worse with
>> that lossy removal of data. This is trying to fix incorrect
>> interpretation of bytes.
>> This feature is to transform bytes into other bytes which, when
>> interpreted and displayed, show the correct text on screen.
>> I will detail my exact use case at the bottom.
> Indeed, but a feature like that must be designed with all reasonable use
> cases in mind, and replacing isolated invalid octet values in the input
> with a
> fixed replacement character is a common practice, totally acceptable when
> dealing with large amounts of data. I am not saying that it must be enabled
> by default, but it needs to be an option.


As far as I understand the iconv API, it doesn't do this for you, so
adding this feature would require writing code to handle more of the
errors returned by iconv().  That in turn means a more complicated
argument-handling structure.

I don't mind trying to write this but it would be better to do it behind
the API you propose.  I will help you with it as best I can because I
seem to have involuntarily volunteered myself.
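
As a sketch of what that option might look like (untested, and the
function name, the '?' replacement character, and the skip-one-byte
recovery are my own choices, not anything in the patch), the EILSEQ and
EINVAL errors from iconv(3) could be handled like this:

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical sketch: convert with a fixed replacement character for
 * invalid input sequences, assuming glibc-style iconv(3) semantics.
 * Returns the number of bytes written to out, or -1 on hard failure. */
static ptrdiff_t convert_lossy(iconv_t cd, const char *in, size_t in_len,
                               char *out, size_t out_size)
{
    char *inp = (char *)in, *outp = out;
    size_t inleft = in_len, outleft = out_size;

    while (inleft) {
        size_t ret = iconv(cd, &inp, &inleft, &outp, &outleft);
        if (ret != (size_t)-1)
            break;
        if (errno == EILSEQ || errno == EINVAL) {
            /* Skip one invalid input byte and emit a replacement. */
            if (!outleft)
                return -1;
            *outp++ = '?';
            outleft--;
            inp++;
            inleft--;
        } else {
            return -1;  /* E2BIG or another hard error */
        }
    }
    /* Flush any shift state for stateful encodings. */
    iconv(cd, NULL, NULL, &outp, &outleft);
    return outp - out;
}
```

The loop is what I meant by "more errors": each invalid byte produces
another (size_t)-1 return that has to be swallowed and resumed from.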

>> What do you mean? You need at least two encodings to be given to perform
>> a conversion.  Two things become a list.
> Yes, but the code to handle more than two is much more complicated than it
> would have been with just two elements, i.e. only one conversion.
>> It might not be very good but it is (void*) and NULL if you don't use
>> the feature.
> Yes, I understood that, but this is fragile code.

Again, noted.  I can't see this being much less fragile, though.
Having thought about it again, I could reuse the av_dynarray that holds
the character-encoding strings.

>> It shouldn't.  This function receives buf2_len equal to BUF_LEN - 1,
>> which means iconv can advance buf2 at most to buf2 + BUF_LEN - 1,
>> leaving the last byte free for the terminating 0.
> In that case, it should be written very explicitly in the function
> documentation, otherwise someone may change the code and break the
> assumption.

Will add a comment to note this.

> Also, I notice another flaw in that code: it uses '\0' as string terminator.
> Text in non-ASCII encodings can contain '\0'. The end of the text must be
> handled differently, with an explicit size argument or maybe a pointer to
> the end.

Yes.  A user may request a conversion through UCS-2, which would break
because I call strlen() on the resulting string.  I need to return the
output length as reported by iconv.
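
A minimal sketch of what I mean, with a hypothetical helper name (this
is not the patch's actual function): the length comes straight from
iconv's own byte accounting, so embedded zero bytes in encodings like
UCS-2 are harmless.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical sketch: perform one conversion and report the output
 * length explicitly instead of relying on a '\0' terminator, since
 * encodings such as UCS-2 may legally contain zero bytes.
 * Returns bytes written, or -1 on error. */
static ptrdiff_t convert_len(iconv_t cd, const char *in, size_t in_len,
                             char *out, size_t out_size)
{
    char *inp = (char *)in, *outp = out;
    size_t inleft = in_len, outleft = out_size;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return -1;
    /* No strlen(): the length is out_size minus what iconv left over. */
    return out_size - outleft;
}
```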

>> I won't send another patch for a little while.  I will see how your API
>> proposal plays out.
>> And now for my tale.
>> I wanted ffmpeg to turn the string at [1] into the string at [3].  [1],
>> with the exact hex bytes at [2], is the artist tag out of a FLAC
>> file.  FLAC files have Vorbis Comment metadata tags, which are UTF-8
>> only.  If a program puts incorrect data in there, how will any other
>> program know how to display it?  What's worse is when the data gets
>> converted twice.
> Indeed. All modern formats should specify UTF-8 for all strings, there is
> absolutely no valid reason to do otherwise nowadays.

Yes, I agree.  I suspect the problem comes from older software.

>> This specific case was to convert CP1252 to UTF-8 to GBK -- that is to
>> interpret the input bytes as the CP1252 encoding and convert them to
>> UTF-8, then take those bytes and convert them to GBK.  I added the code
>> needed to take an argument in the form
>>> "CP1252,UTF-8,GBK"
>> parse it into separate encodings, open two iconv contexts, and finally
>> perform the conversion.
> I can not reproduce your conversion, but there is something that bugs me in
> your reasoning.
> With any sane iconv implementation, converting from A to B then from B to C
> will either fail if B is too poor to code the input, or succeed and then
> give the exact same result as converting directly from A to C.
> As for chaining a conversion from A to B then from C to D with B != C, this
> is braindead, and if FFmpeg were to support it at all, it should only be
> with a very explicit request from the user to perform a dangerously lossy
> conversion.
> I do not understand what you were trying to achieve with your
> "CP1252,UTF-8,GBK" conversion. First, why finish with GBK instead of sane
> UTF-8? And second, if the output is really wanted in GBK, then what is the
> purpose of the intermediate UTF-8 step?
> IMHO, the only sane way of dealing with text is to handle everything in
> UTF-8 internally: demuxers and anything that deals with input should convert
> as soon as the encoding is known, muxers and anything that deals with output
> should convert as late as possible.
> (We do not do that for audio and video because conversions are expensive,
> but text conversions are cheap and negligible compared to video and audio
> processing.)
> Regards,

I don't know what to say here.  I know the encodings needed for iconv
because I arrived at them by brute force: I wrote a short Lua script to
iterate over a list of encodings supported by my iconv, and it arrived
at this answer.  The command-line iconv tool is too clever for this
because it returns an error when it can't convert.  As for ending in
GBK, that is simply what the script told me.

(Well it told me the last encoding could be any of: CP936 GB18030 GBK.)

(Even then it isn't perfect: another file from the same CD, which I
would have expected to be created by the same incorrect process,
requires ISO-8859-1 as the first encoding.)

I assume the forward process that made the broken tag was this:
- A piece of software interpreted the correct UTF-8 string as GBK
- It then converted it to UTF-8
- Another piece of software then converted that UTF-8 to CP1252
- Those resulting bytes were placed into the FLAC file

This is why the reverse process I created results in (almost) the
original UTF-8 string.

This feature would not work if there were a misinterpretation in the
middle.  As you say, that would need A->B then C->D where B != C.
Perhaps this is why my solution isn't perfect: there should be an
assumption in the middle.

I could rework my code to allow for assumptions in the middle.  My case
would then use "CP1252,UTF-8,UTF-8,GBK" as an argument.

