[FFmpeg-devel] UTF-8/encoding/string handling ideas

Sat Nov 3 17:19:14 CET 2007

Hi

On Mon, Oct 22, 2007 at 10:24:46PM -0400, Rich Felker wrote:
> Based on the UTF-8 decoder bug report thread I started, I have some
> ideas for addressing problems in ffmpeg. Basically, the problems seem
> to come down to 2 areas:
> 
> 1. The ffmpeg application is encoding-agnostic and simply passes
> strings from the command line (e.g. container metadata) into the
> libraries without tagging the character encoding or converting it to
> UTF-8.
> 
> 2. FFmpeg libraries assume text passed to them is UTF-8, but do not
> validate UTF-8 provided by the caller before storing it to files, and
> may do really bogus things with invalid data (which the caller likely
> does not check) if converting to UCS-2/4, UTF-16, etc. for storing to
> a file.
> 
> I think point #1 is proof that point #2 is a problem. If even the
> reference application using the libs gets it wrong and passes
> incorrectly encoded or unvalidated data, how can other apps using the
> libs be expected to do better?
> 
> Here are my ideas towards a solution:
> 
> -- Behavior of the libraries (mainly libavformat)
> 
> It's good that the libraries expect data to be passed as UTF-8. FFmpeg
> does not deal with the locale's text encoding, and rightfully so
> because it's dealing with data in files that could include all sorts
> of characters not representable in the locale. Let's just make the
> parts that use text strings validate the UTF-8 before storing it to
> files and generate hard errors if there's anything invalid. Silently
> doing substitutions is a bad idea because then you end up with
> incorrect files after 12+ hour encoding jobs rather than detecting the
> mistake early.

yes
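
As a rough illustration of the "validate and fail hard" idea quoted above,
here is a minimal sketch of a strict UTF-8 validator. The name
utf8_is_valid is hypothetical and is not an existing libavformat function;
a muxer would call something like it before writing a metadata string and
return an error on failure instead of substituting characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an existing FFmpeg API).
 * Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * forms, UTF-16 surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        uint32_t cp, min;
        size_t n; /* number of continuation bytes expected */

        if (c < 0x80) {                  /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) { /* 2-byte sequence */
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) { /* 3-byte sequence */
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) { /* 4-byte sequence */
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                    /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < n)
            return 0;                    /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= n; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;                /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return 0;                    /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogate code point */
        if (cp > 0x10FFFF)
            return 0;                    /* outside the Unicode range */

        i += n + 1;
    }
    return 1;
}
```

Checking for overlong forms and surrogates matters here because the
converters to UTF-16 or UCS-2/4 are exactly where such input does "really
bogus things".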

> 
> Should the libraries also generate hard errors if the field to be
> written only supports ASCII but non-ASCII characters are passed?
> Probably... Thoughts?

and yes

but where are the patches?
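
For the ASCII-only field case, the check itself would be tiny. A sketch,
assuming a hypothetical helper named is_ascii_only (not an existing FFmpeg
function) that a muxer calls before writing such a field:

```c
#include <stddef.h>

/* Hypothetical helper (not an existing FFmpeg API).
 * Returns 1 if every byte in buf[0..len) is 7-bit ASCII, 0 otherwise. */
int is_ascii_only(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            return 0; /* high bit set: not ASCII */
    return 1;
}
```

A caller would treat a 0 return as a hard error for that field rather than
stripping or replacing the offending bytes.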

> 
> -- Behavior of ffmpeg application
> 
> Here, I think we have several options:
> 
> The UTF-8-enforcer part of me wants to say FFmpeg does not want to
> have to deal with text encodings, and should just forbid non-ASCII
> text input whenever the locale's encoding is not UTF-8. This solution
> is simple and robust, and it ensures that the strings passed to the
> libraries are always UTF-8 (because ASCII is UTF-8 too) without doing
> any conversions. It sounds like something Michael might like too. :)
> 
> The let's-be-fair-to-everyone part of me, on the other hand, says some
> conversion might be in order. My idea for conversion was to check for
> the __STDC_ISO_10646__ macro, and if it's present, use the mbrtowc
> function to convert the local encoding to wchar_t, then UTF-8 encode
> the UCS-4 values in the wchar_t. This avoids having to depend on iconv
> or nl_langinfo(CODESET) which are notoriously unreliable on some
> platforms. If __STDC_ISO_10646__ is not defined, the behavior would be
> to fall back to the big-meanie behavior described before, no non-ASCII
> text allowed.
> 
> Either way, we end up with UTF-8 strings to pass to the libraries.
> 
> Also, note that the 2 approaches aren't mutually exclusive. We could
> quickly add a "reject 8-bit octets if the encoding is not UTF-8"
> option now to prevent invalid data, and later extend to the second
> option.

patches welcome, and yes i take what i get, the one writing the patch
can decide which method he prefers, i don't mind as long as it's simple,
portable, clean and works
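
To make the second option concrete, here is a minimal sketch of the
mbrtowc-based conversion, with the ASCII-only behavior as the fallback
when __STDC_ISO_10646__ is not defined. The function names utf8_encode
and locale_to_utf8 are hypothetical, and the sketch assumes the
application has already called setlocale(LC_CTYPE, "") so that mbrtowc
decodes the user's locale encoding.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written (1..4), or 0 if cp is a
 * surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert src (locale encoding) to a
 * NUL-terminated UTF-8 string in dst.
 * Returns 0 on success, -1 on invalid input or insufficient space. */
int locale_to_utf8(const char *src, char *dst, size_t dst_size)
{
    size_t left = strlen(src), o = 0;
#ifdef __STDC_ISO_10646__
    /* wchar_t values are ISO 10646 code points, so they can be
     * UTF-8 encoded directly. */
    mbstate_t st;
    memset(&st, 0, sizeof(st));
    while (left > 0) {
        wchar_t wc;
        unsigned char tmp[4];
        size_t m, n = mbrtowc(&wc, src, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            return -1; /* invalid, incomplete, or unexpected NUL */
        m = utf8_encode((unsigned long)wc, tmp);
        if (m == 0 || o + m >= dst_size)
            return -1; /* bad code point or no room (keep 1 byte for NUL) */
        memcpy(dst + o, tmp, m);
        o += m;
        src += n;
        left -= n;
    }
#else
    /* Fallback: the wchar_t encoding is unknown, so allow ASCII only. */
    while (left > 0) {
        if ((unsigned char)*src >= 0x80 || o + 1 >= dst_size)
            return -1;
        dst[o++] = *src++;
        left--;
    }
#endif
    if (o >= dst_size)
        return -1;
    dst[o] = '\0';
    return 0;
}
```

Either branch leaves the caller with a UTF-8 string or a hard error, which
is the property the libraries would rely on.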

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

It is dangerous to be right in matters on which the established authorities
are wrong. -- Voltaire