[FFmpeg-devel] file protocol with Unicode support

Wed Apr 13 13:07:24 CEST 2011

On Wed, 2011-04-13 at 12:52 +0200, Nicolas George wrote:
> > While using char* you need to know in which character encoding is your text.
> 
> No you don't. The only people who need to know the character encoding are
> those who do linguistic and rendering operations on it. The other ones only
> need to pass the string around.

Well, the information about the encoding does need to be preserved, and
passed around *with* the string.

Otherwise it just gets lost, and by the time the string needs to be
rendered, the encoding is unknown.

Historically, the biggest problem with the legacy 8-bit character sets
has been that people do "just pass the string around", and neglect to
also pass the charset information with it. And thus you end up having to
*guess* what character set a given string is in when you receive it.

Thankfully, the world is now becoming a slightly saner place as everyone
settles on UTF-8. We can convert legacy crap to UTF-8 on the way in to
our "system" (i.e. our library/program/network/database/whatever) and
always label things as UTF-8 on the way *out* of our system, and
everything works fairly well.

It's even fairly reasonable to assume that an unlabelled string is in
UTF-8 — and it's a testable hypothesis, too, since not all byte
sequences are valid UTF-8. So a modern library can return an *error* if
it is passed a string that is not valid UTF-8.
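
To illustrate why the hypothesis is testable, here is a minimal sketch
of such a check. The name is_valid_utf8 is hypothetical, and the rules
it applies are the RFC 3629 ones as I understand them: no overlong
forms, no surrogates, nothing above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t n;           /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) {
            i++;            /* plain ASCII */
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation or invalid lead byte */
        }

        if (len - i <= n)
            return false;   /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        if (cp < min)
            return false;   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;   /* beyond the Unicode range */

        i += n + 1;
    }
    return true;
}
```

A lone Latin-1 byte such as 0xE9 fails this check, because it looks
like a three-byte lead with no continuation bytes after it. That is
what lets a library reject mislabelled input instead of silently
producing garbage.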

Hm, I'm not sure where I was going with this... ah yes, I wanted to make
the point that "only need to pass the string around" is *wrong*. Do
*not* think like that. You *must* preserve the encoding information,
even if it's only done implicitly by always converting to UTF-8.

-- 
dwmw2