[FFmpeg-devel] file protocol with Unicode support

Kirill Gavrilov gavr.mail at gmail.com
Wed Apr 13 11:35:32 CEST 2011


>
> Not exactly: UTF-8 can express any codepoint between 0 and 0x7FFFFFFF, while
> UTF-16 can only express codepoints between 0 and 0x10FFFF.
>
OK, UTF-16 is not so good, but that doesn't make it 'idiotic'.
Sure, independence from byte order and backward compatibility with ASCII
make UTF-8 the most widely used encoding today.
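
Both points (ASCII compatibility and byte order), plus the 2-or-4-octet
behaviour of UTF-16 mentioned in the quoted message, are easy to check. A
minimal sketch in Python using only the standard codecs; the sample strings
are my own and purely illustrative:

```python
# Compare how UTF-8 and UTF-16 encode the same text.
ascii_text = "ffmpeg.c"

# UTF-8 is byte-for-byte identical to ASCII for ASCII-only text.
utf8_bytes = ascii_text.encode("utf-8")
same_as_ascii = utf8_bytes == ascii_text.encode("ascii")

# UTF-16 output depends on the byte order chosen by the writer.
utf16_le = "A".encode("utf-16-le")  # low byte first
utf16_be = "A".encode("utf-16-be")  # high byte first

# UTF-16 is not fixed-width: codepoints above 0xFFFF need a surrogate pair.
bmp_len = len("\u0416".encode("utf-16-le"))         # U+0416, inside the BMP
astral_len = len("\U0001D11E".encode("utf-16-le"))  # U+1D11E, outside the BMP

print(same_as_ascii, utf16_le, utf16_be, bmp_len, astral_len)
```

The two UTF-16 byte strings differ only in byte order, which is why a reader
of UTF-16 data has to be told (or has to detect) the endianness.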

> Several APIs, optional compatibility stuff, automatic conversions with
> stateful context-dependent parameters, and you dare say this is not an ugly
> mix?
>
Sure, but at least if you use only the Unicode API you won't get garbled
text in the application title, whereas with char* you need to know which
character encoding your text is in.
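
To illustrate the char* problem: the bytes alone do not say which encoding
they are in. A small sketch in Python, assuming a UTF-8 writer and a reader
that expects the Windows-1251 code page (both choices are only an example,
as is the sample string):

```python
# The same bytes read under two different encoding assumptions.
title = "Привет"

# Bytes as a UTF-8 aware program would store them.
raw = title.encode("utf-8")

# A program that assumes the Windows-1251 code page sees something else.
garbled = raw.decode("cp1251")

# No information is lost; the bytes were only interpreted wrongly.
recovered = garbled.encode("cp1251").decode("utf-8")

print(ascii(garbled))      # escaped form, safe on any terminal
print(recovered == title)
```

The garbled string has twice as many characters as the original, because
each Cyrillic letter takes two bytes in UTF-8 and the code page maps every
byte to a character of its own.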

> And please avoid using such a vague term as "ANSI": it means "American
> National Standards Institute", this is not a technical term for an
> encoding.
>
That's because the term is very commonly used that way. From Wikipedia:

> In Microsoft Windows <http://en.wikipedia.org/wiki/Microsoft_Windows>, the
> phrase "ANSI" refers to the Windows ANSI code pages
> <http://en.wikipedia.org/wiki/Windows_code_page> (even though they are not
> ANSI standards).
>

> Then change the "current code page", whatever that means.
>
The code page (or locale) is defined system-wide. You cannot switch it to
another one from within your application.
Besides, I don't think UTF-8 can be used as a locale encoding in Windows
(only the Windows ANSI code pages can).
The setlocale() function doesn't affect file I/O functions.

> There is no such thing as code page in Linux, except in drivers for
> microsoft's formats and protocols.
>
That's just a terminology mismatch. Linux has locales, and each locale
defines a character encoding (the equivalent of a code page).
These days UTF-8 is used for most languages:

> $ locale -a
> C
> en_GB.utf8
> fr_FR.utf8
> POSIX
> ru_RU.utf8
>

But 3-4 years ago this was not the case, and a locale mismatch caused
filenames, and probably other things too, to be displayed incorrectly
(as garbled text).
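
If you want to see which encoding the current locale implies, here is a
minimal sketch in Python. It only reports what the runtime detects from the
environment; the variable names are mine, and the output depends on the
system it runs on:

```python
import locale
import sys

# Encoding implied by the user's locale settings (LANG / LC_ALL / LC_CTYPE).
locale_encoding = locale.getpreferredencoding(False)

# Encoding Python uses to convert filenames between str and bytes.
fs_encoding = sys.getfilesystemencoding()

print("locale encoding:    ", locale_encoding)
print("filesystem encoding:", fs_encoding)
```

On Linux a filename is just a sequence of bytes, so a name written under one
locale encoding and listed under another is shown garbled, which is the
mismatch described above.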

2011/4/13 Nicolas George <nicolas.george at normalesup.org>

> On quartidi 24 Germinal, year CCXIX, Kirill Gavrilov wrote:
> > Any of UTF-XXX covers the whole Unicode.
>
> Not exactly: UTF-8 can express any codepoint between 0 and 0x7FFFFFFF, while
> UTF-16 can only express codepoints between 0 and 0x10FFFF.
>
> The codepoints are densely defined up to 0x2FA1D at this time. This means
> that the limit introduced by UTF-16 could possibly be a problem in a
> foreseeable future.
>
> > However UTF-16 is much simpler because each symbol has constant size.
>
> This is completely false. You are confusing UTF-16 with UCS-2. UCS-2 has the
> property you just stated, but is limited to codepoints up to 0xFFFF, and
> Unicode exceeded that limit ten years ago. With UTF-16, each codepoint
> takes either 2 or 4 octets.
>
> The fact that a lot of badly informed people make that mistake is another
> reason why UTF-16 is such an idiotic choice.
>
> And to be complete, even UCS-2 does not have the property you stated: there
> is no such thing as "symbols". There are codepoints and characters.
> Codepoints have a constant size in UCS-2, but characters can be made of any
> amount of codepoints.
>
> By the way, UTF-16 and UCS-2 (and UCS-4) also suffer from the endianness
> problem.
>
> > Since NTFS there is no mix. File paths are stored as UTF-16 in the
> > filesystem and have (optionally) a short name for backward compatibility
> > with old applications. That's how I currently open Unicode names using
> > FFmpeg in my application (GetShortPathName() does the trick).
> > However, DOS names are optional (fortunately they are still enabled by
> > default even in Windows 7) and ugly (who wants to see crumpled
> > 8-character names?).
> > ANSI <-> Unicode conversions are done by WinAPI, so only obsolete
> > applications written against the ANSI subset of WinAPI functions behave
> > badly across different code pages.
>
> Several APIs, optional compatibility stuff, automatic conversions with
> stateful context-dependent parameters, and you dare say this is not an ugly
> mix?
>
> And please avoid using such a vague term as "ANSI": it means "American
> National Standards Institute", this is not a technical term for an
> encoding.
>
> > You cannot. As on any other system, functions with a char* interface
> > will treat the string as being in the current code page.
>
> Then change the "current code page", whatever that means.
>
> > It will be UTF-8 in most Linux distributions
>
> There is no such thing as code page in Linux, except in drivers for
> microsoft's formats and protocols.
>
> Regards,
>
> --
>   Nicolas George
>
-----------------------------------------------
Kirill Gavrilov,
Software designer.

