[FFmpeg-devel] file protocol with Unicode support

Wed Apr 13 10:20:10 CEST 2011

On Quartidi 24 Germinal, year CCXIX, Kirill Gavrilov wrote:
> Any of UTF-XXX covers the whole Unicode.

Not exactly: UTF-8, as originally designed with sequences of up to 6 octets,
can express any codepoint between 0 and 0x7FFFFFFF, while UTF-16 can only
express codepoints between 0 and 0x10FFFF.
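
Both limits follow from the bit layouts. A minimal Python sketch of the
arithmetic, assuming the original UTF-8 design with sequences of up to 6
octets (the variable names are only illustrative):

```python
# Largest codepoint reachable with a UTF-16 surrogate pair:
# 10 payload bits in the high surrogate, 10 in the low one,
# offset by 0x10000 because codepoints below that are encoded directly.
utf16_max = 0x10000 + (1 << 20) - 1

# Largest codepoint in the original 6-octet UTF-8 layout:
# 1 payload bit in the lead octet (1111110x) plus 5 continuation
# octets carrying 6 payload bits each, 31 bits in total.
utf8_original_max = (1 << (1 + 5 * 6)) - 1

print(hex(utf16_max))          # 0x10ffff
print(hex(utf8_original_max))  # 0x7fffffff
```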

The codepoints are densely defined up to 0x2FA1D at this time. This means
that the limit introduced by UTF-16 could become a problem in the
foreseeable future.

> However UTF-16 is much simpler because each symbol has constant size.

This is completely false. You are confusing UTF-16 with UCS-2. UCS-2 has the
property you just stated, but is limited to codepoints up to 0xFFFF, and
Unicode exceeded that limit ten years ago. With UTF-16, each codepoint
takes either 2 or 4 octets.
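
The variable size is easy to check. A minimal Python sketch, where
utf16_octets is a hypothetical helper written for this illustration:

```python
def utf16_octets(ch: str) -> int:
    """Number of octets one codepoint occupies in UTF-16 (no BOM)."""
    return len(ch.encode("utf-16-le"))

print(utf16_octets("A"))           # U+0041, below 0x10000: 2
print(utf16_octets("\u20ac"))      # U+20AC EURO SIGN, below 0x10000: 2
print(utf16_octets("\U0001D11E"))  # U+1D11E MUSICAL SYMBOL G CLEF: 4
```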

The fact that a lot of badly informed people make that mistake is another
reason why UTF-16 is such an idiotic choice.

And to be complete, even UCS-2 does not have the property you stated: there
is no such thing as "symbols". There are codepoints and characters.
Codepoints have a constant size in UCS-2, but characters can be made of any
number of codepoints.
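
As an illustration, the same visible character can be one codepoint or two.
A minimal Python sketch using the standard unicodedata module; both
codepoints are below 0x10000, so their UTF-16 encoding here matches what
UCS-2 would produce:

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two codepoints

print(len(composed), len(decomposed))  # 1 2

# Normalization shows that both spell the same character.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Constant size per codepoint, but not per character: 2 octets versus 4.
print(len(composed.encode("utf-16-le")), len(decomposed.encode("utf-16-le")))
```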

By the way, UTF-16 and UCS-2 (and UCS-4) also suffer from the endianness
problem.
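
A minimal Python sketch of that problem: the same codepoint gives two
different octet sequences in UTF-16, and reading them with the wrong byte
order silently produces another codepoint, while UTF-8 has a single octet
order.

```python
text = "A"  # U+0041

little = text.encode("utf-16-le")
big = text.encode("utf-16-be")
print(little, big)  # b'A\x00' b'\x00A'

# Decoding with the wrong byte order raises no error, it just yields U+4100.
wrong = little.decode("utf-16-be")
print(hex(ord(wrong)))  # 0x4100

# UTF-8 is defined as a sequence of octets, so there is nothing to swap.
print(text.encode("utf-8"))  # b'A'
```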

> Since NTFS there no any mix. Files paths are stored as UTF-16 in filesystem
> and has (optionally short name for backward compatibility with old
> applications). Thats how I open Unicode names using FFmpeg in my application
> this time (GetShortPathName() do that trick). However DOS-names are optional
> (fortunately it still enabled by default even in Windows 7) and ugly (who
> wants to see crumpled 8-chars long names?). ANSI <-> Unicode conversions are
> done by WinAPI, thus only obsolete applications written against ANSI
> functions subset in WinAPI works ugly within different code pages.

Several APIs, optional compatibility stuff, automatic conversions with
stateful context-dependent parameters, and you dare say this is not an ugly
mix?
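
To illustrate why such context-dependent conversions are fragile, here is a
minimal Python sketch. It does not call any Windows API; it only assumes
Python's cp1252 and cp1251 codecs as stand-ins for two possible "current code
pages":

```python
raw = b"\xe9"  # one octet, as it might reach a char* interface

as_cp1252 = raw.decode("cp1252")  # Western European code page
as_cp1251 = raw.decode("cp1251")  # Cyrillic code page

# Same octet, two different characters depending on the ambient setting.
print(f"U+{ord(as_cp1252):04X}", f"U+{ord(as_cp1251):04X}")  # U+00E9 U+0439

# A name outside the active code page cannot be converted at all.
try:
    "\u0439".encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False
```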

And please avoid using such a vague term as "ANSI": it stands for "American
National Standards Institute" and is not a technical term for an encoding.

> you can not. As on any other system functions with char* interface will work
> as this string in current code page.

Then change the "current code page", whatever that means.

> It will be UTF-8 in most Linux distributions

There is no such thing as a code page in Linux, except in drivers for
Microsoft's formats and protocols.

Regards,

-- 
  Nicolas George