[FFmpeg-devel] file protocol with Unicode support

Kirill Gavrilov gavr.mail at gmail.com
Wed Apr 13 09:50:36 CEST 2011


>
> As UTF-8 cover the whole of Unicode,
>
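Every UTF-XXX encoding covers the whole of Unicode. The difference is how
one symbol is stored.
The multibyte scheme used in UTF-8 lets strings pass silently through char*
interfaces, which makes it the most compatible with old APIs. UTF-16 is
simpler in the common case, because each symbol in the Basic Multilingual
Plane has a constant size of two bytes (only symbols outside it need a
surrogate pair).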
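
To make the storage difference concrete, here is a minimal Python sketch
(not FFmpeg code; the helper name encoded_sizes and the sample characters
are chosen purely for illustration). It counts how many bytes one code
point occupies in each encoding:

```python
def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a string holding one code point."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

samples = {
    "A": "ASCII letter",
    "я": "Cyrillic letter (BMP)",
    "€": "Euro sign (BMP)",
    "😀": "emoji outside the BMP",
}

for ch, label in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {label}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

UTF-8 grows from one to four bytes, while UTF-16 stays at two bytes until
a symbol falls outside the Basic Multilingual Plane.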

> I do not know much of windows, but as far as I know, file names in windows
> are an ugly and incomprehensible mix: they are supposed to be made of
> Unicode codepoints, they sometimes look like ASCII or extended ASCII and
> sometimes look like multibyte strings in UTF-16 (which is one of the
> stupidest things ever invented).
>
Since NTFS there is no such mix. File paths are stored as UTF-16 in the
filesystem and have an optional short name for backward compatibility with
old applications. That is how I currently open Unicode names using FFmpeg
in my application (GetShortPathName() does the trick). However, DOS names
are optional (fortunately they are still enabled by default even in
Windows 7) and ugly (who wants to see crumpled 8-character names?).
ANSI <-> Unicode conversions are done by WinAPI, so only obsolete
applications written against the ANSI subset of WinAPI functions behave
badly across different code pages.
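
As a rough illustration of why the ANSI functions misbehave across code
pages, the Python sketch below encodes one file name in two code pages.
It only approximates the conversion (the real WinAPI conversion may use
best-fit mappings rather than '?'), and the helper name to_ansi and the
sample file name are hypothetical:

```python
def to_ansi(name, codepage):
    """Encode name in the given code page, writing '?' for symbols it lacks."""
    return name.encode(codepage, errors="replace")

name = "фильм.avi"

# A Cyrillic code page can represent the name, so it round-trips.
cyrillic = to_ansi(name, "cp1251")
print(cyrillic.decode("cp1251") == name)

# A Western European code page cannot, so the name is destroyed and no
# char*-based call could address the file by it.
western = to_ansi(name, "cp1252")
print(western)
```

A name that cannot be expressed in the active code page is exactly the
case where the short 8.3 name is the only thing a char* API can still use.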

> better than UTF-16, if there is some
> way to force windows to parse the string as UTF-8, it would solve the
> problem (except if there are files with broken UTF-16 surrogates; I do not
> know if fsck tools consider this an error). It would be much simpler.
>
You cannot. As on any other system, functions with a char* interface
interpret the string in the current code page.
That is UTF-8 in most Linux distributions, but on Windows it is some
ANSI code page.
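
A small Python sketch of the effect, assuming a Windows system whose ANSI
code page is cp1251 (the helper names and the sample name are made up for
illustration). It shows what happens when UTF-8 bytes are read as ANSI,
and the UTF-16 form that the wide-character functions expect instead:

```python
def misread_as_ansi(name, codepage="cp1251"):
    """Decode the UTF-8 bytes of name as if they were in an ANSI code page."""
    return name.encode("utf-8").decode(codepage, errors="replace")

def to_wide(name):
    """Return the UTF-16 (little-endian) bytes a wide-character API would take."""
    return name.encode("utf-16-le")

name = "файл"

garbled = misread_as_ansi(name)
print(garbled)            # mojibake, not the original name
print(garbled == name)    # False: a char* call would look for the wrong file

wide = to_wide(name)
print(len(wide))          # two bytes per symbol for this BMP-only name
print(wide.decode("utf-16-le") == name)
```

So accepting UTF-8 from the caller still requires an explicit conversion
to UTF-16 before the path reaches the operating system.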


2011/4/13 Nicolas George <nicolas.george at normalesup.org>

> On quartidi 24 Germinal, year CCXIX, Tomas Härdin wrote:
> > The file protocol works fine for UTF-8 paths on Linux based systems
> > last time I checked.
>
> Linux/Unix file names are byte-based: a filename is any sequence of bytes
> except 0 and 0x2F ('/'). Interpreting this sequence of bytes as a sequence
> of characters is left to the discretion of each tool that displays file
> names or gets them from the user. This is usually done according to the
> locale settings, using an ASCII-compatible encoding, and these days UTF-8
> is the most common choice.
>
> As ffmpeg does not itself directly display or read file names (it acts
> through a tty), all these subtleties are irrelevant for it.
>
> I do not know much of windows, but as far as I know, file names in windows
> are an ugly and incomprehensible mix: they are supposed to be made of
> Unicode codepoints, they sometimes look like ASCII or extended ASCII and
> sometimes look like multibyte strings in UTF-16 (which is one of the
> stupidest things ever invented).
>
> So as far as I understand, Kirill's request is legitimate: because of
> windows's idiotic way of implementing file names i18n, there is specific
> work to do in each application to handle non-ASCII file names.
>
> But I think his patch is way too complex for that.
>
> As UTF-8 cover the whole of Unicode, better than UTF-16, if there is some
> way to force windows to parse the string as UTF-8, it would solve the
> problem (except if there are files with broken UTF-16 surrogates; I do not
> know if fsck tools consider this an error). It would be much simpler.
>
> Regards,
>
> --
>   Nicolas George
>
> -----------------------------------------------
Kirill Gavrilov,
Software designer.

