[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Wed Nov 13 19:15:31 CET 2013

On date Wednesday 2013-11-13 17:51:32 +0100, Nicolas George encoded:
> On Tridi 23 Brumaire, year CCXXII, Stefano Sabatini wrote:
> > Yes, on the other hand developers would easily forget to update their
> > commit log, resulting in missing entries in the resulting output (and
> > we're not allowed to change commit logs). I don't know if git allows
> > marking up specific commits after they have been committed.
> 
> There is git-notes; it allows attaching a note that will be displayed along
> with the commit message. Unfortunately, it is not cloned by default. Another
> solution would be to add an empty commit to the history with the APIchanges
> tag; unfortunately, in that case the commit would not appear in the log for
> the corresponding files.
> 
> But enough of these digressions.
> 
> > Changed both, but with the only difference that endp points to the
> > last byte in the buffer, in order to avoid overflow issues.
> 
> The C standard specifically allows pointers to the first byte after an
> object, probably exactly for this kind of situation. And it is easier to
> write:
> 
>     end = buf + size;
> 
> ... than to subtract one, because you must check size for 0 (C does not
> allow a pointer to the byte before an object, and anyways size is probably
> unsigned).

Suppose that PTR+1 overflows: then you have PTR+1 = 0 < PTR, and in
that case the code will misbehave. I don't know if the specs explicitly
guarantee that this cannot happen (that is, that PTR+1 does not overflow
for any pointer to an allocated byte).
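
A minimal sketch of the "end = buf + size" idiom under discussion,
assuming a hypothetical helper count_lead_bytes() that is not part of
the patch. C permits forming and comparing the pointer one past the last
element of an array object, as long as it is never dereferenced, so the
loop below only ever reads bytes strictly before end:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: count the bytes in
 * buf[0..size) that are not UTF-8 continuation bytes (10xxxxxx). */
static size_t count_lead_bytes(const uint8_t *buf, size_t size)
{
    const uint8_t *p   = buf;
    const uint8_t *end = buf + size; /* one past the last byte, never dereferenced */
    size_t n = 0;

    assert(buf != NULL);

    while (p < end) {          /* with size == 0, p == end and the loop is skipped */
        if ((*p & 0xC0) != 0x80)
            n++;
        p++;
    }
    return n;
}
```

With this form there is no special case for size == 0, which is the
point made in the quoted text above.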

> > I implemented the code < (1<<31) check in the patch. I don't know what
> > exactly you mean by "Unicode range check"; indeed there is a lot of
> > documentation about which code points should be considered valid, and
> > for some it is not entirely clear (for example surrogates).
> 
> There is absolutely no doubt about surrogates: they are only valid in
> UTF-16.
> The most ambiguous issue is the upper bound: it was initially 0xFFFF, then
> became 0x7FFFFFFF when thousands of ideograms were found in old books, and
> then was lowered to 0x10FFFF when it became apparent that Microsoft and Sun
> had once again made a mess with UTF-16.
> 
> > Which flags do you propose to support?
> 
> Default, accept any code that is structurally valid in current Unicode:
> 0x000000-0x10FFFF except the surrogates, 0xFFFE and 0xFFFF.

> Flag #1: accept any code that is structurally possible in UTF-8, i.e.
> 0x00000000-0x7FFFFFFF.
> Flag #2: reject codes that would make XML choke.

That is: exclude various ASCII control codes, UTF-16 surrogates, and
codes over the 0x10FFFF upper bound.
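
As a rough sketch of what such a check could look like, here is the
"Char" production from the XML 1.0 spec written as a C predicate. The
name is_xml_char() is hypothetical and only meant for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: return 1 if code matches the XML 1.0 "Char"
 * production, 0 otherwise:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 */
static int is_xml_char(uint32_t code)
{
    return code == 0x9 || code == 0xA || code == 0xD ||
           (code >= 0x20    && code <= 0xD7FF) ||   /* stops before the surrogates */
           (code >= 0xE000  && code <= 0xFFFD) ||   /* skips 0xFFFE and 0xFFFF     */
           (code >= 0x10000 && code <= 0x10FFFF);   /* upper bound                 */
}
```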

> (Flag #3: toggle the default check for overlong encodings.)

?

Or we could have something like:
AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_OVERLONG       ///< exclude codepoints over 0x10FFFF
AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_CONTROL        ///< exclude invalid XML control codes
AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_SURROGATES     ///< exclude UTF-16 surrogate codes
AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_NON_CHARACTERS ///< exclude non-characters - 0xFFFE and 0xFFFF

and so we could define:
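#define AV_UTF8_CHECK_RANGE_FLAG_XML \
        (AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_SURROGATES | AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_OVERLONG | \
         AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_NON_CHARACTERS | AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_CONTROL)

A safe default could be:
#define AV_UTF8_CHECK_RANGE_FLAG_LOOSE \
        (AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_SURROGATES | AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_OVERLONG | \
         AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_NON_CHARACTERS)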
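
To make the proposal concrete, here is a minimal sketch of how such
flags could be applied to a decoded code point. The bit values and the
function name utf8_code_in_range() are assumptions for illustration
only; nothing here is the actual patch.

```c
#include <stdint.h>

/* Bit values are assumptions; only the names come from the list above. */
#define AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_OVERLONG       1 /* codes over 0x10FFFF    */
#define AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_CONTROL        2 /* invalid XML controls   */
#define AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_SURROGATES     4 /* 0xD800-0xDFFF          */
#define AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_NON_CHARACTERS 8 /* 0xFFFE and 0xFFFF      */

#define AV_UTF8_CHECK_RANGE_FLAG_LOOSE \
    (AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_SURROGATES | \
     AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_OVERLONG   | \
     AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_NON_CHARACTERS)
#define AV_UTF8_CHECK_RANGE_FLAG_XML \
    (AV_UTF8_CHECK_RANGE_FLAG_LOOSE | AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_CONTROL)

/* Hypothetical helper: return 1 if code is accepted under flags, else 0. */
static int utf8_code_in_range(uint32_t code, unsigned flags)
{
    if ((flags & AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_OVERLONG) && code > 0x10FFFF)
        return 0;
    /* Controls below 0x20 other than tab, line feed and carriage return. */
    if ((flags & AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_CONTROL) &&
        code < 0x20 && code != 0x9 && code != 0xA && code != 0xD)
        return 0;
    if ((flags & AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_SURROGATES) &&
        code >= 0xD800 && code <= 0xDFFF)
        return 0;
    if ((flags & AV_UTF8_CHECK_RANGE_FLAG_EXCLUDE_NON_CHARACTERS) &&
        (code == 0xFFFE || code == 0xFFFF))
        return 0;
    /* With no flags set, accept anything structurally possible in UTF-8. */
    return code <= 0x7FFFFFFF;
}
```

With flags == 0 this accepts the whole 0x00000000-0x7FFFFFFF range, and
each flag only removes codes from that set, so the flags compose by
simple OR.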

> > I cheated, indeed this list is directly taken from the XML specs:
> > http://www.w3.org/TR/xml/#charsets
> > 
> > after much time spent browsing various Unicode documents. Thus I
> > suppose these ranges should be universally accepted by XML parsers.
> 
> Ok.
> 
> > On the other hand I'm not sure what we should really disallow by
> > default, for example JSON parsers are usually much less strict than
> > XML parsers with regard to accepted code-points.
> 
> I agree, but surrogates, 0xFFFE, 0xFFFF and codes beyond 0x10FFFF should
> really not be there.
-- 
FFmpeg = Frenzy and Formidable MultiPurpose Epic Gargoyle