[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function

Nicolas George george at nsup.org
Wed Nov 13 17:51:32 CET 2013


On Tridi 23 Brumaire, year CCXXII, Stefano Sabatini wrote:
> Yes, on the other hand developers would easily forget to update their
> commit log, resulting in missing entries in the resulting output (and
> we're not allowed to change log commits). I don't know if git allows
> to markup specific commits after they have been committed.

There is git-notes: it lets you attach a note that is displayed along with
the commit message. Unfortunately, notes are not cloned by default. Another
solution would be to add an empty commit to the history with the APIchanges
tag; unfortunately, in that case the commit would not appear in the log for
the corresponding files.

But enough of these digressions.

> Changed both, but with the only difference that endp points to the
> last byte in the buffer, in order to avoid overflow issues.

The C standard specifically allows pointers to the first byte after an
object, probably exactly for this kind of situation. And it is easier to
write:

    end = buf + size;

... than to subtract one, because then you must check size for 0 (C does
not allow a pointer to the byte before an object, and anyway size is
probably unsigned).
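
To illustrate the point about the one-past-the-end pointer, here is a minimal
sketch. The helper count_ascii() is hypothetical and not part of the patch; it
only shows that end = buf + size needs no special case for size == 0, because
end is compared against but never dereferenced.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: count the bytes in a buffer
 * that are plain ASCII, using a one-past-the-end pointer as loop bound. */
static size_t count_ascii(const uint8_t *buf, size_t size)
{
    const uint8_t *p   = buf;
    const uint8_t *end = buf + size; /* valid even when size == 0 */
    size_t n = 0;

    while (p < end) {                /* end is compared, never dereferenced */
        if (*p < 0x80)
            n++;
        p++;
    }
    return n;
}
```

With a "last byte" pointer (buf + size - 1) the same loop would need an
explicit size == 0 check before the pointer is even computed.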

> I implemented the code < (1<<31) check in the patch. I don't know what
> you exactly mean by "Unicode range check", indeed there is a lot of
> documentation about which code points should be considered valid, and
> for some it is not entirely clear (for example surrogates).

There is absolutely no doubt about surrogates: they are only valid in
UTF-16.

The most ambiguous issue is the upper bound: it was initially 0xFFFF, then
became 0x7FFFFFFF when thousands of ideograms were found in old books, and
then was lowered to 0x10FFFF when it became apparent that Microsoft and Sun
had once again made a mess with UTF-16.

> Which flags do you propose to support?

Default: accept any code that is structurally valid in current Unicode,
i.e. 0x000000-0x10FFFF except the surrogates (0xD800-0xDFFF), 0xFFFE and
0xFFFF.

Flag #1: accept any code that is structurally possible in UTF-8, i.e.
0x00000000-0x7FFFFFFF.

Flag #2: reject codes that would make XML choke.

(Flag #3: toggle the default check for overlong encodings.)
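
The range checks proposed above could be sketched roughly as follows. The flag
names and the function codepoint_is_accepted() are hypothetical, chosen for
this sketch only, and the way the flags combine is an assumption. The XML test
follows the Char production of XML 1.0. Flag #3 is left out because overlong
encodings can only be detected while decoding the bytes, not from the
resulting code.

```c
#include <stdint.h>

/* Hypothetical flag names, for illustration only. */
#define CP_FLAG_ACCEPT_ANY_UTF8    1 /* flag #1: allow 0x00000000-0x7FFFFFFF */
#define CP_FLAG_REJECT_XML_INVALID 2 /* flag #2: reject what XML 1.0 rejects */

/* Return 1 if code is acceptable under the given flags, 0 otherwise. */
static int codepoint_is_accepted(uint32_t code, unsigned flags)
{
    if (flags & CP_FLAG_REJECT_XML_INVALID) {
        /* XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
         *                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
        if (!(code == 0x9 || code == 0xA || code == 0xD ||
              (code >= 0x20    && code <= 0xD7FF) ||
              (code >= 0xE000  && code <= 0xFFFD) ||
              (code >= 0x10000 && code <= 0x10FFFF)))
            return 0;
    }

    /* Flag #1: anything that fits in 31 bits is structurally possible. */
    if (flags & CP_FLAG_ACCEPT_ANY_UTF8)
        return code <= 0x7FFFFFFF;

    /* Default: current Unicode range, minus surrogates, 0xFFFE and 0xFFFF. */
    if (code > 0x10FFFF)
        return 0;
    if (code >= 0xD800 && code <= 0xDFFF)
        return 0;
    if (code == 0xFFFE || code == 0xFFFF)
        return 0;
    return 1;
}
```

For example, under these assumptions 0xD800 is rejected by default but
accepted with flag #1 alone, and 0x01 is accepted by default but rejected
with flag #2.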

> I cheated, indeed this list is directly taken from the XML specs:
> http://www.w3.org/TR/xml/#charsets
> 
> after much time spent browsing various Unicode documents. Thus I
> suppose these ranges should be universally accepted by XML parsers.

Ok.

> On the other hand I'm not sure what we should really disallow by
> default, for example JSON parsers are usually much less strict than
> XML parsers with regards to accepted code-points.

I agree, but surrogates, 0xFFFE, 0xFFFF and codes beyond 0x10FFFF should
really not be there.

Regards,

-- 
  Nicolas George