[FFmpeg-devel] [PATCH 3/3] lavc: check decoded subtitles encoding.

Sun Apr 7 12:51:34 CEST 2013

On octidi 18 Germinal, year CCXXI, Reimar Döffinger wrote:
> Why? Do you have a real-world example where it would fail (note: due to
> wrong codepage, the message you print seems to indicate this check is
> about missing code page specification).

It accepts "\xFE\x80\x80\x80\x80\x80\x80" (or anything in the \x80-\xBF
range instead of \x80) as a non-standard 7-byte sequence. It is a bit
far-fetched, but it could happen with a smiley and non-breaking spaces for
spacing, or possibly with Cyrillic or Greek with mixed-case words.

Actually, it would probably be fairly easy to fix this:

-        if ((val & 0xc0) == 0x80)\
+        if ((val & 0xc0) == 0x80 || val >= 0xFE)\
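
To make the effect of that one-line change concrete, here is a minimal
sketch of the same test pulled out of the macro into a standalone helper.
The name lead_byte_ok is hypothetical and not part of lavc; only the
condition itself comes from the diff above.

```c
#include <stdint.h>

/* Hypothetical helper, not part of lavc: returns 1 if `val` may start a
 * sequence under the stricter check from the diff, 0 otherwise. */
static int lead_byte_ok(uint8_t val)
{
    /* 0x80-0xBF are continuation bytes and cannot start a sequence;
     * 0xFE and 0xFF would announce 7 or more bytes, so reject them too. */
    if ((val & 0xc0) == 0x80 || val >= 0xFE)
        return 0;
    return 1;
}
```

Note that this still lets 0xF8-0xFD through as lead bytes (the old 5- and
6-byte forms); it only closes the 0xFE/0xFF hole.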

> I think this very much is quite the opposite, since that means you have
> to come up with a proper definition of "valid" first, making the problem
> more complex, not less.

There is the Gordian knot-style solution: the function is its own
definition. As long as the function accepts strings that everyone will
consider valid, and rejects strings that everyone considers invalid, having
an unspecified behaviour for the dubious cases in-between seems acceptable.

> Which means you are only validating individual code points.
> However there are questions like which if any non-normalized
> representations should be allowed (e.g. is only ä valid or is a
> combining pair of a and two dots ok?), what sequences should be valid
> (e.g. is it ok to have two consecutive RTL markers? If you have a RTL
> marker, must there also be a LTR marker? If not concatenating the text
> will change how it looks).
> Should right-to-left text be allowed for ASCII, combined with
> left-to-right on a single line? This might make text look completely
> different from e.g. what it would actually do if you copy-pasted it as
> pure text, which can be very relevant if the code is used for anything
> other than just subtitles (or subtitles containing instructions).
> Since I am no expert there's certain to be at least tens of more things
> to check which I am not even aware of.

I believe there is a pretty clear distinction between syntactic and semantic
validity. What you describe about combining vs. combined, LTR markers, etc.,
requires a fair share of the per-codepoint Unicode database, and is
therefore clearly semantic.

Overlong encodings (i.e. using more bytes than necessary by adding leading
0s), OTOH, are clearly syntactic, and are almost never tested by normal
string functions. For example "\xC0\xA0" would be valid if it were not
overlong, and it is fairly plausible in ISO-8859-1 French ("LÀ !").
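
As an illustration of how cheap the overlong test is, here is a rough
sketch of a strict decoder for sequences of 1 to 4 bytes. It is not the
lavc code: the function decode_strict and the table min_cp are
hypothetical names, and limiting it to 4 bytes is my assumption. The only
extra work is comparing the decoded value with the smallest code point
that actually needs that many bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one sequence from buf (size bytes available).
 * Returns the code point and stores the sequence length in *len, or
 * returns -1 if the sequence is malformed, truncated or overlong. */
static int32_t decode_strict(const uint8_t *buf, size_t size, size_t *len)
{
    /* smallest code point that legitimately needs n bytes, indexed by n */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t n, i;

    if (!size)
        return -1;
    if (buf[0] < 0x80) {                      /* plain ASCII */
        *len = 1;
        return buf[0];
    } else if ((buf[0] & 0xE0) == 0xC0) {
        n = 2; cp = buf[0] & 0x1F;
    } else if ((buf[0] & 0xF0) == 0xE0) {
        n = 3; cp = buf[0] & 0x0F;
    } else if ((buf[0] & 0xF8) == 0xF0) {
        n = 4; cp = buf[0] & 0x07;
    } else {
        return -1;                            /* stray continuation or long form */
    }
    if (size < n)
        return -1;                            /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return -1;                        /* not a continuation byte */
        cp = (cp << 6) | (buf[i] & 0x3F);
    }
    if (cp < min_cp[n])
        return -1;                            /* overlong: fits in fewer bytes */
    *len = n;
    return (int32_t)cp;
}
```

With this, "\xC0\xA0" is rejected because it would decode to U+0020, which
fits in one byte, while "\xC3\x80" (the real UTF-8 for À) is accepted.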

The case of surrogates, rBOM and high planes is more dubious: the case for
them being semantic is not absurd, but neither is the case for considering
them syntactic.
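
Since the classification of those three is open, one possible shape is to
make each rejection optional. The sketch below is only an illustration:
codepoint_ok and the REJECT_* flags are hypothetical names, and I am
assuming "rBOM" means U+FFFE and "high planes" means anything above
U+10FFFF.

```c
#include <stdint.h>

/* Hypothetical flags, one per dubious class of code points. */
#define REJECT_SURROGATES  1
#define REJECT_RBOM        2
#define REJECT_HIGH_PLANES 4

/* Hypothetical helper: returns 1 if the already-decoded code point `cp`
 * passes the checks selected by `flags`, 0 otherwise. */
static int codepoint_ok(uint32_t cp, unsigned flags)
{
    /* UTF-16 surrogates, U+D800 to U+DFFF */
    if ((flags & REJECT_SURROGATES) && cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    /* byte-swapped BOM (assumed to be what "rBOM" refers to) */
    if ((flags & REJECT_RBOM) && cp == 0xFFFE)
        return 0;
    /* beyond the last Unicode plane (assumed meaning of "high planes") */
    if ((flags & REJECT_HIGH_PLANES) && cp > 0x10FFFF)
        return 0;
    return 1;
}
```

A caller that treats all three as syntactic would pass all flags; one that
treats them as semantic would pass 0 and get the permissive behaviour.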

Regards,

-- 
  Nicolas George