[FFmpeg-devel] [PATCH 3/3] lavc: check decoded subtitles encoding.

Reimar Döffinger Reimar.Doeffinger at gmx.de
Sun Apr 7 13:50:08 CEST 2013

On Sun, Apr 07, 2013 at 12:51:34PM +0200, Nicolas George wrote:
> L'octidi 18 germinal, an CCXXI, Reimar Döffinger a écrit :
> > Why? Do you have an real-world example where it would fail (note: due to
> > wrong codepage, the message you print seems to indicate this check is
> > about missing code page specification).
> It accepts "\xFE\x80\x80\x80\x80\x80\x80" (or anything in the \x80-\xBF
> range instead of \x80) as a non-standard 7 bytes sequence. It is a bit
> far-fetched, but it could happen with a smiley and non-break spaces for
> spacing, or possibly with Cyrillic or Green with mixed-case words.
> Actually, it would probably be fairly easy to fix this:
> -        if ((val & 0xc0) == 0x80)\
> +        if ((val & 0xc0) == 0x80 || val >= 0xFE)\

Not sure I see this as a big issue/likely to happen in real-world cases.
Still there is a bug in GET_UTF8: It claims to read up to 7 times max,
however since the latest optimizations it will read 9 times for this
FF 80 80 80 80 80 80 80 80
So yes, a check like this should probably be added - or the function
"deoptimized" to a non-broken state.

> > I think this very much is quite the opposite, since that means you have
> > to come up with a proper definition of "valid" first, making the problem
> > more complex, not less.
> There is the Gordian knot-style solution: the function is its own
> definition. As long as the function accepts strings that everyone will
> consider valid, and rejects strings that everyone considers invalid, having
> an unspecified behaviour for the dubious cases in-between seems acceptable.

IMHO not for a public API, and preferably not for something that has
"valid" in its name.
If it's up to discussion I'd have preferred a more vague name like
guess_utf8 or such.

> Overlong encodings (i.e. using more bytes than necessary by adding leading
> 0s), OTOH, is clearly syntactic, and are almost never tested by normal
> string functions. For example "\xC0\xA0" would be valid if it was not
> overlong, and it is fairly plausible in ISO-8859-1 French ("LÀ !").

Possibly reasonable to add to GET_UTF8 as well - though might be worth
discussion, also for performance reasons.
Otherwise I believe this would do it:
--- a/libavutil/common.h
+++ b/libavutil/common.h
@@ -279,6 +279,7 @@ static av_always_inline av_const int av_popcount64_c(uint64_t x)
 #define GET_UTF8(val, GET_BYTE, ERROR)\
     val= GET_BYTE;\
+        uint32_t min = 0;\
         uint32_t top = (val & 128) >> 1;\
         if ((val & 0xc0) == 0x80)\
@@ -287,9 +288,12 @@ static av_always_inline av_const int av_popcount64_c(uint64_t x)
             val= (val<<6) + tmp;\
+            min = top << 1;\
             top <<= 5;\
         val &= (top << 1) - 1;\
+        if (val < min)\
+            ERROR\

More information about the ffmpeg-devel mailing list