[FFmpeg-devel] [PATCH 3/3] lavc: check decoded subtitles encoding.

Reimar Döffinger Reimar.Doeffinger at gmx.de
Sun Apr 7 11:52:27 CEST 2013

On Sun, Apr 07, 2013 at 11:19:42AM +0200, Nicolas George wrote:
> I do not think it is a real problem if the validation is not 100%
> waterproof: there is no formal definition of valid UTF-8 (like there is for
> XML), only guidelines to detect common bugs and limitations that depend on
> the use.

I think it is quite the opposite: it means you have to come up with a
proper definition of "valid" first, which makes the problem more
complex, not less.

> On the question of validating carefully, it is actually fairly trivial.
> Testing the codepoints is actually simpler than extracting them in the first
> place.

Which means you are only validating individual code points.
However, there are questions like which, if any, non-normalized
representations should be allowed (e.g. is only ä valid or is a
combining pair of a and two dots ok?), and what sequences should be
valid (e.g. is it ok to have two consecutive RTL markers? If you have an
RTL marker, must there also be an LTR marker? If not, concatenating the
text will change how it looks).
Should right-to-left text be allowed for ASCII, combined with
left-to-right on a single line? This might make the displayed text look
completely different from what you would get if you copy-pasted it as
plain text, which can be very relevant if the code is used for anything
other than just subtitles (or for subtitles containing instructions).
Since I am no expert, there are certain to be at least dozens more
things to check that I am not even aware of.
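
To make the point above concrete, here is a rough sketch in Python (not
lavc code, and not a proposal for the patch). The helper names
codepoints_ok() and unclosed_bidi() are made up for illustration, and
the set of checks is an assumption about what a "per code point"
validator might do. Both spellings of ä pass the per-code-point check
even though they are different strings, and a string with an unclosed
directional override passes too:

```python
import unicodedata

def codepoints_ok(text):
    """Hypothetical per-code-point check: reject surrogates and noncharacters."""
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:          # surrogate range
            return False
        if (cp & 0xFFFE) == 0xFFFE:         # U+xxFFFE / U+xxFFFF in any plane
            return False
        if 0xFDD0 <= cp <= 0xFDEF:          # noncharacter block
            return False
    return True

def unclosed_bidi(text):
    """Hypothetical sequence check: count embeddings/overrides (LRE, RLE,
    LRO, RLO) that are never closed by POP DIRECTIONAL FORMATTING."""
    depth = 0
    for ch in text:
        if ch in "\u202a\u202b\u202d\u202e":
            depth += 1
        elif ch == "\u202c" and depth > 0:
            depth -= 1
    return depth

precomposed = "\u00e4"    # ä as one code point
combining = "a\u0308"     # a followed by COMBINING DIAERESIS

# Both forms pass the per-code-point check, yet they are not equal
# until one of them is normalized.
assert codepoints_ok(precomposed) and codepoints_ok(combining)
assert precomposed != combining
assert unicodedata.normalize("NFC", combining) == precomposed

# An unclosed right-to-left override also passes the per-code-point
# check; only a check on the sequence notices it.
overridden = "\u202eabc"
assert codepoints_ok(overridden)
assert unclosed_bidi(overridden) == 1
```

Whether either case should count as "invalid" is exactly the definition
that would have to be agreed on first; the sketch only shows that
checking code points one at a time cannot answer it.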
