[FFmpeg-devel] [RFC] function to check for valid UTF-8 string

Rich Felker dalias
Mon Dec 10 17:01:59 CET 2007


On Mon, Dec 10, 2007 at 11:47:47AM +0100, Reimar Döffinger wrote:
> Hello,
> On Sun, Dec 09, 2007 at 05:09:11PM -0500, Rich Felker wrote:
> > On Sun, Dec 09, 2007 at 11:18:32AM +0100, Reimar Döffinger wrote:
> > > since Rich seems to have given up on it, here is a proposed patch
> > > that adds a av_check_utf8 function that could be used to validate
> > > input strings.
> > > Since I hacked it up very quickly please forgive any bugs or other
> > > stupidity.
> > 
> > Read RFC 3629. There's a very simple way to validate byte sequences
> > (using the given ABNF) without any decoding required, and it's less
> > likely to be buggy. Your patch relies on GET_UTF8 not being buggy,
> > which is quite doubtful IMO..
> 
> Now reading RFC 3629 was a useless exercise. Their ABNF certainly isn't
> my idea of simple (actually "mess" fits it better) and is mostly what
> I thought about as an alternative.
> Maybe it actually is less likely to be buggy, but this is not worth much
> here because:
> 1) if GET_UTF8 is broken our UTF-8 handling is most likely broken
> anyway, and I don't think it will help much if av_check_utf8 is not
> broken.

GET_UTF8 is only broken in the case where invalid sequences are passed
to it. As long as the input has already been validated as UTF-8, it's
basically impossible for the decoder to misbehave.

> 2) There is at least some chance that if GET_UTF8 ever breaks that
> somebody will notice it, whereas it is almost certain that av_check_utf8
> being replaced by a return NULL would go unnoticed for ages (even if
> we add a regression check that is problematic), so I actually do believe
> that using GET_UTF8 and a little bit of custom code will be more robust
> than a completely custom code.

Huh? This line of reasoning makes no sense. Again consider where the
breakage is: it's only for INVALID input, not for valid input. This is
unlikely to be noticed by anyone competent to file a bug report
because someone competent won't be passing invalid UTF-8 in normal
usage. Regardless, I already DID report bugs in GET_UTF8 (from
RTFS'ing not from experience encountering them :) and the response I
got was essentially a "WONTFIX" even though I sent patches...

> Btw.: If that is the reason for concern I will happily clearly note
> that av_check_utf8 should _never_ be used for security-critical checks,
> it is only to warn the user if e.g. a command-line string that should be
> in UTF-8 is not.

This is silly. Warning is not acceptable; if a file format specifies
that it stores UTF-8 then lavf MUST NOT store data into it unless the
data is well-formed UTF-8. WAAAY too many users have wrong locale
settings, etc. and it's incredibly irresponsible to be one of the
broken programs that's ignoring this reality and generating invalid
files.

Validating UTF-8 is trivial. Again see the ABNF. If you don't want to
write the code I'll write it...
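
To give an idea of the size, here is a minimal sketch of a validator that
follows the RFC 3629 ABNF directly, with no decoding. The name
utf8_is_valid and its signature are made up for illustration; this is not
the av_check_utf8 from the patch and not existing lavu code. Only the first
tail byte has a range that depends on the lead byte (that is what rules out
overlong forms, surrogates, and anything above U+10FFFF); every other tail
byte is simply 0x80-0xBF.

```c
#include <stddef.h>

/* Hypothetical helper (not part of FFmpeg): returns 1 if the len bytes at s
 * are well-formed UTF-8 according to the RFC 3629 ABNF, 0 otherwise. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i++];
        /* Allowed range for the FIRST tail byte; narrowed for some leads. */
        unsigned char lo = 0x80, hi = 0xBF;
        int tails;

        if (c <= 0x7F) {
            continue;                   /* UTF8-1 */
        } else if (c >= 0xC2 && c <= 0xDF) {
            tails = 1;                  /* UTF8-2 (0xC0, 0xC1 are overlong) */
        } else if (c >= 0xE0 && c <= 0xEF) {
            tails = 2;                  /* UTF8-3 */
            if (c == 0xE0) lo = 0xA0;   /* reject overlong 3-byte forms */
            if (c == 0xED) hi = 0x9F;   /* reject surrogates U+D800..U+DFFF */
        } else if (c >= 0xF0 && c <= 0xF4) {
            tails = 3;                  /* UTF8-4 */
            if (c == 0xF0) lo = 0x90;   /* reject overlong 4-byte forms */
            if (c == 0xF4) hi = 0x8F;   /* reject values above U+10FFFF */
        } else {
            return 0;                   /* stray tail byte or invalid lead */
        }

        if (len - i < (size_t)tails)
            return 0;                   /* truncated sequence */
        if (s[i] < lo || s[i] > hi)
            return 0;
        i++;
        while (--tails) {               /* remaining tails: plain 0x80..0xBF */
            if (s[i] < 0x80 || s[i] > 0xBF)
                return 0;
            i++;
        }
    }
    return 1;
}
```

A muxer could call something like this on each metadata string before
writing it and refuse the string when it returns 0.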

Rich



