[FFmpeg-devel] [RFC] function to check for valid UTF-8 string

Rich Felker dalias
Tue Dec 11 05:43:57 CET 2007


On Mon, Dec 10, 2007 at 08:38:56PM +0100, Reimar Döffinger wrote:
> Hello,
> On Mon, Dec 10, 2007 at 11:33:57AM -0500, Rich Felker wrote:
> > On Mon, Dec 10, 2007 at 11:01:59AM -0500, Rich Felker wrote:
> > > Validating UTF-8 is trivial. Again see the ABNF. If you don't want to
> > > write the code I'll write it...
> > 
> > I just wrote (or rather adapted from libc) the code but I don't have
> > time to check for mistakes at the moment and I don't feel like being
> > ridiculed for any silly errors I made. I'll post it later once I
> > reread and test it.
> 
> I'm interested to see what you did, but I think I will clearly win the
> ugliness contest with this (I doubt I want this to actually be used,
> though it is interesting that gcc actually manages to unroll the inner
> loop with -O3 and replaces the arrays by and/cmp with constants):
> 
> const char *check_utf8(const char *in) {
>     static const uint32_t masks[]    = {0xf8c0c0c0, 0xf0c0c0, 0xe0c0, 0x80};
>     static const uint32_t vals[]     = {0xf0808080, 0xe08080, 0xc080, 0x00};
>     static const uint32_t anymasks[] = {0x07300000, 0x0f2000, 0x1e00, 0x7f};
>     const uint8_t *str = in;
>     while (*str) {
>         long i = 3;
>         uint32_t v = *str++;
>         while ((v & masks[i]) != vals[i] || !(v & anymasks[i])) {
>             if (--i < 0 || !*str) return str - 3 + i;
>             v = (v << 8) | *str++;
>         }
>     }
>     return NULL;
> }

Your code is incorrect. It accepts "ed a0 80" (the surrogate U+D800)
and "f5 80 80 80" (U+140000, beyond the U+10FFFF limit) as valid,
contrary to the definition of UTF-8, just like the buggy UTF-8
decoder already in ffmpeg.
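
To see why those two sequences must be rejected, here is a minimal
sketch that decodes a single sequence with no range checks and then
tests whether the result is a Unicode scalar value. The names
naive_decode and is_scalar_value are hypothetical, made up for this
illustration; they are not ffmpeg functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode ONE sequence by stripping the length and
 * continuation marker bits, with no validity checks at all. The caller
 * must supply as many bytes as the lead byte announces. */
uint32_t naive_decode(const unsigned char *s)
{
	assert(s != NULL);
	if (s[0] < 0x80)
		return s[0];
	if ((s[0] & 0xe0) == 0xc0)
		return (uint32_t)(s[0] & 0x1f) << 6 | (s[1] & 0x3f);
	if ((s[0] & 0xf0) == 0xe0)
		return (uint32_t)(s[0] & 0x0f) << 12
		     | (uint32_t)(s[1] & 0x3f) << 6 | (s[2] & 0x3f);
	return (uint32_t)(s[0] & 0x07) << 18 | (uint32_t)(s[1] & 0x3f) << 12
	     | (uint32_t)(s[2] & 0x3f) << 6 | (s[3] & 0x3f);
}

/* A decoded value is a Unicode scalar value only if it is not a
 * surrogate (U+D800..U+DFFF) and does not exceed U+10FFFF. */
int is_scalar_value(uint32_t c)
{
	return c <= 0x10ffff && (c < 0xd800 || c > 0xdfff);
}
```

With this sketch, "ed a0 80" decodes to 0xd800 and "f5 80 80 80" to
0x140000, and is_scalar_value() rejects both.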

Here is my implementation. Feel free to optimize as long as you keep
it correct:

int is_valid_utf8(const unsigned char *s)
{
	/* index into bounds[] for each lead byte 0xe0 through 0xf4;
	 * only 0xe0, 0xed, 0xf0 and 0xf4 restrict the second byte */
	static const unsigned char bmap[] = {
		1, 0, 0, 0, 0, 0, 0, 0,
		0, 0, 0, 0, 0, 2, 0, 0,
		3, 0, 0, 0, 4
	};
	/* valid ranges for the second byte, in the form { start, length } */
	static const unsigned char bounds[][2] = {
		{ 0x80, 0x40 }, { 0xa0, 0x20 }, { 0x80, 0x20 },
		{ 0x90, 0x30 }, { 0x80, 0x10 }
	};
	unsigned b, i;
	unsigned char k;

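	while ((b=*s++)) {
		if (b < 0x80) continue; /* ASCII */
		/* only 0xc2 through 0xf4 may start a multibyte sequence */
		else if (b - 0xc2 > 0xf4 - 0xc2) return 0;
		k = b << 1; /* remaining length bits of the lead byte */
		if (b < 0xe0) i = 0;
		else i = bmap[b-0xe0];
		/* second byte: allowed range depends on the lead byte */
		if ((unsigned)*s++ - bounds[i][0] >= bounds[i][1]) return 0;
		/* one more byte in 0x80..0xbf per length bit still set */
		while (((k<<=1) & 0x80))
			if ((unsigned)*s++ - 0x80 >= 0x40) return 0;
	}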
	return 1;
}
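
For cross-checking, the same rules can also be written out more
verbosely, one branch per production of the RFC 3629 ABNF. This is
only a sketch under that assumption: utf8_valid_abnf and in_range are
hypothetical names, and the code is meant as a readable reference to
compare against, not as a replacement for the table-driven version.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: is c within [lo, hi]? */
static int in_range(unsigned c, unsigned lo, unsigned hi)
{
	return c >= lo && c <= hi;
}

/* Hypothetical name. Validates a NUL-terminated string against the
 * RFC 3629 ABNF. Returns 1 if valid, 0 otherwise. */
int utf8_valid_abnf(const unsigned char *s)
{
	unsigned b, lo, hi, tails;

	assert(s != NULL);
	while ((b = *s++)) {
		lo = 0x80; hi = 0xbf;              /* default second-byte range */
		if (b < 0x80) {
			continue;                      /* UTF8-1 */
		} else if (in_range(b, 0xc2, 0xdf)) {
			tails = 0;                     /* UTF8-2 */
		} else if (in_range(b, 0xe0, 0xef)) {
			tails = 1;                     /* UTF8-3 */
			if (b == 0xe0) lo = 0xa0;      /* no overlong forms */
			if (b == 0xed) hi = 0x9f;      /* no surrogates */
		} else if (in_range(b, 0xf0, 0xf4)) {
			tails = 2;                     /* UTF8-4 */
			if (b == 0xf0) lo = 0x90;      /* no overlong forms */
			if (b == 0xf4) hi = 0x8f;      /* nothing above U+10FFFF */
		} else {
			return 0;                      /* 80..c1 and f5..ff never lead */
		}
		/* A NUL here is out of range, so we never read past the end. */
		if (!in_range(*s++, lo, hi)) return 0;
		for (; tails; tails--)
			if (!in_range(*s++, 0x80, 0xbf)) return 0;
	}
	return 1;
}
```

Running both functions over the same inputs (for example "ed a0 80",
"f5 80 80 80", "c0 80" and a truncated "e2 82") is a cheap way to
catch a typo in either one.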

Rich



