[FFmpeg-devel] [RFC] function to check for valid UTF-8 string
Reimar Döffinger
Reimar.Doeffinger
Mon Dec 10 20:38:56 CET 2007
Hello,
On Mon, Dec 10, 2007 at 11:33:57AM -0500, Rich Felker wrote:
> On Mon, Dec 10, 2007 at 11:01:59AM -0500, Rich Felker wrote:
> > Validating UTF-8 is trivial. Again see the ABNF. If you don't want to
> > write the code I'll write it...
>
> I just wrote (or rather adapted from libc) the code but I don't have
> time to check for mistakes at the moment and I don't feel like being
> ridiculed for any silly errors I made. I'll post it later once I
> reread and test it.
I'm interested to see what you did, but I think I will clearly win the
ugliness contest with this (I doubt I want this to actually be used,
though it is interesting that gcc actually manages to unroll the inner
loop with -O3 and replaces the array lookups with and/cmp against constants):
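#include <stdint.h>
#include <stddef.h>

/* Returns NULL if the string is valid, else a pointer to the start of the
 * first bad sequence. Index i = 3..0 selects the 1..4 byte pattern. */
const char *check_utf8(const char *in) {
    /* masks/vals: required lead and continuation marker bits */
    static const uint32_t masks[] = {0xf8c0c0c0, 0xf0c0c0, 0xe0c0, 0x80};
    static const uint32_t vals[] = {0xf0808080, 0xe08080, 0xc080, 0x00};
    /* anymasks: payload bits of which one must be set (rejects overlong forms) */
    static const uint32_t anymasks[] = {0x07300000, 0x0f2000, 0x1e00, 0x7f};
    const uint8_t *str = (const uint8_t *)in;
    while (*str) {
        long i = 3;
        uint32_t v = *str++;
        /* append bytes until v matches the pattern for its length */
        while ((v & masks[i]) != vals[i] || !(v & anymasks[i])) {
            if (--i < 0 || !*str) return (const char *)(str - 3 + i);
            v = (v << 8) | *str++;
        }
    }
    return NULL;
}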
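For comparison, here is a rough sketch of a more conventional validator that walks the byte ranges from the RFC 3629 ABNF directly. It is not the code from libc mentioned above and not anything posted in this thread; the name utf8_first_invalid is made up for illustration. As far as I can tell, the mask version accepts encoded surrogates (ED A0..BF xx) and lead bytes beyond F4 90, since its masks only test for overlong forms, while this sketch rejects both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original post: returns a pointer to the
 * first byte of the first invalid sequence, or NULL if the whole
 * NUL-terminated string is well-formed UTF-8 per the RFC 3629 ABNF. */
const char *utf8_first_invalid(const char *in)
{
    const uint8_t *s = (const uint8_t *)in;
    assert(in != NULL);

    while (*s) {
        uint8_t lo = 0x80, hi = 0xBF; /* allowed range for the second byte */
        int tail;                     /* number of continuation bytes */
        int k;

        if (*s < 0x80) {              /* ASCII */
            s++;
            continue;
        } else if (*s >= 0xC2 && *s <= 0xDF) {
            tail = 1;                 /* C0 and C1 could only be overlong */
        } else if (*s >= 0xE0 && *s <= 0xEF) {
            tail = 2;
            if (*s == 0xE0) lo = 0xA0; /* reject overlong 3-byte forms */
            if (*s == 0xED) hi = 0x9F; /* reject surrogates D800..DFFF */
        } else if (*s >= 0xF0 && *s <= 0xF4) {
            tail = 3;
            if (*s == 0xF0) lo = 0x90; /* reject overlong 4-byte forms */
            if (*s == 0xF4) hi = 0x8F; /* reject anything above U+10FFFF */
        } else {
            return (const char *)s;   /* stray continuation byte or F5..FF */
        }

        /* A NUL terminator fails these range checks, so a truncated
         * sequence is reported without reading past the end. */
        if (s[1] < lo || s[1] > hi)
            return (const char *)s;
        for (k = 2; k <= tail; k++)
            if (s[k] < 0x80 || s[k] > 0xBF)
                return (const char *)s;
        s += tail + 1;
    }
    return NULL;
}
```

Like check_utf8, it returns NULL for a valid string and otherwise points at the start of the offending sequence, so the two can be swapped in a caller to compare results.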