[FFmpeg-devel] [RFC] function to check for valid UTF-8 string

Reimar Döffinger Reimar.Doeffinger
Mon Dec 10 20:38:56 CET 2007


Hello,
On Mon, Dec 10, 2007 at 11:33:57AM -0500, Rich Felker wrote:
> On Mon, Dec 10, 2007 at 11:01:59AM -0500, Rich Felker wrote:
> > Validating UTF-8 is trivial. Again see the ABNF. If you don't want to
> > write the code I'll write it...
> 
> I just wrote (or rather adapted from libc) the code but I don't have
> time to check for mistakes at the moment and I don't feel like being
> ridiculed for any silly errors I made. I'll post it later once I
> reread and test it.

I'm interested to see what you did, but I think I will clearly win the
ugliness contest with this. I doubt I want this to actually be used,
though it is interesting that gcc with -O3 manages to unroll the inner
loop and replace the array lookups with and/cmp against constants:

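#include <stddef.h>
#include <stdint.h>

/* Returns NULL if the string is valid, else a pointer to the first bad sequence. */
const char *check_utf8(const char *in) {
    /* index 3 = 1-byte form, 2 = 2-byte, 1 = 3-byte, 0 = 4-byte */
    static const uint32_t masks[]    = {0xf8c0c0c0, 0xf0c0c0, 0xe0c0, 0x80};
    static const uint32_t vals[]     = {0xf0808080, 0xe08080, 0xc080, 0x00};
    /* at least one of these bits must be set, which rejects overlong forms */
    static const uint32_t anymasks[] = {0x07300000, 0x0f2000, 0x1e00, 0x7f};
    const uint8_t *str = (const uint8_t *)in;
    while (*str) {
        long i = 3;
        uint32_t v = *str++;
        while ((v & masks[i]) != vals[i] || !(v & anymasks[i])) {
            /* after the decrement, str is 3 - i bytes past the sequence start */
            if (--i < 0 || !*str) return (const char *)(str - 3 + i);
            v = (v << 8) | *str++;
        }
    }
    return NULL;
}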
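
For readers who find the table lookup hard to follow, here is a rough
hand-unrolled sketch of the same checks, one block per sequence length.
It is only meant to illustrate the logic and is not gcc's actual output;
the name check_utf8_unrolled and the local variable start are made up
for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-unrolled variant of the table-driven loop.
 * Returns NULL if every sequence passes, else the start of the first bad one. */
const char *check_utf8_unrolled(const char *in)
{
    const uint8_t *str = (const uint8_t *)in;
    while (*str) {
        const uint8_t *start = str;   /* remembered for error reporting */
        uint32_t v = *str++;

        /* 1 byte: 0xxxxxxx (v is nonzero here, so no extra test is needed) */
        if ((v & 0x80) == 0x00)
            continue;

        if (!*str) return (const char *)start;
        v = (v << 8) | *str++;
        /* 2 bytes: 110xxxxx 10xxxxxx, payload bits 0x1e00 reject C0/C1 leads */
        if ((v & 0xe0c0) == 0xc080 && (v & 0x1e00))
            continue;

        if (!*str) return (const char *)start;
        v = (v << 8) | *str++;
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx, rejects overlong E0 80..9F */
        if ((v & 0xf0c0c0) == 0xe08080 && (v & 0x0f2000))
            continue;

        if (!*str) return (const char *)start;
        v = (v << 8) | *str++;
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, rejects overlong F0 80..8F */
        if ((v & 0xf8c0c0c0) == 0xf0808080 && (v & 0x07300000))
            continue;

        return (const char *)start;   /* no form matched */
    }
    return NULL;
}
```

Written out this way it is easier to see that these masks only reject
malformed and overlong sequences. Encoded surrogates (ED A0 80 and up)
and lead bytes F5..F7 still pass, so both versions are laxer than the
RFC 3629 ABNF.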



