[FFmpeg-devel] [BUG] UTF-8 decoder vulnerable to character spoofing attacks

Rich Felker dalias
Tue Oct 23 06:42:20 CEST 2007

On Mon, Oct 22, 2007 at 09:44:40PM -0600, Loren Merritt wrote:
> On Mon, 22 Oct 2007, Rich Felker wrote:
> >On Mon, Oct 22, 2007 at 07:15:35PM +0200, Reimar Döffinger wrote:
> >
> >>Well, I always thought those bugs are due to extremely bad practices in
> >>checking data. At least I always considered UTF-8 as a method of
> >>compressing 32 bit data.
> >
> >UTF-8 is not a compression algorithm. It's a character encoding. This
> >is like the first FAQ (or rather frequent pitfall) about UTF-8.
> What is the distinction? As a multimedia developer, "compression" and 
> "encoding" mean the same to me.

Compression is a method of reducing the size of data for storage or
transmission.

Encoding is agreeing upon a meaning for symbols so that they can be
generated and interpreted unambiguously by the sender and recipient.

ASCII is an encoding. Latin-1 is an encoding. ShiftJIS is an encoding.
UCS-4 is an encoding. None of these have anything to do with
compression; they're about assigning meaning to symbols. You can view
them as compression if you think of the symbols as coming from some
larger set of symbols, but that's misleading since these encodings are
not able to represent your platonic larger set of symbols at all.

If UTF-8 is a compression, then every form of data representation is
compression, which might be true in some abstract sense but for
practical purposes just becomes a corruption of the language to the
point where "compression" no longer has any useful meaning.

> Unicode is a set of symbols (in the information theory sense, not to be 
> confused with "glyphs"), with a meaning attached to each. UTF-8 is an 
> assignment of a bitstring (between 1 and 4 bytes) to each symbol. That 
> sounds like the very definition of Variable Length Coding, i.e. 
> compression.

Compression is absolutely not a goal of UTF-8. This is reflected both
in the current documentation and in the original paper. The goal was
to create an unambiguous, consistent representation of a large set of
possible codepoints in an octet-oriented format not subject to byte
order or alignment issues, compatible with C strings, Unix filesystem
semantics, existing text-based network protocols, etc., such that it
would not have the atrociously bad issues that plagued earlier
multibyte encodings (CJK and UTF-1).
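
To make those properties concrete, here is a minimal sketch of an
encoder in C. The name utf8_encode is made up for this example and is
not taken from FFmpeg or any library, and the U+10FFFF upper limit is
my assumption (the range fixed by RFC 3629). The output is produced one
octet at a time, so byte order never enters into it, and every byte of
a multibyte sequence has its high bit set, so it cannot be mistaken for
NUL, '/', or any other ASCII byte.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only.
 * Writes the UTF-8 form of cp into out and returns the number of
 * bytes written (1..4), or 0 if cp cannot be encoded. */
static int utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {                        /* ASCII maps to itself */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* assumed upper limit */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Each code point has exactly one output here: the encoder always picks
the shortest form, which is the only form the encoding defines.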

Note that one could just as easily view the UTF-8 byte sequences as
the fundamental platonic set of codes and the UCS-4 numbers as an
aberration or transformation thereof.

> UTF-8 has additional goals, like mapping each character string to a unique 
> bit string. But that has nothing to do with compressing or not. e.g. FFV1 
> also leaves no decisions up to the encoder (beyond a couple variants 
> selected in the header).

It's largely irrelevant. The point is that certain sequences are
defined to be invalid UTF-8 and must not be used. "C0 80" is not
"overly long sequence for U+0000". It's "meaningless illegal sequence
of bytes C0 80". If your implementation happens to interpret it as
U+0000, that is a bug/aberration, not something natural, regardless of
whether certain naive implementations happen to return U+0000 when
processing the sequence.
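
For illustration, here is a minimal sketch in C of a decoder that
treats such sequences as invalid rather than quietly mapping them to a
code point. The name utf8_decode_strict and its return convention are
invented for this example; it is not FFmpeg's decoder, and the
U+10FFFF limit and surrogate check are my assumptions about what
"valid" should cover.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only.
 * Decodes one code point from s[0..len). Returns the number of bytes
 * consumed (1..4) and stores the code point in *cp, or returns -1 if
 * the bytes are not valid UTF-8. */
static int utf8_decode_strict(const uint8_t *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that may legally use an n-byte sequence. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    int n, i;

    if (len == 0)
        return -1;
    if (s[0] < 0x80)                { c = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; n = 4; }
    else
        return -1;                  /* stray continuation byte or F8..FF */

    if (len < (size_t)n)
        return -1;                  /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;              /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[n])
        return -1;                  /* overlong form, e.g. C0 80 */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return -1;                  /* out of range or surrogate */
    *cp = c;
    return n;
}
```

With this sketch, the bytes C0 80 return -1, while the single byte 00
decodes to U+0000. The min_cp comparison is the whole difference from
the naive decoders mentioned above.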

