[FFmpeg-devel] [BUG] UTF-8 decoder vulnerable to character spoofing attacks

Tue Oct 23 05:44:40 CEST 2007

On Mon, 22 Oct 2007, Rich Felker wrote:
> On Mon, Oct 22, 2007 at 07:15:35PM +0200, Reimar Döffinger wrote:
> 
>> Well, I always thought those bugs are due to extremely bad practices in
>> checking data. At least I always considered UTF-8 as a method of
>> compressing 32 bit data.
> 
> UTF-8 is not a compression algorithm. It's a character encoding. This
> is like the first FAQ (or rather frequent pitfall) about UTF-8.

What is the distinction? As a multimedia developer, "compression" and 
"encoding" mean the same to me.
Unicode is a set of symbols (in the information theory sense, not to be 
confused with "glyphs"), with a meaning attached to each. UTF-8 is an 
assignment of a bitstring (between 1 and 4 bytes) to each symbol. That 
sounds like the very definition of Variable Length Coding, i.e. 
compression.
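
To make the variable-length point concrete, here is a small Python sketch. The sample characters and the helper name utf8_length are chosen purely for illustration; the only library call used is the standard str.encode. It shows that the number of bytes depends only on the numeric range of the code point.

```python
# Sketch: how many bytes UTF-8 assigns to code points in different ranges.
# The sample characters below are arbitrary illustrations.
samples = ["A", "\u00f6", "\u20ac", "\U0001d11e"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex()}")
```
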
UTF-8 has additional goals, like mapping each character string to a unique 
bit string. But that has nothing to do with whether it compresses. For 
example, FFV1 also leaves no decisions up to the encoder (beyond a couple 
of variants selected in the header).
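
The uniqueness goal is where the spoofing concern in the subject line comes from. The following Python sketch assumes a hypothetical lenient decoder (naive_decode_2byte, written here only for illustration and not taken from FFmpeg) and compares it with Python's built-in strict UTF-8 decoder on an overlong two-byte form of "/".

```python
def naive_decode_2byte(data: bytes) -> int:
    """Hypothetical lenient decoder for a 2-byte sequence.

    It extracts the payload bits and does NOT check for overlong forms.
    """
    b0, b1 = data
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_accepts(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xC0 0xAF is an overlong (non-shortest) encoding of U+002F, "/".
overlong_slash = bytes([0xC0, 0xAF])

print(hex(naive_decode_2byte(overlong_slash)))  # lenient decoder yields 0x2f
print(strict_accepts(overlong_slash))           # strict decoder rejects it
print(strict_accepts(b"/"))                     # the shortest form is accepted
```

A decoder that accepts both byte strings for the same character breaks the one-to-one mapping, so a byte-level check for "/" performed before decoding could be bypassed.
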

--Loren Merritt