[FFmpeg-devel] Pipeline: H.264 speed improvements

Tue Dec 23 10:08:26 CET 2008

I've put together a list of all the possible speed improvements I can
see, including both obvious and non-obvious ones.  If you're
interested in implementing anything here, say so to make sure your
work isn't duplicated by Michael or me.  Also feel free to discuss some
of the more nutty ideas, like the VLC table, or tell me that I'm wrong
about something.

Non-assembly stuff:

Fix filter_mb_fast: figure out exactly which cases it fails on, don't
call it for those, and enable it.  The cavlc/8x8dct/deblock stuff
should be moved out of filter_mb since otherwise it would be
duplicated in both filter_mb and filter_mb_fast.

Deblock logic can almost surely be made cleaner.  For example,
separate deblock_intra calls from deblock_inter like x264 does, to
avoid the branching (if bs < 4) inside filter_edge.

Write a unified VLC table for CAVLC level decoding, much like
x264 recently implemented for encoding.  A full table isn't worth the
memory cost; a small one is enough.  This would probably work like this:
1.  show_bits X bits, where X is log2(VLC table size).
2.  If (prefix length + suffix length) > X, do an escaped read.  Since
the entire point of this is to avoid calculating the prefix length,
the best way to detect this is to store the prefix length *along
with* the value of the levelcode in the VLC table, or even better,
simply store a coeff of "0" in the table whenever an escape is
needed: a level of zero is never actually coded, so a zero entry
can't be confused with a real one.
3.  Skip_bits the actual number of bits in the code, or don't skip any
bits if the escape has to be called.
4.  So we can do:
    coeff = VLCtable[suffix_length][inputvalue].coeff;
    if(!coeff) { /* escaped read */ }
    else skip_bits(VLCtable[suffix_length][inputvalue].bits);
Coeff can probably be an 8-bit signed value as long as X <= 8, making
this a really compact table (16 bits per entry).
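
A rough sketch of steps 1-4, assuming X = 8.  The BitReader type and the
br_show_bits/read_level_fast/init_level_tab names are made up for
illustration and are not lavc's real bit reader API.  With an 8-bit window
only short prefixes fit, so the long-prefix special cases of the level
code all land on the escape path.  The adjustment for the first
coefficient when there are fewer than 3 trailing ones is left to the
caller.

```c
#include <assert.h>
#include <stdint.h>

#define TAB_BITS   8                 /* X from the text; 8 is an assumed choice */
#define TAB_SIZE   (1 << TAB_BITS)
#define MAX_SUFFIX 6                 /* suffix_length ranges over 0..6 in CAVLC */

typedef struct LevelEntry {
    int8_t  coeff;                   /* decoded level; 0 means "take the escape path" */
    uint8_t bits;                    /* code length to skip when coeff != 0 */
} LevelEntry;

static LevelEntry level_tab[MAX_SUFFIX + 1][TAB_SIZE];

/* Hypothetical MSB-first bit reader, standing in for the real one. */
typedef struct BitReader {
    const uint8_t *buf;
    int size_bits;
    int pos;
} BitReader;

/* Peek at the next n bits without consuming them; reads past the end give 0. */
static unsigned br_show_bits(const BitReader *br, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++) {
        int p = br->pos + i;
        unsigned bit = p < br->size_bits
                     ? (br->buf[p >> 3] >> (7 - (p & 7))) & 1u : 0u;
        v = (v << 1) | bit;
    }
    return v;
}

static void init_level_tab(void)
{
    for (int s = 0; s <= MAX_SUFFIX; s++) {
        for (int v = 0; v < TAB_SIZE; v++) {
            LevelEntry *e = &level_tab[s][v];
            int prefix = 0, len, code;

            /* level_prefix = number of zero bits before the first 1 */
            while (prefix < TAB_BITS && !(v & (1 << (TAB_BITS - 1 - prefix))))
                prefix++;
            len = prefix + 1 + s;    /* zeros + marker bit + suffix bits */
            if (len > TAB_BITS) {    /* does not fit in the window: escape */
                e->coeff = 0;
                e->bits  = 0;
                continue;
            }
            code = (prefix << s) | ((v >> (TAB_BITS - len)) & ((1 << s) - 1));
            /* even level codes map to positive levels, odd ones to negative */
            e->coeff = (int8_t)((code & 1) ? -((code + 1) >> 1) : (code + 2) >> 1);
            e->bits  = (uint8_t)len;
        }
    }
}

/* Returns the level, or 0 (nothing consumed) if the slow path is needed. */
static int read_level_fast(BitReader *br, int suffix_length)
{
    const LevelEntry *e;
    assert(suffix_length >= 0 && suffix_length <= MAX_SUFFIX);
    e = &level_tab[suffix_length][br_show_bits(br, TAB_BITS)];
    if (e->coeff)
        br->pos += e->bits;
    return e->coeff;
}
```

The table here is 7 * 256 * 2 bytes = 3584 bytes.  Whether X = 8 is the
right tradeoff would have to be measured.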

av_log2 is unnecessarily powerful for use in h264.c.  All signed
golomb values in H.264 fit in 16 bits, and all unsigned golomb values
other than headers fit in 8 bits.  Thus all ordinary unsigned golomb
code reads can literally be put in a 256-byte VLC table and replaced
with a single array lookup.
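
One possible reading of this, as a sketch: peek 8 bits and index a
256-byte table whose entries pack the value and the code length into one
byte.  This only covers codes of up to 7 bits (values 0..14); anything
longer falls back to a bit-by-bit read.  The packing, the BitReader type
and all function names are assumptions for illustration, not existing
lavc code.

```c
#include <assert.h>
#include <stdint.h>

/* Entry layout (assumed): (value << 4) | code_length; 0 = code too long. */
static uint8_t ue_tab[256];

/* Hypothetical MSB-first bit reader, standing in for the real one. */
typedef struct BitReader {
    const uint8_t *buf;
    int size_bits;
    int pos;
} BitReader;

/* Peek at the next n bits without consuming them; reads past the end give 0. */
static unsigned br_show_bits(const BitReader *br, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++) {
        int p = br->pos + i;
        unsigned bit = p < br->size_bits
                     ? (br->buf[p >> 3] >> (7 - (p & 7))) & 1u : 0u;
        v = (v << 1) | bit;
    }
    return v;
}

static void init_ue_tab(void)
{
    for (int v = 0; v < 256; v++) {
        int zeros = 0, len, value;

        while (zeros < 8 && !(v & (0x80 >> zeros)))
            zeros++;
        len = 2 * zeros + 1;         /* zeros + marker bit + zeros info bits */
        if (len > 8) {
            ue_tab[v] = 0;           /* needs the slow path */
            continue;
        }
        value = (v >> (8 - len)) - 1; /* top len bits read as a number, minus 1 */
        assert(value < 16);
        ue_tab[v] = (uint8_t)((value << 4) | len);
    }
}

static unsigned get_ue(BitReader *br)
{
    unsigned e = ue_tab[br_show_bits(br, 8)];
    unsigned suffix;
    int zeros = 0;

    if (e) {                         /* fast path: one lookup, no log2 */
        br->pos += e & 15;
        return e >> 4;
    }
    /* slow path: count zeros one bit at a time */
    while (zeros < 31 && br->pos < br->size_bits && !br_show_bits(br, 1)) {
        zeros++;
        br->pos++;
    }
    br->pos++;                       /* skip the marker bit */
    suffix = br_show_bits(br, zeros);
    br->pos += zeros;
    return (1u << zeros) - 1 + suffix;
}
```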

Assembly stuff:

Port the SSE2 iDCT from x264.

Write an SSE2 iDCT_DC and use it for pure-DC blocks, like an i16x16
with no AC residual.
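
For reference, a scalar sketch of what a DC-only inverse transform
computes: with no AC coefficients the 4x4 iDCT collapses to adding one
rounded constant to every pixel.  The function names and the raster-order
dc[16] layout for the 16x16 case are assumptions for illustration; an
SSE2 version would have to match this output.

```c
#include <stdint.h>

static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Add a single (already dequantized) DC coefficient to a 4x4 block.
 * Assumes >> on negative ints is an arithmetic shift. */
static void idct_dc_add_4x4(uint8_t *dst, int stride, int dc_coeff)
{
    int dc = (dc_coeff + 32) >> 6;   /* same rounding as the full 4x4 iDCT */

    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] = clip_uint8(dst[y * stride + x] + dc);
}

/* DC-only 16x16 macroblock: sixteen 4x4 blocks, each with its own DC.
 * dc[] is taken in raster order here, which is an assumed layout. */
static void idct_dc_add_16x16(uint8_t *dst, int stride, const int16_t dc[16])
{
    for (int by = 0; by < 4; by++)
        for (int bx = 0; bx < 4; bx++)
            idct_dc_add_4x4(dst + 4 * by * stride + 4 * bx, stride,
                            dc[4 * by + bx]);
}
```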

Modify the MMX iDCT_DC to do an 8x8 block instead of 4x4 and use it to
handle the (extremely common) case of DC-only chroma.  This should be
guaranteed beneficial as the case of DC-only chroma can be derived
from the cbp, requiring no extra logic.
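
A scalar sketch of the idea, assuming the cbp keeps the spec's
coded_block_pattern layout (luma in the low 4 bits, chroma in the next 2,
where 1 means DC only and 2 means DC plus AC).  The function names and
the raster order of the four DC values are assumptions for illustration.

```c
#include <stdint.h>

static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* True when the chroma residual consists of DC coefficients only. */
static int chroma_is_dc_only(int cbp)
{
    return (cbp & 0x30) == 0x10;
}

/* Add four DC values to the four 4x4 quadrants of an 8x8 chroma block.
 * dc[] holds already-transformed, dequantized DC coefficients in raster
 * order (assumed).  Assumes >> on negative ints is an arithmetic shift. */
static void chroma_dc_add_8x8(uint8_t *dst, int stride, const int16_t dc[4])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int add = (dc[2 * (y >> 2) + (x >> 2)] + 32) >> 6;
            dst[y * stride + x] = clip_uint8(dst[y * stride + x] + add);
        }
    }
}
```

A caller would test chroma_is_dc_only(cbp) once per macroblock and take
this path for each chroma plane instead of four separate 4x4 calls.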

Modify the x264 dequant_dc to work for i16x16 luma DC and port it to
lavc along with the MMX inverse luma transform.

Write an SSE2 version of weight and biweight.  pmulhrsw might be
usable for weight (not sure if pmaddubsw is better), allowing for an
SSSE3 version.  x264 has an SSSE3 biweight using pmaddubsw that can be
ported as well (in addition to SSE2/MMX versions using pmul).
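
For reference, a scalar sketch of the explicit weighted prediction
formulas that any SIMD version has to reproduce bit-exactly.  The
function names and argument order are made up for illustration and do not
match the existing lavc function signatures.

```c
#include <stdint.h>

static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Unidirectional weight, applied in place.
 * Assumes >> on negative ints is an arithmetic shift. */
static void weight_pixels(uint8_t *dst, int stride, int w, int h,
                          int log2_denom, int weight, int offset)
{
    int round = log2_denom ? 1 << (log2_denom - 1) : 0;

    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int p = dst[y * stride + x];
            dst[y * stride + x] =
                clip_uint8(((p * weight + round) >> log2_denom) + offset);
        }
}

/* Bidirectional weight: dst holds the list 0 prediction on entry,
 * src holds the list 1 prediction; the result is written to dst. */
static void biweight_pixels(uint8_t *dst, const uint8_t *src, int stride,
                            int w, int h, int log2_denom,
                            int w0, int w1, int o0, int o1)
{
    int round  = 1 << log2_denom;
    int offset = (o0 + o1 + 1) >> 1;

    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int p0 = dst[y * stride + x];
            int p1 = src[y * stride + x];
            dst[y * stride + x] =
                clip_uint8(((p0 * w0 + p1 * w1 + round) >> (log2_denom + 1))
                           + offset);
        }
}
```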

Port all of x264's intra prediction asm.

Dark Shikari