VP8 decoder optimization status

Jason Garrett-Glaser darkshikari
Tue Jun 29 04:09:04 CEST 2010

Here's a rough guide to what's done and what needs to be done before
ffmpeg's VP8 decoder is as fast as a politician running away from an
ethics committee.

x86 asm:

Done:
6-tap motion compensation
bilinear motion compensation
DC-only iDCT
luma DC WHT
i16x16 intra pred
i4x4 intra pred (V, DC, TM)

Not done yet:
Normal loopfilter
Simple loopfilter
regular iDCT (patch by Ronald is on the ML)
i4x4 intra pred (DDL, DDR, VR, HD, VL, HU)

ARM/PPC asm: nothing done yet


General decoder work, not done yet:

Fully convert the vp5/6/7/8 arithmetic coder to bytestream: eliminate the
looped renormalization.
Port all of x264's and ffh264's optimizations once the above is done
(since they'll then be relevant).
Convert the vp5/6/7/8 arithmetic coder to use a larger cache size (maybe
16-bit or 32-bit?) for fewer bytestream reads.
Optimize decode_block_coeffs (it can surely be made faster).
Improve edge emulation handling. We currently have the worst of both
worlds: we require padding on the edges, yet we still use the slow
ff_emulated_edge_mc. We should pick one method or the other.
Optimize cache handling (mvs and nnz).
Optimize MV prediction.
Probably lots of other stuff I haven't thought of; feel free to
contribute ideas.
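
To make the first and third items concrete, here's a rough sketch of the
difference between bit-at-a-time renormalization and a table-driven shift
with a 16-bit refill. This is not the vp56 range coder code from the tree;
LoopDec, FastDec, norm_shift and the function names are all made up for
illustration, and the sketch assumes a well-formed stream (first byte not
0xFF), so the top byte of the window always stays below the range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one byte, or 0 once the input is exhausted. */
static uint32_t next_byte(const uint8_t **p, const uint8_t *end)
{
    return *p < end ? *(*p)++ : 0;
}

/* Baseline decoder: renormalizes one bit per loop iteration. */
typedef struct {
    const uint8_t *buf, *end;
    uint32_t value;   /* 16-bit window; the top byte is compared to split */
    uint32_t range;   /* 128..255 between calls */
    int bit_count;    /* bits shifted out of the low byte so far (0..7) */
} LoopDec;

static void loop_init(LoopDec *d, const uint8_t *data, size_t len)
{
    d->buf = data;
    d->end = data + len;
    d->value  = next_byte(&d->buf, d->end) << 8;
    d->value |= next_byte(&d->buf, d->end);
    d->range = 255;
    d->bit_count = 0;
}

static int loop_get(LoopDec *d, int prob)
{
    uint32_t split = 1 + (((d->range - 1) * (uint32_t)prob) >> 8);
    int bit = d->value >= (split << 8);
    if (bit) {
        d->range -= split;
        d->value -= split << 8;
    } else {
        d->range = split;
    }
    while (d->range < 128) {          /* the looped renormalization */
        d->value <<= 1;
        d->range <<= 1;
        if (++d->bit_count == 8) {
            d->bit_count = 0;
            d->value |= next_byte(&d->buf, d->end);
        }
    }
    return bit;
}

/* norm_shift[r] = left shifts needed to bring r (1..255) up to >= 128. */
static uint8_t norm_shift[256];

static void init_norm_shift(void)
{
    for (int r = 1; r < 256; r++) {
        int s = 0;
        while ((r << s) < 128)
            s++;
        norm_shift[r] = (uint8_t)s;
    }
}

/* Loop-free variant: one table lookup, one shift, 16-bit refills. */
typedef struct {
    const uint8_t *buf, *end;
    uint32_t value;   /* 32-bit window, stream bits left-aligned at bit 31 */
    uint32_t range;   /* 128..255 between calls */
    int count;        /* valid bits in the window minus 8; < 0 means refill */
} FastDec;

static void fast_init(FastDec *d, const uint8_t *data, size_t len)
{
    d->buf = data;
    d->end = data + len;
    d->value = 0;
    d->range = 255;
    d->count = -8;    /* empty window: the first fast_get() refills */
}

static int fast_get(FastDec *d, int prob)
{
    uint32_t split = 1 + (((d->range - 1) * (uint32_t)prob) >> 8);
    uint32_t bigsplit = split << 24;
    int bit, shift;

    if (d->count < 0) {               /* fewer than 8 valid bits left */
        uint32_t hi = next_byte(&d->buf, d->end);
        uint32_t lo = next_byte(&d->buf, d->end);
        assert(d->count >= -8);
        d->value |= ((hi << 8) | lo) << (8 - d->count);
        d->count += 16;
    }
    bit = d->value >= bigsplit;
    if (bit) {
        d->range -= split;
        d->value -= bigsplit;
    } else {
        d->range = split;
    }
    shift = norm_shift[d->range];     /* replaces the while loop */
    d->range <<= shift;
    d->value <<= shift;
    d->count -= shift;
    return bit;
}
```

Call init_norm_shift() once before using fast_get(). Fed the same buffer
and the same probabilities, both decoders should return the same bits; the
second one just does a single shift per symbol and touches the bytestream
once every 16 bits instead of once every 8.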

The current top priority for x86 speed is far and away the Normal
loopfilter -- it accounts for something like 60-70% or more of the total
decoding time, since we've SIMD-optimized nearly everything else of note.
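
On the edge emulation item in the list above: for anyone unfamiliar with
the term, here's a minimal sketch of what emulating an edge means in
general. It is not the actual ff_emulated_edge_mc code; emulate_edge_block
and block_inside are hypothetical names, and a real implementation would
copy whole rows rather than clamp every pixel.

```c
#include <assert.h>
#include <stdint.h>

static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Nonzero if the bw x bh block at (x, y) lies fully inside the w x h
 * plane, in which case motion compensation can read the plane directly. */
static int block_inside(int w, int h, int x, int y, int bw, int bh)
{
    return x >= 0 && y >= 0 && x + bw <= w && y + bh <= h;
}

/* Copy the bw x bh block at (x, y) into dst, replicating the nearest
 * edge pixel for every coordinate that falls outside the plane. */
static void emulate_edge_block(uint8_t *dst, int dst_stride,
                               const uint8_t *src, int src_stride,
                               int w, int h, int x, int y, int bw, int bh)
{
    assert(w > 0 && h > 0);
    for (int j = 0; j < bh; j++) {
        const uint8_t *row = src + clamp_int(y + j, 0, h - 1) * src_stride;
        for (int i = 0; i < bw; i++)
            dst[j * dst_stride + i] = row[clamp_int(x + i, 0, w - 1)];
    }
}
```

The other method is to pad the reference frame far enough that no motion
vector can ever point outside it, so a copy like this is never needed.
Doing both means paying for the padding and still paying for the copy.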

Dark Shikari

