[FFmpeg-devel] VP8 decoder optimization status
Tue Jun 29 04:48:40 CEST 2010
On Jun 28, 2010, at 10:09 PM, Jason Garrett-Glaser wrote:
> Here's a rough guide to what's done and what needs to be done before
> ffmpeg's VP8 decoder is as fast as a politician running away from an
> ethics committee.
> x86 asm:
> 6-tap motion compensation
> bilinear motion compensation
> dc-only iDCT
We could have an 8x4 or whole-macroblock iDCT like in H.264.
> luma dc WHT
> i16x16 intra pred
> i4x4 intra pred (V, DC, TM)
> Normal loopfilter
> Simple loopfilter
> regular iDCT (patch by Ronald is on ML)
> i4x4 intra pred (DDL, DDR, VR, HD, VL, HU)
> ARM/PPC asm: nothing done yet
I've been working on random bits of AltiVec and ARMv6 (not NEON yet, since there's a decent chance other people will do that).
> Fully convert vp5/6/7/8 arithmetic coder to bytestream: eliminate the
> looped renormalization.
> Port all of x264's and ffh264's optimizations once the above is done
> (since they'll now be relevant).
> Convert vp5/6/7/8 arithmetic coder to use a larger cache size (maybe
> 16-bit or 32-bit?) for fewer bytestream reads.
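A minimal C sketch of the wide-cache idea, similar in spirit to libvpx's dboolhuff (all names here are hypothetical, not ffmpeg's actual code): keep the arithmetic-coder window MSB-justified in a 64-bit cache, refill several bytes at once, and renormalize with a single shift derived from the leading zeros of the range instead of a per-bit loop.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical VP8-style boolean decoder with a 64-bit cache. */
typedef struct {
    const uint8_t *buf, *end;
    uint64_t value;   /* bitstream cache, MSB-justified */
    int      count;   /* buffered bits below the top byte; <0 = refill */
    unsigned range;   /* 128..255 */
} BoolDec;

static void bd_fill(BoolDec *c)
{
    /* pull in up to 8 bytes at once instead of one bit per renorm step */
    int shift = 64 - 8 - (c->count + 8);
    while (shift >= 0 && c->buf < c->end) {
        c->value |= (uint64_t)*c->buf++ << shift;
        c->count += 8;
        shift    -= 8;
    }
}

static void bd_init(BoolDec *c, const uint8_t *buf, size_t size)
{
    c->buf = buf; c->end = buf + size;
    c->value = 0; c->count = -8; c->range = 255;
    bd_fill(c);
}

static int bd_bool(BoolDec *c, int prob)
{
    unsigned split    = 1 + (((c->range - 1) * prob) >> 8);
    uint64_t bigsplit = (uint64_t)split << 56;
    int bit = 0;

    if (c->count < 0)
        bd_fill(c);

    if (c->value >= bigsplit) {
        c->range -= split;
        c->value -= bigsplit;
        bit = 1;
    } else {
        c->range = split;
    }

    /* single-shift renorm: bring range back into 128..255
       (__builtin_clz is gcc/clang-specific) */
    {
        int shift = __builtin_clz(c->range) - 24;
        c->range <<= shift;
        c->value <<= shift;
        c->count  -= shift;
    }
    return bit;
}
```

With a 64-bit cache the refill runs at most once every few symbols, so the hot path of bd_bool is branch + shift with no bytestream access.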
> Optimize decode_block_coeffs (it can surely be made faster).
decode_block_coeffs is the most important arithmetic-coding piece; libvpx has a special manually inlined arithmetic decoder for it that jumps straight from one binary decision to reading the next decision on the same branch. The Huffman tree is inlined into the code as well, with no tree-walking loop, and more effort is made to keep the necessary variables in registers across all the blocks in a macroblock.
It's obviously a lot less readable but should be decently faster.
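To make the difference concrete, here's a toy self-contained C sketch: a stub bit source stands in for the arithmetic decoder, and the four-token tree is made up for illustration (the real coefficient tree is larger). The first function is the generic vp56/vp8-style table walk; the second is the manually inlined equivalent.

```c
#include <stdint.h>

/* Stub bit source so the sketch is self-contained: in the real decoder
 * each branch is an arithmetic-coded decision with its own probability;
 * here we just replay a fixed bit string. */
typedef struct { const uint8_t *bits; int pos; } BitSrc;
static int next_bit(BitSrc *s) { return s->bits[s->pos++]; }

enum { TOK_EOB, TOK_0, TOK_1, TOK_2 };

/* Generic table-driven walk: positive entries are node indices,
 * non-positive entries are negated leaf tokens. */
static const int8_t tok_tree[3][2] = {
    { -TOK_EOB, 1 },
    { -TOK_0,   2 },
    { -TOK_1, -TOK_2 },
};

static int get_token_tree(BitSrc *s)
{
    int i = 0;
    do {
        i = tok_tree[i][next_bit(s)];
    } while (i > 0);
    return -i;
}

/* Manually inlined version: no table loads, no loop control; each
 * branch falls straight through to the next decision, which also lets
 * the compiler keep the decoder state in registers. */
static int get_token_inline(BitSrc *s)
{
    if (!next_bit(s)) return TOK_EOB;
    if (!next_bit(s)) return TOK_0;
    if (!next_bit(s)) return TOK_1;
    return TOK_2;
}
```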
> Improve edge emulation handling (we currently have the worst of both
> worlds -- we require padding on the edges, yet we use the slow
> ff_emulated_edge_mc -- we should pick one method or the other).
We only pad 16 pixels on each edge, so we'd still have to use ff_emulated_edge_mc for MVs that point too far off-frame. But there's definitely room for improvement in using it less.
Intra pred is the only thing that still requires the padded edge at the moment, and only for one row of pixels on the top/left. The best solution I can think of is predicting into a temp buffer and copying the result to the real image for the top/left blocks.
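A scalar C sketch of that temp-buffer approach, using 16x16 DC prediction as the example (function names and layout are made up for illustration; unavailable neighbours fall back the way DC_PRED does, averaging only what exists and using 128 when nothing does):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* DC-predict a 16x16 macroblock into a contiguous temp buffer, using
 * only neighbours that actually exist instead of reading padded frame
 * edges.  top/left may be NULL when off-frame. */
static void pred16x16_dc(uint8_t *dst /* 16x16, stride 16 */,
                         const uint8_t *top, const uint8_t *left)
{
    int sum = 0, n = 0;
    for (int i = 0; i < 16; i++) {
        if (top)  { sum += top[i];  n++; }
        if (left) { sum += left[i]; n++; }
    }
    /* equivalent to (sum+16)>>5, (sum+8)>>4 or 128, by availability */
    memset(dst, n ? (sum + n / 2) / n : 128, 16 * 16);
}

static void predict_mb(uint8_t *frame, ptrdiff_t stride, int mb_x, int mb_y)
{
    uint8_t tmp[16 * 16], left_buf[16];
    uint8_t *mb = frame + mb_y * 16 * stride + mb_x * 16;
    const uint8_t *top = NULL, *left = NULL;

    if (mb_y > 0)                    /* row above exists in the frame */
        top = mb - stride;
    if (mb_x > 0) {                  /* gather the left column */
        for (int y = 0; y < 16; y++)
            left_buf[y] = mb[y * stride - 1];
        left = left_buf;
    }
    pred16x16_dc(tmp, top, left);
    for (int y = 0; y < 16; y++)     /* copy back into the real image */
        memcpy(mb + y * stride, tmp + y * 16, 16);
}
```

The copy costs a little for border macroblocks, but interior macroblocks (the common case) can keep predicting directly into the frame, and the padding requirement for intra pred goes away.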
> Optimize cache handling (mvs and nnz).
> Optimize MV prediction.
> Probably lots of other stuff I haven't thought of, feel free to
> contribute ideas.
Do border_xchg and such to be able to run the loop filter on a mb row immediately after decoding it, rather than one mb row behind.
Related to decode_block_coeffs: I couldn't decide on the best method for dequant; the current qmul[!!i] isn't good.
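For the full-block case the obvious option is just peeling coefficient 0, as sketched below (illustrative only; in decode_block_coeffs itself the multiply happens as each coefficient is decoded, so it isn't quite this simple there):

```c
#include <stdint.h>

/* Coefficient 0 takes the DC factor and 1..15 the AC factor, so
 * hoisting the first coefficient out of the loop removes the
 * per-coefficient qmul[!!i] select. */
static void dequant_block(int16_t block[16], const uint16_t qmul[2])
{
    block[0] *= qmul[0];
    for (int i = 1; i < 16; i++)
        block[i] *= qmul[1];
}
```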
Use the MB_TYPE macros to be able to check i4x4 and sub-16x16 in one op, since both are checked together in several places.
> The current top priority for x86 speed is by far and away the Normal
> loopfilter -- it's something like 60-70%+ of the total time, since
> we've SIMD-optimized nearly everything else of note.
> Dark Shikari