[Ffmpeg-devel] VP3/Theora Perfection

Mon May 16 22:10:29 CEST 2005

Michael Niedermayer wrote:
> u mean the 32 *1 and 32 *3 ones? *3 is just x+x+x or x+(x<<1) and gcc will 
> change this for you

	If you say so. If that works, great. Actually, I was thinking of 16 *3 
mults per edge.  A coded fragment will have its top and left edges 
filtered. If the fragment below it or to its right is not coded in the 
same frame, the filter will also be run over that edge.
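
	As a rough sketch of that edge rule (not the code in vp3.c): the 
enum names, the coded[] layout, and the choice to skip edges that lie on 
the picture border are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical edge flags for one fragment; these are not names from vp3.c. */
enum { EDGE_TOP = 1, EDGE_LEFT = 2, EDGE_BOTTOM = 4, EDGE_RIGHT = 8 };

/*
 * coded[] holds one flag per fragment in raster order (w x h fragments).
 * Returns the set of edges the loop filter would be run over for the
 * fragment at (x, y). Edges on the picture border are skipped, which is
 * an assumption of this sketch.
 */
static int edges_to_filter(const bool *coded, int w, int h, int x, int y)
{
    int edges = 0;

    assert(x >= 0 && x < w && y >= 0 && y < h);

    if (!coded[y * w + x])
        return 0;                       /* uncoded fragments start no filtering */

    if (y > 0)
        edges |= EDGE_TOP;              /* a coded fragment filters its top edge */
    if (x > 0)
        edges |= EDGE_LEFT;             /* ...and its left edge */
    if (y + 1 < h && !coded[(y + 1) * w + x])
        edges |= EDGE_BOTTOM;           /* neighbor below is not coded */
    if (x + 1 < w && !coded[y * w + x + 1])
        edges |= EDGE_RIGHT;            /* neighbor to the right is not coded */

    return edges;
}
```

	An edge shared by two coded fragments is picked up once, as the top or 
left edge of the lower or right-hand fragment.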

> anyway, vp3.c is very inefficiently written

	It's a miracle vp3.c got written at all...

> * the switch / case mess used for some vlc decoding 

	Expound. Are you talking about the unpack_token() function? That is 
called a lot and perhaps should be inlined. Otherwise, the actual 
switch/case logic should reduce to a jump table. On2's original code 
used a strange tree of if-else branches. I always assumed they were 
trying to manually optimize the code flow.

> * the if(get_bits1()) branch trees for the remaining vlcs

	I have wanted to revise get_superblock_run_length(), 
get_fragment_run_length(), and get_mode_code() so that they all use a 
VLC table. I thought you might call that a waste of time, though.
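
	For illustration, this is roughly what a table-driven decode looks like 
in place of a bit-at-a-time branch tree. Everything here is made up for 
the sketch: the BitReader type, peek_bits(), and the toy prefix code 
(0 -> 1, 10 -> 2, 110 -> 3, 111 -> 4). It is not the VP3 run-length code 
and it is not the lavc VLC API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader, just enough for the sketch. */
typedef struct {
    const uint8_t *buf;
    size_t size_bits;   /* total number of valid bits in buf */
    size_t pos;         /* current bit position */
} BitReader;

/* Look at the next n bits without consuming them; bits past the end read as 0. */
static unsigned peek_bits(const BitReader *br, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++) {
        size_t p = br->pos + (size_t)i;
        unsigned bit = 0;
        if (p < br->size_bits)
            bit = (br->buf[p >> 3] >> (7 - (p & 7))) & 1u;
        v = (v << 1) | bit;
    }
    return v;
}

typedef struct {
    uint8_t len;    /* code length in bits */
    uint8_t value;  /* decoded symbol */
} VlcEntry;

/* Toy code: 0 -> 1, 10 -> 2, 110 -> 3, 111 -> 4. Longest code is 3 bits. */
#define TOY_VLC_BITS 3
static const VlcEntry toy_vlc[1 << TOY_VLC_BITS] = {
    {1, 1}, {1, 1}, {1, 1}, {1, 1},   /* 0xx */
    {2, 2}, {2, 2},                   /* 10x */
    {3, 3},                           /* 110 */
    {3, 4},                           /* 111 */
};

/* One table lookup replaces up to three get_bits1() branches. */
static int toy_get_vlc(BitReader *br)
{
    assert(br->pos < br->size_bits);
    const VlcEntry *e = &toy_vlc[peek_bits(br, TOY_VLC_BITS)];
    br->pos += e->len;
    return e->value;
}
```

	The table is indexed by the next 3 bits, so short codes simply occupy 
several entries.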

> * dquant+idct which is passed a coeff_count which is always 64 (note i didnt 
> check that but it has to be as the code wont work if it werent 64), 

	Hmm, I do not think that coeff_count needs to be tracked as part of a 
fragment. It looks like last_coeff was supposed to be sent to 
dquant+idct. The reason for this (ideally) is to select between 3 
different IDCTs depending on the number of non-zero coeffs (On2 is 
really proud of this since it seems to show up in all of their codecs).
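
	A minimal sketch of that kind of dispatch, assuming last_coeff counts 
the coefficients in zigzag order up to and including the last non-zero 
one. The enum names and the thresholds (1 and 10) are illustrative 
assumptions, not values taken from vp3.c.

```c
#include <assert.h>

/* Hypothetical labels for the three IDCT variants. */
typedef enum { IDCT_DC_ONLY, IDCT_SPARSE, IDCT_FULL } IdctKind;

/*
 * Pick an IDCT variant from the position of the last non-zero coefficient.
 * The cut-off values below are assumptions chosen for the example.
 */
static IdctKind choose_idct(int last_coeff)
{
    assert(last_coeff >= 0 && last_coeff <= 64);

    if (last_coeff <= 1)
        return IDCT_DC_ONLY;    /* only the DC term can be non-zero */
    if (last_coeff <= 10)
        return IDCT_SPARSE;     /* a few low-frequency terms */
    return IDCT_FULL;           /* general case */
}
```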

> actually 
> the dequant should be done during bitstream decoding

	Why? Dequantization is a parallelizable operation that can be optimized 
with SIMD instructions. That is why it is done at the same time as the 
optimized IDCTs.
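
	To make the point concrete, here is a plain C sketch of a block 
dequantizer. The function name, the argument layout, and the 16-bit 
types are assumptions for the example, not the vp3.c interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Multiply each coefficient by its quantizer entry. No iteration depends
 * on any other, which is what lets a SIMD version process several
 * coefficients per instruction.
 */
static void dequant_block(int16_t *restrict out,
                          const int16_t *restrict coeffs,
                          const uint16_t *restrict quant,
                          int count)
{
    assert(count >= 0 && count <= 64);

    for (int i = 0; i < count; i++)
        out[i] = (int16_t)(coeffs[i] * quant[i]);
}
```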

> * the idct uses its own API incompatible to the idct system used in lavc

	And the CPU-specific optimizations have been ported over. These same 
implementations will also be reused when VP5 and VP6 are reverse 
engineered (probably VP4 as well, but that is lower priority).

> * using a 2*width*height array to store dct coefficients, which is memset(0) 
> for every frame

> * no slices
> * the loop filter is applied after the whole frame has been decoded

	To address these issues, it may be necessary to rework the render 
process. Render slice 0 (all planes). Render slice 1, apply loop filter 
on slice 0, dispatch slice 0. Render slice (n), apply loop filter on 
slice (n-1), dispatch slice (n-1).
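
	The ordering can be sketched as a small scheduler that only records 
which step happens when. The type and function names are hypothetical, 
and the final flush of the last slice is my assumption about how the 
sequence would have to end.

```c
#include <assert.h>

typedef enum { OP_RENDER, OP_FILTER, OP_DISPATCH } SliceOp;

typedef struct {
    SliceOp op;
    int slice;
} SliceEvent;

/*
 * Write the render / loop filter / dispatch order for num_slices slices
 * into log[] and return the number of events written. log[] must have
 * room for 3 * num_slices entries.
 */
static int schedule_slices(int num_slices, SliceEvent *log)
{
    int n = 0;

    assert(num_slices >= 0);

    for (int s = 0; s < num_slices; s++) {
        log[n++] = (SliceEvent){ OP_RENDER, s };
        if (s > 0) {
            /* the previous slice now has its lower neighbor rendered */
            log[n++] = (SliceEvent){ OP_FILTER, s - 1 };
            log[n++] = (SliceEvent){ OP_DISPATCH, s - 1 };
        }
    }

    if (num_slices > 0) {
        /* assumed flush: nothing follows the last slice, so finish it here */
        log[n++] = (SliceEvent){ OP_FILTER, num_slices - 1 };
        log[n++] = (SliceEvent){ OP_DISPATCH, num_slices - 1 };
    }

    return n;
}
```

	With this order, only the current slice and the one before it need to 
stay unfiltered at any time.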

> * mmx.h based asm code (slow due to gcc bugs, and problematic due to bugs in
> mmx.h)

>>has MMX and SSE2 optimizations that I can port over when I am confident
>>that the C-based loop filter works.
> 
> 
> note, please do not use mmx.h, 

	Please give me a good reason. I have checked code generated from mmx.h 
against objdump and the generated ASM is correct.

> furthermore are you sure the original on2 
> source is under a lgpl compatible license? maybe it is, iam just asking

	The original On2 source code was released as GPL. Maybe I took a 
logical leap since I knew that Theora would be based on the original 
code and Xiph uses a much looser license. I think I have some email 
correspondence on this somewhere.

> and why port instead of writing our own, the loops are relatively trivial?

	Maybe trivial according to you. And there is no way I am writing new 
ASM functions using that AT&T syntax slop.

	Thanks for tracking down that dequantizer bug. That had been in there 
since the first iteration 2 years ago.
-- 
	-Mike Melanson