[Ffmpeg-devel] VP3/Theora Perfection
Mike Melanson
Mon May 16 22:10:29 CEST 2005
Michael Niedermayer wrote:
> you mean the 32 *1 and 32 *3 ones? *3 is just x+x+x or x+(x<<1) and gcc will
> change this for you
If you say so. If that works, great. Actually, I was thinking of 16 *3
mults per edge. A coded fragment will have its top and left edges
filtered. If either of the fragments to the bottom or right are not
coded in the same frame, the filter will also be run over those edges.
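As a rough illustration of the *3 rewrite being discussed, here is a minimal sketch. The helper names are hypothetical and not taken from vp3.c; the point is only that all three forms compute the same value, so the compiler is free to pick whichever is cheapest.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of vp3.c. */

/* Plain multiply: the form a compiler may strength-reduce by itself. */
static inline int32_t times3_mul(int32_t x)
{
    return x * 3;
}

/* Addition form: valid for any x whose triple fits in int32_t. */
static inline int32_t times3_add(int32_t x)
{
    return x + x + x;
}

/* Shift form: assumes x >= 0, because left-shifting a negative
 * value is undefined behaviour in ISO C before C23. */
static inline int32_t times3_shift(int32_t x)
{
    return x + (x << 1);
}
```

Since loop filter differences can be negative, the addition form (or the plain multiply) is the safer one to write by hand.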
> anyway, vp3.c is very inefficiently written
It's a miracle vp3.c got written at all...
> * the switch / case mess used for some vlc decoding
Expound. Are you talking about the unpack_token() function? That is
called a lot and perhaps should be inline'd. Otherwise, the actual
switch/case logic should reduce to a jump table. On2's original code
used a strange tree of if-else branches. I always assumed they were
trying to manually optimize the code flow.
> * the if(get_bits1()) branch trees for the remaining vlcs
I have wanted to revise get_superblock_run_length(),
get_fragment_run_length(), and get_mode_code() so that they all use a
VLC table. I thought you might call that a waste of time, though.
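To make the branch-tree versus table trade-off concrete, here is a minimal sketch using a toy prefix code (0, 10, 110, 111). The bit reader, the code itself, and every name below are assumptions for illustration; this is neither the real VP3 run-length/mode code nor the lavc get_bits/VLC API.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy MSB-first bit reader (hypothetical, not lavc's GetBitContext). */
typedef struct {
    const uint8_t *buf;
    size_t bit_pos;
} ToyBits;

static unsigned toy_bit_at(const ToyBits *b, size_t pos)
{
    return (b->buf[pos >> 3] >> (7 - (pos & 7))) & 1u;
}

static unsigned toy_get_bit(ToyBits *b)
{
    return toy_bit_at(b, b->bit_pos++);
}

/* Branch-tree decoder: one conditional branch per bit read. */
static int toy_decode_tree(ToyBits *b)
{
    if (!toy_get_bit(b)) return 0;   /* "0"   */
    if (!toy_get_bit(b)) return 1;   /* "10"  */
    if (!toy_get_bit(b)) return 2;   /* "110" */
    return 3;                        /* "111" */
}

/* Table decoder: peek the maximum code length (3 bits), then look up
 * the symbol and the number of bits that code really used. */
typedef struct { uint8_t value, len; } ToyVlcEntry;

static const ToyVlcEntry toy_vlc[8] = {
    {0, 1}, {0, 1}, {0, 1}, {0, 1},  /* 0xx */
    {1, 2}, {1, 2},                  /* 10x */
    {2, 3},                          /* 110 */
    {3, 3},                          /* 111 */
};

/* The peek may look up to 2 bits past the current code, so the caller
 * is assumed to pad the buffer with at least one extra byte. */
static int toy_decode_table(ToyBits *b)
{
    unsigned idx = (toy_bit_at(b, b->bit_pos) << 2)
                 | (toy_bit_at(b, b->bit_pos + 1) << 1)
                 |  toy_bit_at(b, b->bit_pos + 2);
    b->bit_pos += toy_vlc[idx].len;
    return toy_vlc[idx].value;
}
```

Both decoders return the same symbols and consume the same number of bits; the table version replaces a chain of data-dependent branches with a single lookup.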
> * dquant+idct which is passed a coeff_count which is always 64 (note I didn't
> check that but it has to be, as the code won't work if it weren't 64),
Hmm, I do not think that coeff_count needs to be tracked as part of a
fragment. It looks like last_coeff was supposed to be sent to
dquant+idct. The reason for this (ideally) is to select between 3
different IDCTs depending on the number of non-zero coeffs (On2 is
really proud of this since it seems to show up in all of their codecs).
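A minimal sketch of the selection idea, under stated assumptions: last_coeff is taken here to mean the number of coefficients, in zigzag order, up to and including the last non-zero one, and the thresholds 1 and 10 are illustrative rather than confirmed from the On2 source. The enum and function names are hypothetical.

```c
/* Hypothetical IDCT variants; the real functions live elsewhere. */
typedef enum {
    TOY_IDCT_DC_ONLY,  /* only the DC coefficient is non-zero       */
    TOY_IDCT_SPARSE,   /* a few low-frequency coefficients are set  */
    TOY_IDCT_FULL      /* general 8x8 transform                     */
} ToyIdctKind;

/* Pick the cheapest transform that still covers every non-zero
 * coefficient. The cut-offs 1 and 10 are assumptions. */
static ToyIdctKind toy_pick_idct(int last_coeff)
{
    if (last_coeff <= 1)
        return TOY_IDCT_DC_ONLY;
    if (last_coeff <= 10)
        return TOY_IDCT_SPARSE;
    return TOY_IDCT_FULL;
}
```

If coeff_count really is always 64, a dispatch like this would always fall through to the full transform, which is why passing last_coeff instead matters.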
> actually
> the dequant should be done during bitstream decoding
Why? Dequantization is a parallelizable operation that can be optimized
with SIMD instructions. That is why it is done at the same time as the
optimized IDCTs.
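A minimal sketch of why dequantization parallelizes well: each coefficient is multiplied by its own quantizer entry, and no element depends on another. The function name and the 16-bit layout are assumptions for illustration, not the vp3.c code.

```c
#include <stdint.h>

/* Hypothetical block dequantizer: 64 independent multiplies.
 * With no dependency between iterations, the loop maps directly
 * onto packed 16-bit SIMD multiplies. */
static void toy_dequant_block(int16_t coeffs[64], const int16_t qmat[64])
{
    for (int i = 0; i < 64; i++)
        coeffs[i] = (int16_t)(coeffs[i] * qmat[i]);
}
```

Doing the multiply during bitstream decoding instead would touch only the non-zero coefficients, one at a time, which is the other side of this argument.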
> * the idct uses its own API incompatible to the idct system used in lavc
And the CPU-specific optimizations have been ported over. These same
implementations will also be reused when VP5 and VP6 are reverse
engineered (probably VP4 as well, but that is lower priority).
> * using a 2*width*height array to store dct coefficients, which is memset(0)
> for every frame
> * no slices
> * the loop filter is applied after the whole frame has been decoded
To address these issues, it may be necessary to rework the render
process. Render slice 0 (all planes). Render slice 1, apply loop filter
on slice 0, dispatch slice 0. Render slice (n), apply loop filter on
slice (n-1), dispatch slice (n-1).
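The ordering described above can be sketched as a loop that keeps the loop filter one slice behind the renderer. Everything here is hypothetical scaffolding: the callback type, the trace helpers, and the final flush of the last slice, which the description implies but does not spell out.

```c
/* Hypothetical per-slice callback: slice index plus user context. */
typedef void (*toy_slice_fn)(int slice, void *opaque);

/* Render slice s, then filter and dispatch slice s-1, so the filter
 * always runs on a slice whose lower neighbour is already rendered. */
static void toy_decode_frame_sliced(int n_slices,
                                    toy_slice_fn render,
                                    toy_slice_fn loop_filter,
                                    toy_slice_fn dispatch,
                                    void *opaque)
{
    for (int s = 0; s < n_slices; s++) {
        render(s, opaque);
        if (s > 0) {
            loop_filter(s - 1, opaque);
            dispatch(s - 1, opaque);
        }
    }
    /* Flush: the last slice has no successor to wait for. */
    if (n_slices > 0) {
        loop_filter(n_slices - 1, opaque);
        dispatch(n_slices - 1, opaque);
    }
}

/* Tiny trace recorder so the call order can be inspected. */
typedef struct { char log[64]; int len; } ToyTrace;

static void toy_trace_add(ToyTrace *t, char op, int slice)
{
    if (t->len + 2 < (int)sizeof t->log) {
        t->log[t->len++] = op;
        t->log[t->len++] = (char)('0' + slice);
        t->log[t->len] = '\0';
    }
}

static void toy_render(int s, void *o)   { toy_trace_add(o, 'R', s); }
static void toy_filter(int s, void *o)   { toy_trace_add(o, 'F', s); }
static void toy_dispatch(int s, void *o) { toy_trace_add(o, 'D', s); }
```

Calling toy_decode_frame_sliced(3, toy_render, toy_filter, toy_dispatch, &trace) on a zeroed ToyTrace records render 0, render 1, filter 0, dispatch 0, render 2, filter 1, dispatch 1, filter 2, dispatch 2.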
> * mmx.h based asm code (slow due to gcc bugs, and problematic due to bugs in
> mmx.h)
>>has MMX and SSE2 optimizations that I can port over when I am confident
>>that the C-based loop filter works.
>
>
> note, please do not use mmx.h,
Please give me a good reason. I have checked code generated from mmx.h
against objdump and the generated ASM is correct.
> furthermore are you sure the original on2
> source is under a lgpl compatible license? maybe it is, I am just asking
The original On2 source code was released as GPL. Maybe I took a
logical leap since I knew that Theora would be based on the original
code and Xiph uses a much looser license. I think I have some email
correspondence on this somewhere.
> and why port instead of writing our own, the loops are relatively trivial?
Maybe trivial according to you. And there is no way I am writing new
ASM functions using that AT&T syntax slop.
Thanks for tracking down that dequantizer bug. That was in there since
the first iteration 2 years ago.
--
-Mike Melanson