[FFmpeg-devel] [PATCH] MMX implementation of VC-1 inverse transforms

Ivan Kalvachev ikalvachev
Wed Jan 16 03:05:54 CET 2008


On Jan 14, 2008 10:33 PM, Loren Merritt <lorenm at u.washington.edu> wrote:
> On Mon, 14 Jan 2008, Ivan Kalvachev wrote:
>
> > - Why you choose to transpose at all. Just to save time and effort?
> > It is usual to have separate version of SIMD depending if they work on
> > row or columns. The row and column stages are different and you pass
> > the differences as parameters.
>
> Who says it's usual?

It is usual because all IDCT functions in i386/dsp do it.
These include libmpeg2, xvid and simple_mmx. The only IDCT that mentions
transpose is vp3, but it also has separate col/row code.

Keep in mind that I do not deny that the mmx idct transforms use
different permutations to get the coefficients in the order they like,
an order that makes the first pass easier and eventually lands the
intermediate results in an order that doesn't require an additional
transformation on or after the second (col) pass.

> A transposed scantable and a column/transpose/column
> transform is faster than a row/columntransform for iDCT and iHCT, I have
> no reason to doubt that applies to VC1's transform as well.

Is there some theoretical explanation of this statement?

I'm sure you have actually tested both cases and I really want to peek
at the h264 code that works without transpose, if you still keep it
around.

On the other hand, I'm not sure what you mean by row/column transform.
The usual operation of the above-mentioned idcts is scantable/row/column.
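
As a side note on the two orderings discussed above: for a separable
transform, feeding a transposed block through column pass / transpose /
column pass gives the same numbers as row pass / column pass on the
original block, so the disagreement is only about speed. A minimal C
sketch of that equivalence follows. The 4x4 matrix T, the (acc + 4) >> 3
rounding and all function names are illustrative assumptions, not the
code from the patch, and it assumes >> on a negative int is an
arithmetic shift.

```c
#include <assert.h>
#include <string.h>

#define N 4

/* Illustrative integer matrix (assumption); any matrix works for this check. */
static const int T[N][N] = {
    { 17,  17,  17,  17 },
    { 22,  10, -10, -22 },
    { 17, -17, -17,  17 },
    { 10, -22,  22, -10 },
};

/* 1-D pass down every column: out = (T * in + 4) >> 3, element by element. */
static void col_pass(int out[N][N], int in[N][N])
{
    for (int c = 0; c < N; c++)
        for (int r = 0; r < N; r++) {
            int acc = 0;
            for (int k = 0; k < N; k++)
                acc += T[r][k] * in[k][c];
            out[r][c] = (acc + 4) >> 3;
        }
}

/* 1-D pass along every row: out = (in * T^t + 4) >> 3, element by element. */
static void row_pass(int out[N][N], int in[N][N])
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            int acc = 0;
            for (int k = 0; k < N; k++)
                acc += in[r][k] * T[c][k];
            out[r][c] = (acc + 4) >> 3;
        }
}

static void transpose(int out[N][N], int in[N][N])
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            out[c][r] = in[r][c];
}

/* Returns 1 if row/column on x equals column/transpose/column on x^t. */
static int orderings_agree(int x[N][N])
{
    int a[N][N], b[N][N], xt[N][N], t1[N][N], t2[N][N];

    row_pass(a, x);        /* scantable/row/column */
    col_pass(b, a);

    transpose(xt, x);      /* stands in for the transposed scantable */
    col_pass(t1, xt);
    transpose(t2, t1);
    col_pass(t1, t2);

    return memcmp(b, t1, sizeof(b)) == 0;
}

/* Runs the comparison on one arbitrary sample block. */
static void check_orderings(void)
{
    int x[N][N] = {
        { 2047, -2048,   13,   -7 },
        {  100,   -50,    0,    1 },
        {   -1,   999, -321,   64 },
        {    5,     0,    0, -900 },
    };
    assert(orderings_agree(x));
}
```

Calling check_orderings() only shows that both orderings agree on the
result. It says nothing about which one is faster, which is the actual
question here.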


> The only benefit of row/column is that pmaddwd adds a little bit of
> precision compared to a pure 16bit column transform. But that applies only
> to an integer approximation of a real DCT, not if the standard has
> already made the 16bit approximation.

Somehow I think that Michael would also have tested all the cases and
picked the fastest code. And I know Michael does a lot of tests and
benchmarks when he writes something.
Loren, can you make simple_mmx even faster? (you would write it
quicker than I could possibly write h264 inverse transform without
transpose).

On Jan 14, 2008 11:01 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Mon, Jan 14, 2008 at 09:12:40PM +0100, Christophe GISQUET wrote:
> [...]
> > > - Am I wrong or you do all the math in 16 bit signed saturation mode?
> > > According to vc1 draft in first stage the input is in the range
> > > [-2048;2047] the multiply constants  are in range [-16;16], this makes
> > > range [-32768;32768] per multiply and you can have 8 of them.
> > > Or multiply constants in range [-22;22], that make range
> > > [-45056;45056] per multiply and you can have 4 of them.
>
> you are missing a detail here
> 45056 >> 3 would be > 4096 thus possibly violate the limit for the 2nd stage
> input. Still the 512 limit of the output with >>7 before does not look like
> the naive implementation will work with 16bit

It's not my fault that they cannot do math at M$ ;)
The draft actually says that the intermediate result has to be
saturated to that range. So it is possible that the C variant also
doesn't work according to the spec. I wonder what the reference
source does?
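
To make the numbers quoted above easy to re-check, here is a small C
sketch that redoes the worst-case arithmetic in 32 bits. It only uses
the figures from this thread (input magnitude 2048, constants 16 and 22,
the >>3, the 4096 second-stage limit). The helper names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if v is representable as a signed 16-bit value. */
static int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}

/* Largest magnitude a single product can reach, computed in 32 bits. */
static int32_t worst_product(int32_t max_input, int32_t max_coeff)
{
    assert(max_input >= 0 && max_coeff >= 0);
    return max_input * max_coeff;
}

/* Returns 1 if one worst-case product, after >>shift, stays within limit. */
static int within_stage_limit(int32_t max_input, int32_t max_coeff,
                              int shift, int32_t limit)
{
    return (worst_product(max_input, max_coeff) >> shift) <= limit;
}
```

With these, worst_product(2048, 16) is 32768, which already fails
fits_int16(), and worst_product(2048, 22) is 45056, which after >>3 is
5632 and so fails within_stage_limit(2048, 22, 3, 4096). That is for a
single product, before any of the 4 or 8 terms are summed.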


> > > In the second phase the input range is doubled to [-4096,4095]
> > >
> > > Are you sure your transforms produce the same result as their _c equivalents?
> >
> > I did test bit exactness (with win32 dll output) but albeit on few
> > sequences. Everything was perfect.
> >
> > The reference I found said it could be done on 16 bits maths.  Maybe it
> > needs a bias to correct, but as output is usually in the range
> > [-128;127], it's pretty symmetrical. However, indeed, it would be better
> > if proof could be given.
>
> theres a difference between "can be done" and "it works with the naive
> implementation"
>
> as random example:
>
> naive:
> (22*X+17*Y) >> 3 will not work with 16bit and X and or Y =2048
>
> alternative:
> ((X + ((X + (Y>>1))>>1))>>1) + 2*(X+Y) should work fine
>
>
> there are of course many intermediate variants
> the key point to keep in mind is that (2*x + y)>>1 == x + (y>>1)

I wonder if there is some collection of nasty mmx tricks, like the above one.
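
For anyone who wants to convince themselves of the example above, a
minimal C sketch that compares the shift-and-add form, with every
intermediate kept in int16_t, against the naive expression evaluated in
32 bits. The function names are made up for illustration, and it assumes
>> on a negative value is an arithmetic shift (which is what psraw does,
and what common compilers do for signed ints).

```c
#include <assert.h>
#include <stdint.h>

/* Reference: evaluated in 32 bits so the products cannot overflow. */
static int32_t naive32(int32_t x, int32_t y)
{
    return (22 * x + 17 * y) >> 3;
}

/* Shift-and-add form from the mail, every intermediate stored as int16_t. */
static int16_t shift_add16(int16_t x, int16_t y)
{
    assert(x >= -2048 && x <= 2047 && y >= -2048 && y <= 2047);
    int16_t t = (int16_t)(x + (y >> 1));   /* X + (Y>>1)            */
    t = (int16_t)(x + (t >> 1));           /* X + ((X + (Y>>1))>>1) */
    t = (int16_t)(t >> 1);
    return (int16_t)(t + 2 * (x + y));
}

/* The key identity: (2*x + y)>>1 == x + (y>>1). Returns 1 if it holds. */
static int identity_holds(int32_t x, int32_t y)
{
    return ((2 * x + y) >> 1) == (x + (y >> 1));
}

/* Counts inputs in [-2048, 2047]^2 where the two forms disagree. */
static long count_mismatches(void)
{
    long bad = 0;
    for (int x = -2048; x <= 2047; x++)
        for (int y = -2048; y <= 2047; y++)
            if (shift_add16((int16_t)x, (int16_t)y) != naive32(x, y))
                bad++;
    return bad;
}
```

The identity is applied three times, once per >>1, which is why the
result is bit-identical to (22*X + 17*Y) >> 3 and not just close to it.
The largest intermediate is the final sum, which stays below 10000 in
magnitude for inputs in [-2048, 2047].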



