[FFmpeg-devel] [PATCH] MMX implementation of VC-1 inverse transforms

Loren Merritt lorenm
Fri Jan 18 11:05:04 CET 2008

On Wed, 16 Jan 2008, Ivan Kalvachev wrote:
> On Jan 14, 2008 10:33 PM, Loren Merritt <lorenm at u.washington.edu> wrote:
>> A transposed scantable and a column/transpose/column
>> transform is faster than a row/column transform for iDCT and iHCT; I
>> have no reason to doubt that applies to VC1's transform as well.
> Is there some theoretical explanation of this statement?

Yes. With very few exceptions, x86 SIMD instructions operate
element-wise on a pair of registers, not on pairs of values within one
register. Furthermore, any DCT more complex than a brute-force matrix
multiply won't perform the same operation on all coefficients at every
step. So even after you shuffle things around so that a row transform
can operate on the right pairs of coefficients (which costs actual
shuffle instructions, whereas a column transform just uses different
register names), some of the arithmetic will be wasted.
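
To make the point concrete, here is a minimal scalar C sketch, not the
ffmpeg or x264 code. It assumes a 4x4 block and a butterfly in the style
of the H.264 4x4 inverse transform; every function name is made up for
illustration. Each row of the array stands in for one SIMD register, so
column_pass() only ever combines whole rows lane by lane, while
row_pass() has to combine values that sit within one row.

```c
#include <assert.h>
#include <string.h>

enum { N = 4 };

/* 1-D pass down the columns. Each statement combines whole rows lane by
 * lane (the loop over j), which is the shape of an SIMD instruction
 * acting on a pair of registers when one row lives in one register.
 * Note that >> on negative ints is assumed to be an arithmetic shift. */
static void column_pass(int b[N][N])
{
    int e[N][N];
    for (int j = 0; j < N; j++) {
        e[0][j] = b[0][j] + b[2][j];
        e[1][j] = b[0][j] - b[2][j];
        e[2][j] = (b[1][j] >> 1) - b[3][j];
        e[3][j] = b[1][j] + (b[3][j] >> 1);
    }
    for (int j = 0; j < N; j++) {
        b[0][j] = e[0][j] + e[3][j];
        b[1][j] = e[1][j] + e[2][j];
        b[2][j] = e[1][j] - e[2][j];
        b[3][j] = e[0][j] - e[3][j];
    }
}

/* The same 1-D pass along the rows. The operands now sit in different
 * lanes of the same row; in SIMD that means shuffles before the math. */
static void row_pass(int b[N][N])
{
    for (int i = 0; i < N; i++) {
        int e0 = b[i][0] + b[i][2];
        int e1 = b[i][0] - b[i][2];
        int e2 = (b[i][1] >> 1) - b[i][3];
        int e3 = b[i][1] + (b[i][3] >> 1);
        b[i][0] = e0 + e3;
        b[i][1] = e1 + e2;
        b[i][2] = e1 - e2;
        b[i][3] = e0 - e3;
    }
}

static void transpose(int b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            int t = b[i][j];
            b[i][j] = b[j][i];
            b[j][i] = t;
        }
}

/* Reference: row pass, then column pass, on a block in natural order. */
static void idct_row_column(int b[N][N])
{
    row_pass(b);
    column_pass(b);
}

/* Column/transpose/column. The input must already be transposed, which
 * is what a transposed scantable provides at no extra cost. */
static void idct_col_transpose_col(int bt[N][N])
{
    column_pass(bt);
    transpose(bt);
    column_pass(bt);
}

/* Returns 1 if both paths give the same output for a sample block. */
static int paths_agree(void)
{
    int x[N][N] = { { 64, -12, 5, 0 }, { 9, 3, -1, 0 },
                    { -4, 2, 0, 0 }, { 1, 0, 0, 0 } };
    int xt[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            xt[j][i] = x[i][j];
    idct_row_column(x);
    idct_col_transpose_col(xt);
    return memcmp(x, xt, sizeof x) == 0;
}
```

paths_agree() checks that both orderings give identical output. The
only extra work in the second path is the one explicit transpose in the
middle, since the first one is absorbed by the scantable.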

> I'm sure you have actually tested both cases and I really want to peek
> at the h264 code that works without transpose, if you still keep it
> around.

There never was a row/column h264 idct in ffmpeg, but you can look at
x264_add8x8_idct8_mmx, which was changed from row/column to
column/transpose/column in x264 r463.

> Loren, can you make simple_mmx even faster? (you would write it
> quicker than I could possibly write h264 inverse transform without
> transpose).

I'll post a patch once it's cleaned up and generalized. As of now it's
x86_64 ssse3 only, and twice as fast as simple_mmx. I'll have to see how
much of that speed depends on pmulhrsw and the extra xmmregs.
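
For reference, a scalar C model of what pmulhrsw does to a single 16-bit
lane, going by the instruction's documented behaviour: a signed multiply
that keeps the product rounded to Q15. The function name is made up for
illustration, and this is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of pmulhrsw: ((a * b >> 14) + 1) >> 1 on signed 16-bit
 * inputs, i.e. a Q15 multiply with rounding. Assumes >> on negative
 * values is an arithmetic shift, as on the usual x86 compilers. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t t = (int32_t)a * b;   /* full 32-bit product */
    t = (t >> 14) + 1;            /* keep one extra bit, add rounding */
    return (int16_t)(t >> 1);     /* drop the extra bit */
}
```

For example, pmulhrsw_lane(16384, 16384) is 0.5 * 0.5 in Q15 and gives
8192. The multiply and the rounding come in one instruction, where
pmulhw would need a separate add and shift to get the same rounding.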

--Loren Merritt
