[FFmpeg-devel] [PATCH] MMX implementation of VC-1 inverse transforms
Wed Jan 16 22:16:30 CET 2008
Considering the amount of rework, mostly caused by my oversight of the
overflows, I'll split the next patches as follows:
- first, i4x4
- then i8x8
- then i4x8 / i8x4
This mail is therefore only for things related to i4x4. I'll start a new
thread if requested to.
Michael Niedermayer wrote:
> you do not need temporary storage
> the butterflies can be implemented like:
I've used this trick as far as I could (which doesn't mean it's far...) in
the current patch.
>> +static void vc1_inv_trans_8x8_mmx(DCTELEM block)
>> + transpose8x8_mmx(block);
> all initial permutations (here a transpose) MUST be merged into the scantable
> all other codecs do this too! vc1 won't become an exception
Pending a decision on how to signal that the zz scantable must be
transposed at loading, I've left the useless transpose in there. It'll
just be a matter of not calling the macro and propagating the new
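
For reference, a minimal sketch of what merging the transpose into the
scantable could look like. The function name and the idea of building a
transposed copy of the table at init time are my assumptions for
illustration, not existing FFmpeg API; it only shows that swapping the row
and column bits of each scan position makes the coefficients land already
transposed, so no transpose is needed at decode time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an FFmpeg function): build a scan table whose
 * positions are transposed, so coefficients are stored pre-transposed. */
void transpose_scantable_8x8(uint8_t dst[64], const uint8_t src[64])
{
    for (int i = 0; i < 64; i++) {
        int pos = src[i];            /* pos = 8 * row + col */
        assert(pos < 64);            /* scan positions must index an 8x8 block */
        /* swap row and column: new pos = 8 * col + row */
        dst[i] = (uint8_t)(((pos & 7) << 3) | (pos >> 3));
    }
}
```

Applying the helper twice gives back the original table, which is a cheap
sanity check.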
>> +#define IDCT4_1D(R0, R1, R2, R3, TMP1, TMP2, TMP3, SHIFT) \
> same as above the multiply can be done before the butterfly and
> thus 1 bias add can be avoided
The solution I came up with to avoid overflow problems ((8*A+B)>>3 = A +
(B>>3)) doesn't seem to allow me such a trick.
This solution has its share of problems:
- forces me to perform the butterflies twice
- waste of mm7, but don't know where to use it
- not very readable...
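
To illustrate the overflow issue, here is a small scalar C sketch (not the
MMX code from the patch). All function names are made up for this example.
It models a 16-bit lane that wraps around like a packed-word add, and
compares forming 8*A+B before the shift with the split form A + (B>>3),
where >>3 is taken as a flooring (arithmetic) shift.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap a value to 16 bits, modelling what a packed-word add leaves in a lane. */
int16_t wrap16(int32_t v)
{
    uint32_t u = (uint32_t)v & 0xFFFFu;
    return (int16_t)(u >= 0x8000u ? (int32_t)u - 0x10000 : (int32_t)u);
}

/* Flooring division by 8, i.e. the result of an arithmetic shift right by 3. */
int32_t floor_shr3(int32_t v)
{
    return v >= 0 ? v / 8 : -((-v + 7) / 8);
}

/* Reference result computed in 32 bits, where nothing can overflow. */
int32_t shr3_exact(int16_t a, int16_t b)
{
    return floor_shr3(8 * (int32_t)a + b);
}

/* Naive form: 8*a + b is built in a 16-bit lane first, so it may wrap. */
int16_t shr3_naive(int16_t a, int16_t b)
{
    return (int16_t)floor_shr3(wrap16(8 * (int32_t)a + b));
}

/* Split form: (8*a + b) >> 3 == a + (b >> 3), no large intermediate. */
int16_t shr3_split(int16_t a, int16_t b)
{
    return wrap16((int32_t)a + floor_shr3(b));
}

/* Sanity check, valid whenever the exact result itself fits in 16 bits. */
void check_split(int16_t a, int16_t b)
{
    assert(shr3_split(a, b) == shr3_exact(a, b));
}
```

With a = 5000 and b = 100 the intermediate 8*a + b is 40100, which does not
fit in a signed 16-bit lane, so the naive form goes wrong while the split
form still matches the 32-bit reference.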
I hope I haven't missed too many obvious optimizations this time...
Currently it clocks at 1339 dezicycles (vs 2100 for the improved C
version), so it's 20% slower than my previous, overflowing version.
Maybe an improved version of the latter could be kept for flags2 fast...