[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions

Christophe GISQUET christophe.gisquet
Sat Oct 13 13:28:22 CEST 2007


Michael Niedermayer wrote:
>> Agreed. However, you trade memory loads/unpacks for potentially worse
>> code parallelism/pairing and size (there are 4 loops unrolled here). I
>> wonder if that'll be a win. I leave that to a later patch.
> you have unrolled the loops in the horizontal direction that also increased
> the code size and instruction pairing is specific to the good old pentium
> it has no relevance today

Figures will put this discussion to rest anyway. For
vc1_put_ver_16b_shift2_mmx, with pmullw used instead of shift+add:
2979 dezicycles in ver, 524174 runs, 114 skips
(compared to ~3300 initially)
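
To illustrate the pmullw versus shift+add trade-off, here is a minimal
scalar sketch in plain C. It is not the code from the patch. It models a
single 16-bit lane and assumes the half-pel taps are (-1, 9, 9, -1); all
function names are hypothetical, and rounding and the final shift are left
out.

```c
#include <stdint.h>

/* Multiply one 16-bit lane by 9 with a shift and an add: (x << 3) + x. */
static uint16_t mul9_shift_add(uint16_t x)
{
    return (uint16_t)((x << 3) + x);
}

/* Same product as a single multiply, keeping only the low 16 bits,
 * which is what pmullw produces for each lane. */
static uint16_t mul9_mullo(uint16_t x)
{
    return (uint16_t)(x * 9u);
}

/* Assumed taps (-1, 9, 9, -1) on four vertically adjacent samples
 * a, b, c, d, computed as 9*(b+c) - (a+d) in wrapping 16-bit math. */
static uint16_t ver_taps_mullo(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return (uint16_t)(mul9_mullo((uint16_t)(b + c)) - (uint16_t)(a + d));
}

/* Same filter with the multiply replaced by shift+add. */
static uint16_t ver_taps_shift_add(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return (uint16_t)(mul9_shift_add((uint16_t)(b + c)) - (uint16_t)(a + d));
}
```

Both variants give identical 16-bit results, so the choice only affects
instruction count and scheduling, which is what the figures above measure.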

Now if, contrary to what your suggestion hinted at, we unroll the
vertical loop:
2633 dezicycles in ver, 524208 runs, 80 skips

Is the 2x code size increase worth the roughly 10% speed-up?
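
For reference, the relative gains can be recomputed from the cycle counts
quoted above. The helper below is a hypothetical sketch, not part of the
patch; it simply expresses the drop in cycles as a percentage of the
starting count.

```c
/* Relative reduction in cycles, in percent:
 * 100 * (before - after) / before. */
static double cycle_reduction_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

With the figures above, cycle_reduction_pct(3300, 2979) is about 9.7
(pmullw instead of shift+add) and cycle_reduction_pct(2979, 2633) is about
11.6 (unrolling the vertical loop).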

All of this can be tested by toggling the "#if 0" block in the
vc1_put_ver_16b_shift2_mmx code or, globally, the VERT_PIPELINE macro.

I also applied your suggestion to the stride==offset case, with
pipelining (unrolled because that is simpler to code):
2162 dezicycles in norm_pipe, 262091 runs, 53 skips
2528 dezicycles in norm, 524200 runs, 88 skips

This ~20% speed-up also comes with a 2x size increase for the
function. Without unrolling, I would guess ~10% and 1.5x code size.

The attached patch makes it possible to test, verify, and report those figures.

Best regards,
Christophe GISQUET
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vc1dsp.diff
Type: text/x-patch
Size: 31938 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071013/35dfe719/attachment.bin>
