[FFmpeg-devel] [PATCH] VP8 MMX optimizations (MC and IDCT dc_add)
Wed Jun 23 09:16:40 CEST 2010
except for the patch being full of hunks I would have described as
I wanted to comment on this:
2010/6/23 Jason Garrett-Glaser <darkshikari at gmail.com>:
> +cglobal put_vp8_epel4_h4_mmxext, 5,5
> + shl r4, 4
> + sub r0, r1
> + mova m4, [fourtap_filter_hw+r4-16] ; set up 4tap filter in words
> + mova m5, [fourtap_filter_hw+r4]
> + mova m7, [pw_64]
> + pxor m6, m6
> + movu m1, [r1-1] ; (ABCDEFGH) load 8 horizontal pixels
> + ; first set of 2 pixels
> + mova m2, m1 ; byte ABCD..
> + punpcklbw m1, m6 ; byte->word ABCD
> + pshufw m0, m2, 9 ; byte CDEF..
> + punpcklbw m0, m6 ; byte->word CDEF
> + pshufw m3, m1, 0x94 ; word ABBC
> + pshufw m1, m0, 0x94 ; word CDDE
> + pmaddwd m3, m4 ; multiply 2px with F0/F1
> + mova m0, m1 ; backup for second set of pixels
> + pmaddwd m1, m5 ; multiply 2px with F2/F3
> + paddd m3, m1 ; finish 1st 2px
The vc1 mc code uses non-saturating (wrapping) word arithmetic, and thus
avoids intermediate results in dwords.
I may try to bench what this alternate implementation would bring to
that part of the vp8 mc patch.
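
To make the word-vs-dword point concrete, here is a rough scalar C model of the
two approaches. It is only a sketch, not the vc1 or vp8 code: the function names
are made up for illustration, the taps are the second sixtap_filter row quoted
further down (the hunk above is the 4-tap case, whose tap values are not shown
here), and 64 stands in for pw_64. The first function keeps the sum in 32 bits,
as pmaddwd/paddd do; the second keeps it in 16 bits with wraparound, as
pmullw/paddw do.

```c
#include <stdint.h>

/* Taps copied from the quoted sixtap_filter row; they sum to 128. */
static const int16_t taps[6] = { 3, -16, 77, 77, -16, 3 };

/* Exact sum with 32-bit intermediates (models the pmaddwd/paddd path). */
static int32_t filter_dword(const uint8_t *px)
{
    int32_t acc = 64;                       /* rounding term, cf. pw_64 */
    for (int i = 0; i < 6; i++)
        acc += taps[i] * px[i];
    return acc;
}

/* Same sum kept in 16 bits, wrapping modulo 65536 (models pmullw/paddw). */
static uint16_t filter_word_wrap(const uint8_t *px)
{
    uint16_t acc = 64;
    for (int i = 0; i < 6; i++)
        acc = (uint16_t)(acc + (uint16_t)(taps[i] * px[i]));
    return acc;
}

/* Read a 16-bit pattern as a signed word without implementation-defined casts. */
static int32_t as_signed16(uint16_t v)
{
    return v >= 0x8000 ? (int32_t)v - 0x10000 : (int32_t)v;
}
```

For six pixels of 255 the running word sum passes through 36019, which is out
of signed word range, yet as_signed16(filter_word_wrap(px)) still gives 32704,
the same as filter_dword(px). The low 16 bits always match the exact sum; the
signed reading is only right when the exact final sum itself fits in a signed
word. That fails for e.g. {255, 0, 255, 255, 0, 255}, where the exact sum is
40864, so the input range and summation order still need checking per filter.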
Also, this avoids code size increase, but when considering this:
> +sixtap_filter: dw 2, -11, 108, 36, -8, 1, \
> + 3, -16, 77, 77, -16, 3, \
There seem to be twice as many pmullw/... done as necessary.