[FFmpeg-devel] [PATCH] vp9: 16bpp tm/dc/h/v intra pred simd (mostly sse2) functions.
Henrik Gramner
henrik at gramner.com
Fri Oct 2 23:31:53 CEST 2015
On Fri, Sep 25, 2015 at 11:24 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> +++ b/libavcodec/x86/vp9intrapred_16bpp.asm
> +cglobal vp9_ipred_v_4x4_16, 2, 4, 1, dst, stride, l, a
> +cglobal vp9_ipred_v_8x8_16, 2, 4, 1, dst, stride, l, a
> +cglobal vp9_ipred_v_16x16_16, 2, 4, 2, dst, stride, l, a
> +cglobal vp9_ipred_v_32x32_16, 2, 4, 4, dst, stride, l, a
Those look pretty generic. Aren't some of the H.264 pred functions very
similar, if not identical? I didn't check, but if they are, you can just
use those instead.
> +cglobal vp9_ipred_h_8x8_16, 3, 4, 5, dst, stride, l, a
That seemed a bit inefficient, so I rewrote it. Around 2x as fast and uses fewer regs:
cglobal vp9_ipred_h_8x8_16, 3, 3, 4, dst, stride, l, a
    mova                    m2, [lq]
    DEFINE_ARGS dst, stride, stride3
    lea               stride3q, [strideq*3]
    punpckhwd               m3, m2, m2
    pshufd                  m0, m3, q3333
    pshufd                  m1, m3, q2222
    mova      [dstq+strideq*0], m0
    mova      [dstq+strideq*1], m1
    pshufd                  m0, m3, q1111
    pshufd                  m1, m3, q0000
    mova      [dstq+strideq*2], m0
    mova      [dstq+stride3q ], m1
    lea                   dstq, [dstq+strideq*4]
    punpcklwd               m2, m2
    pshufd                  m0, m2, q3333
    pshufd                  m1, m2, q2222
    mova      [dstq+strideq*0], m0
    mova      [dstq+strideq*1], m1
    pshufd                  m0, m2, q1111
    pshufd                  m1, m2, q0000
    mova      [dstq+strideq*2], m0
    mova      [dstq+stride3q ], m1
    RET
> +cglobal vp9_ipred_h_16x16_16, 3, 4, 6, dst, stride, l, a
> +cglobal vp9_ipred_h_32x32_16, 3, 5, 8, dst, stride, l, a
Should be possible to change those to be more similar to the 8x8 above.
> +cglobal vp9_ipred_dc_4x4_16, 4, 4, 2, dst, stride, l, a
[...]
> + pshufw m1, m0, q3232
> + paddd m0, m1
> + paddd m0, [pd_4]
Swap the last two instructions so that the shuffle and the pd_4 add can
execute in parallel. The same issue exists in pretty much every other
dc function as well.
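To illustrate the reordering, here is a minimal sketch that uses only the
three instructions quoted above; nothing else in the function is assumed to
change, and pd_4 is the constant the patch already references:

```asm
    pshufw              m1, m0, q3232   ; copy of m0's upper dword, needs only the old m0
    paddd               m0, [pd_4]      ; rounding add, does not wait for the shuffle
    paddd               m0, m1          ; combine once both inputs are ready
```

The adds commute, so the result is unchanged. The only difference is that the
final paddd is the single instruction that has to wait for pshufw, instead of
two dependent adds queuing up behind it.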
> +cglobal vp9_ipred_dc_32x32_16, 4, 4, 2, dst, stride, l, a
[...]
> +.loop:
> + mova [dstq+strideq*0+ 0], m0
> + mova [dstq+strideq*0+16], m0
> + mova [dstq+strideq*0+32], m0
> + mova [dstq+strideq*0+48], m0
> + mova [dstq+strideq*1+ 0], m0
> + mova [dstq+strideq*1+16], m0
> + mova [dstq+strideq*1+32], m0
> + mova [dstq+strideq*1+48], m0
> + mova [dstq+strideq*2+ 0], m0
> + mova [dstq+strideq*2+16], m0
> + mova [dstq+strideq*2+32], m0
> + mova [dstq+strideq*2+48], m0
> + mova [dstq+stride3q + 0], m0
> + mova [dstq+stride3q +16], m0
> + mova [dstq+stride3q +32], m0
> + mova [dstq+stride3q +48], m0
> + lea dstq, [dstq+strideq*4]
> + dec cntd
> + jg .loop
Cut the number of stores per iteration in half and double the number
of iterations instead.
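A rough sketch of what the halved loop could look like, reusing the register
names from the quoted code. The loop counter setup is elided in the quote, so
it only appears as a comment here: a 32x32 block has 32 rows, so at two rows
per iteration cntd would have to start at 16.

```asm
    ; assumption: cntd is initialized to 16 before the loop (32 rows / 2 per iteration)
.loop:
    mova   [dstq+strideq*0+ 0], m0
    mova   [dstq+strideq*0+16], m0
    mova   [dstq+strideq*0+32], m0
    mova   [dstq+strideq*0+48], m0
    mova   [dstq+strideq*1+ 0], m0
    mova   [dstq+strideq*1+16], m0
    mova   [dstq+strideq*1+32], m0
    mova   [dstq+strideq*1+48], m0
    lea                   dstq, [dstq+strideq*2]   ; advance two rows instead of four
    dec                   cntd
    jg .loop
```

As a side effect, the loop no longer uses stride3q, so that register is
probably not needed here anymore.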
> +cglobal vp9_ipred_dc_%1_32x32_16, 4, 4, 2, dst, stride, l, a
[...]
> +.loop:
> + mova [dstq+strideq*0+ 0], m0
> + mova [dstq+strideq*0+16], m0
> + mova [dstq+strideq*0+32], m0
> + mova [dstq+strideq*0+48], m0
> + mova [dstq+strideq*1+ 0], m0
> + mova [dstq+strideq*1+16], m0
> + mova [dstq+strideq*1+32], m0
> + mova [dstq+strideq*1+48], m0
> + mova [dstq+strideq*2+ 0], m0
> + mova [dstq+strideq*2+16], m0
> + mova [dstq+strideq*2+32], m0
> + mova [dstq+strideq*2+48], m0
> + mova [dstq+stride3q + 0], m0
> + mova [dstq+stride3q +16], m0
> + mova [dstq+stride3q +32], m0
> + mova [dstq+stride3q +48], m0
> + lea dstq, [dstq+strideq*4]
> + dec cntd
> + jg .loop
Ditto.
> +cglobal vp9_ipred_tm_4x4_10, 4, 4, 6, dst, stride, l, a
[...]
> + movd m0, [aq-2]
> + pshufw m0, m0, q0000
Unaligned load penalty here. Either movd from [aq-4] or pshufw directly from
[aq-8], adjusting the shuffle immediate to pick the last word in either case.
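Both alternatives, sketched with the registers from the quoted code. The pixel
being broadcast is the 16-bit word at [aq-2], so it ends up as the last word of
whatever gets loaded, and the shuffle immediate has to select that word. This
assumes aq itself is at least 8-byte aligned, which I have not verified for
every caller.

```asm
    ; option 1: aligned 4-byte load, wanted pixel is word 1
    movd                m0, [aq-4]
    pshufw              m0, m0, q1111

    ; option 2: shuffle straight from memory, wanted pixel is word 3
    pshufw              m0, [aq-8], q3333
```

Option 2 saves an instruction, since the MMX pshufw reads its 8-byte source
directly from memory.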
> +cglobal vp9_ipred_tm_8x8_10, 4, 4, 8, dst, stride, l, a
[...]
> + movd m0, [aq-2]
> + pshuflw m0, m0, q0000
Ditto, except that, unlike with MMX, you don't want to pshuflw directly
from memory in this case. If you want to write AVX2, you can use
vpbroadcastw instead, though. This issue exists in multiple other places
as well.
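A sketch of the two variants, again with the registers from the quoted code.
As far as I understand it, pshuflw from memory is unattractive here because a
non-VEX pshuflw with a memory operand reads a full 16 bytes and requires them
to be 16-byte aligned. The AVX2 line assumes the function is built under an
avx2 INIT_XMM/INIT_YMM block, which the patch does not have yet.

```asm
    ; SSE2: aligned 4-byte load, then broadcast word 1 across the low four words
    movd                m0, [aq-4]
    pshuflw             m0, m0, q1111

    ; AVX2: broadcast the 16-bit pixel to every word lane in one instruction
    vpbroadcastw        m0, [aq-2]
```

The SSE2 variant only fills the low four words, same as the quoted code, so
whatever follows it in the function to fill the upper half stays as it is.
vpbroadcastw loads just the two bytes at [aq-2], so there is no wider load to
misalign.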
> + pshufhw m0, m4, q3333
> + pshufhw m1, m4, q2222
> + pshufhw m2, m4, q1111
> + pshufhw m3, m4, q0000
> + punpckhqdq m0, m0
> + punpckhqdq m1, m1
> + punpckhqdq m2, m2
> + punpckhqdq m3, m3
Use punpckhwd + pshufd instead, same as in vp9_ipred_h_8x8_16 above.
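Something along these lines, as a sketch. m5 is a hypothetical scratch
register picked for illustration only; use whichever register is actually free
at that point, or do the unpack in place if m4 is not needed afterwards.

```asm
    punpckhwd           m5, m4, m4      ; m5 = w4 w4 w5 w5 w6 w6 w7 w7
    pshufd              m0, m5, q3333   ; m0 = w7 in all eight words
    pshufd              m1, m5, q2222   ; m1 = w6 in all eight words
    pshufd              m2, m5, q1111   ; m2 = w5 in all eight words
    pshufd              m3, m5, q0000   ; m3 = w4 in all eight words
```

That gives the same m0..m3 as the pshufhw/punpckhqdq pairs with five
instructions instead of eight (six in SSE2, where the 3-operand punpckhwd
expands to a mova plus the unpack).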
Otherwise OK.