[FFmpeg-devel] [patch][OpenHEVC]added ASM functions for epel + qpel
Ronald S. Bultje
rsbultje at gmail.com
Sat Mar 8 13:54:02 CET 2014
Hi,
> +cglobal hevc_put_hevc_epel_hv12_8, 7, 11, 12 , dst, dststride, src, srcstride, height, mx, my, r3src, tsrc, rfilter
[..]
> +.loop
> + EPEL_LOAD 8, srcq, 1, 12
> + EPEL_COMPUTE 8, 6, m14, m15
> + SWAP m4, m0
> + lea tsrcq, [srcq + srcstrideq]
> + EPEL_LOAD 8, tsrcq, 1, 12
> + EPEL_COMPUTE 8, 6, m14, m15
> + SWAP m5, m0
> + lea tsrcq, [tsrcq + srcstrideq]
> + EPEL_LOAD 8, tsrcq, 1, 12
> + EPEL_COMPUTE 8, 6, m14, m15
> + SWAP m6, m0
> + lea tsrcq, [tsrcq + srcstrideq]
> + EPEL_LOAD 8, tsrcq, 1, 12
> + EPEL_COMPUTE 8, 6, m14, m15
> + SWAP m7, m0
> + punpcklwd m0, m4, m5
> + punpckhwd m1, m4, m5
> + punpcklwd m2, m6, m7
> + punpckhwd m3, m6, m7
> + EPEL_COMPUTE 14, 8, m12, m13
> + PEL_STORE8 dstq, m0, m1
[.. that again for next 4 pixels ..]
> + LOOP_END dst, dststride, src, srcstride
> + RET
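So, this is going to be _hugely_ inefficient, right? You're basically
redoing all 4 horizontal passes for each output line (i.e. 4*n_lines
horizontal passes in total), rather than 3+n_lines if you kept the 3
previously filtered rows around and only filtered the 1 new row per line.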
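To illustrate the 3+n_lines count, here is a rough C sketch of a separable 4-tap hv filter that keeps the last 4 horizontally filtered rows in a small ring buffer. It is only a model of the idea, not the patch's code: hfilter_row, epel_hv_cached and MAX_W are made-up names, strides are in elements, and the 8-bit rounding (no shift after the horizontal pass, >> 6 after the vertical pass) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_W 64 /* hypothetical upper bound on block width */

/* Hypothetical 4-tap horizontal filter over one row (taps at x-1 .. x+2). */
static void hfilter_row(int16_t *out, const uint8_t *src, int width,
                        const int8_t f[4])
{
    for (int x = 0; x < width; x++)
        out[x] = (int16_t)(f[0] * src[x - 1] + f[1] * src[x] +
                           f[2] * src[x + 1] + f[3] * src[x + 2]);
}

/* Returns the number of horizontal passes performed (3 + height).
 * src must be readable from row -1 to row height+1 and from
 * column -1 to column width+1. */
static int epel_hv_cached(int16_t *dst, ptrdiff_t dststride,
                          const uint8_t *src, ptrdiff_t srcstride,
                          int width, int height,
                          const int8_t fh[4], const int8_t fv[4])
{
    int16_t rows[4][MAX_W]; /* ring buffer of horizontally filtered rows */
    int passes = 0;

    assert(width <= MAX_W);

    /* Prime the ring buffer with source rows -1, 0 and 1. */
    for (int i = 0; i < 3; i++) {
        hfilter_row(rows[i], src + (i - 1) * srcstride, width, fh);
        passes++;
    }

    for (int y = 0; y < height; y++) {
        /* The only new horizontal pass for this output line: row y+2.
         * It overwrites the slot of row y-2, which is no longer needed. */
        hfilter_row(rows[(y + 3) & 3], src + (y + 2) * srcstride, width, fh);
        passes++;

        const int16_t *r0 = rows[y & 3];       /* source row y-1 */
        const int16_t *r1 = rows[(y + 1) & 3]; /* source row y   */
        const int16_t *r2 = rows[(y + 2) & 3]; /* source row y+1 */
        const int16_t *r3 = rows[(y + 3) & 3]; /* source row y+2 */

        /* Vertical 4-tap pass over the cached rows. */
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((fv[0] * r0[x] + fv[1] * r1[x] +
                                fv[2] * r2[x] + fv[3] * r3[x]) >> 6);
        dst += dststride;
    }
    return passes;
}
```

For an 8-line block this does 11 horizontal passes, where redoing all 4 per output line would do 32. In asm the ring buffer would be registers rotated with SWAP rather than an array.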
I can only imagine that you're doing that because you may not have enough
registers to cache 8+4 pixels (to make 12 in total), but really, if that's
the case, just write a C wrapper that calls the 8-wide and the 4-wide
versions back to back. That'll be tons faster than this.
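A rough sketch of what such a wrapper could look like for the 8-bit case. Everything here is hypothetical: the epel_hv_fn signature, the wrapper name and the two stubs are stand-ins so the example compiles on its own, and the real asm functions may well take different arguments (e.g. byte strides).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for a fixed-width hv function; strides in elements. */
typedef void (*epel_hv_fn)(int16_t *dst, ptrdiff_t dststride,
                           const uint8_t *src, ptrdiff_t srcstride,
                           int height, int mx, int my);

/* 12-wide block = 8-wide call on columns 0..7 + 4-wide call on columns 8..11. */
static void epel_hv12_wrapper(epel_hv_fn hv8, epel_hv_fn hv4,
                              int16_t *dst, ptrdiff_t dststride,
                              const uint8_t *src, ptrdiff_t srcstride,
                              int height, int mx, int my)
{
    hv8(dst,     dststride, src,     srcstride, height, mx, my);
    hv4(dst + 8, dststride, src + 8, srcstride, height, mx, my);
}

/* Stand-in for the real filters: scales each pixel by 64, no filtering. */
static void stub_copy(int16_t *dst, ptrdiff_t dststride,
                      const uint8_t *src, ptrdiff_t srcstride,
                      int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[y * dststride + x] = (int16_t)(src[y * srcstride + x] << 6);
}

/* Stubs playing the role of the 8-wide and 4-wide asm functions. */
static void stub_hv8(int16_t *dst, ptrdiff_t dststride,
                     const uint8_t *src, ptrdiff_t srcstride,
                     int height, int mx, int my)
{
    (void)mx; (void)my;
    stub_copy(dst, dststride, src, srcstride, 8, height);
}

static void stub_hv4(int16_t *dst, ptrdiff_t dststride,
                     const uint8_t *src, ptrdiff_t srcstride,
                     int height, int mx, int my)
{
    (void)mx; (void)my;
    stub_copy(dst, dststride, src, srcstride, 4, height);
}
```

With the real functions plugged in for the stubs, each half keeps its own row cache in registers, so neither call has to redo the horizontal passes.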
> +cglobal hevc_put_hevc_epel_hv12_10, 7, 11, 12 , dst, dststride, src, srcstride, height, mx, my, r3src, tsrc, rfilter
Same comment for this one.
Ronald