[FFmpeg-devel] [PATCH 3/4] x86/vvcdec: inter, add optical flow avx2 code

Sun Aug 18 06:19:09 EEST 2024

On 8/17/2024 10:48 PM, Nuo Mi wrote:
> +    pxor                    m6, m6
> +    phaddw                 m%2, m6
> +    phaddw                 m%2, m6

Horizontal adds are slow. Can't you do this with normal adds, shifts, and
a blend?
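
To make the idea concrete, here is a rough scalar C model of one 128-bit
lane. It is only a sketch, not tested against the patch. The helper names
(lane_t, psrlq_words, sum4_shift_add and so on) are made up for
illustration. It compares the two phaddw-against-zero steps with two
psrlq + paddw steps. Both produce the same wrapping sums of four words,
but the shift version leaves each sum in the low word of its qword (words
0 and 4) instead of words 0 and 1, so the shuffles that follow would have
to be adjusted, or a blend used to gather them.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit lane modelled as eight 16-bit words (hypothetical helper type). */
typedef struct { uint16_t w[8]; } lane_t;

/* Model of "phaddw a, b" within one lane: wrapping sums of adjacent words. */
static lane_t phaddw(lane_t a, lane_t b)
{
    lane_t r;
    for (int i = 0; i < 4; i++) {
        r.w[i]     = (uint16_t)(a.w[2 * i] + a.w[2 * i + 1]);
        r.w[i + 4] = (uint16_t)(b.w[2 * i] + b.w[2 * i + 1]);
    }
    return r;
}

/* Model of psrlq by 16*n bits: move words down by n inside each qword. */
static lane_t psrlq_words(lane_t a, int n)
{
    lane_t r;
    assert(n >= 0 && n <= 4);
    for (int q = 0; q < 2; q++)
        for (int i = 0; i < 4; i++)
            r.w[4 * q + i] = (i + n < 4) ? a.w[4 * q + i + n] : 0;
    return r;
}

/* Model of paddw: element-wise wrapping 16-bit add. */
static lane_t paddw(lane_t a, lane_t b)
{
    lane_t r;
    for (int i = 0; i < 8; i++)
        r.w[i] = (uint16_t)(a.w[i] + b.w[i]);
    return r;
}

/* What the quoted code does: two horizontal adds against a zero register.
 * The sums of words 0..3 and 4..7 end up in words 0 and 1. */
static lane_t sum4_phaddw(lane_t v)
{
    lane_t zero = {{0}};
    v = phaddw(v, zero);
    return phaddw(v, zero);
}

/* Shift-and-add alternative: the same sums end up in words 0 and 4.
 * The other words hold partial sums and must be ignored or masked. */
static lane_t sum4_shift_add(lane_t v)
{
    v = paddw(v, psrlq_words(v, 1));
    v = paddw(v, psrlq_words(v, 2));
    return v;
}
```

For the input words {1, 2, 3, 4, 10, 20, 30, 40}, both versions give the
sums 10 and 100, only in different word positions. Whether this is
actually faster in the real function would have to be benchmarked.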

> +    vpermq                 m%2, m%2, q0020
> +    pshufd                 m%2, m%2, q1120
> +    pmovsxwd               m%2, xmm%2               ; 4 sgxgy
> +
> +    pmulld                 m%2, m11                 ; 4 vx * sgxgy

Similarly, pmulld is very slow (ten cycles on some architectures), and
that's on top of a pmovsx.
Since you already have m6 zeroed, wouldn't pmaddwd work here? The pd_15
and pd_m15 constants would need to be changed to words, as would the
values to be clipped.
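
As a sanity check on the arithmetic, here is a small scalar C model of one
dword element. It is a sketch under assumptions: the sgxgy words are taken
to be interleaved with zero words (for example by unpacking against the
zeroed m6), and vx is taken to be already clipped to [-15, 15] and stored
as words. The function names are made up for illustration. pmaddwd
computes lo*lo + hi*hi per dword with signed 16-bit inputs, so a zero high
word leaves exactly the sign-extended product that pmovsxwd + pmulld
would give.

```c
#include <stdint.h>

/* One dword element of pmovsxwd followed by pmulld:
 * sign-extend the word, then do a 32-bit multiply. */
static int32_t mul_via_pmovsx_pmulld(int16_t s, int32_t vx)
{
    return (int32_t)s * vx;
}

/* One dword element of pmaddwd: signed 16-bit products, summed as 32 bits. */
static int32_t pmaddwd_elem(int16_t a_lo, int16_t a_hi,
                            int16_t b_lo, int16_t b_hi)
{
    return (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

/* Assumed layout: data word in the low half, zero in the high half.
 * The high word of the vx operand is then irrelevant. */
static int32_t mul_via_pmaddwd(int16_t s, int16_t vx)
{
    return pmaddwd_elem(s, 0, vx, vx);
}

/* Returns 1 if both forms agree for every 16-bit s and every vx in [-15, 15]. */
static int forms_agree(void)
{
    for (int vx = -15; vx <= 15; vx++)
        for (int s = INT16_MIN; s <= INT16_MAX; s++)
            if (mul_via_pmaddwd((int16_t)s, (int16_t)vx) !=
                mul_via_pmovsx_pmulld((int16_t)s, vx))
                return 0;
    return 1;
}
```

Both forms agree over that whole range, since the largest product
(32768 * 15 in magnitude) is far from overflowing 32 bits.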

> +    psrad                  m%2, 1