[FFmpeg-devel] [PATCH]v5 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Sun Jul 23 03:47:45 EEST 2017

On 22 July 2017 at 12:18, Ivan Kalvachev <ikalvachev at gmail.com> wrote:

> This patch is ready for review and inclusion.
>
>
>
>+%macro emu_blendvps 3 ; dst/src_a, src_b, mask
>+%macro lea_pic_base 2; reg, const_label
Capitalize macro names.

>+      %error sse41 blendvps uses xmm0 as default 3d operand, you used %3
Remove this error

>+    %error "blendvps" emulation needs SSE
Definitely remove this error too.

>+        %error AVX1 does not support integer 256bit ymm operations
And this one is particularly pointless...

>+      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
Same...

>+    %error "broadcastss" emulation needs SSE
Same...

>+    %error "pbroadcastd" emulation needs SSE2
Yep, the same...

>+    %error pick HSUMPS implementation
Again...

All of these are pointless and unnecessary. Always assume at least SSE2.

>+
>+; 256bit version seems slower than the 128bit AVX on all current CPU's
>+; So disable it for now and keep the code in case something changes in
future.
>+%if HAVE_AVX2_EXTERNAL > 0 && 0
>+INIT_YMM avx2
>+PVQ_FAST_SEARCH
>+%endif

Remove this altogether.

>+%if 1
>+    emu_pbroadcastd m1,   xmm1
...
>+%else
>+        ; Ryzen is the only CPU at the moment that doesn't have
>+        ; write forwarding stall and where this code is faster
>+        ; with about 7 cycles on avarage.
>+        %{1}ps      m5,   mm_const_float_1
>+        movss       [tmpY + r4q], xmm5      ; Y[max_idx] +-= 1.0;

Remove the %else and always use the first bit of code.

>+%if cpuflag(avx2) && 01
>+%elif cpuflag(avx) && 01
>+%if cpuflag(avx2) && 01

Remove the && 01 in these 3 places.

>+;      vperm2f128   %1,   %1,   %1,   0 ; ymm, ymm, ymm     ; 2-3c 1/1

Remove this commented out code.

>+cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N
>+%if ARCH_X86_64 == 0    ;sbrdsp

You're using more than 11 xmm regs, so the entire code is always going to
need a 64 bit system.
So wrap the entire file, from top to bottom after the %include in
%if ARCH_X86_64
<everything>
%endif

>+SECTION_RODATA 64
>+
>+const_float_abs_mask:   times 8 dd 0x7fffffff
>+const_align_abs_edge:   times 8 dd 0
>+
>+const_float_0_5:        times 8 dd 0.5
>+const_float_1:          times 8 dd 1.0
>+const_float_sign_mask:  times 8 dd 0x80000000
>+
>+const_int32_offsets:
>+                        %rep 8
>+                                dd $-const_int32_offsets
>+                        %endrep
>+SECTION .text

This whole thing needs to go right at the very top after the %include. All
macros must then follow it.
Having read only sections in the middle of the file is very confusing and
not at all what all of our asm code follows.