[FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Ivan Kalvachev ikalvachev at gmail.com
Wed Aug 2 00:46:20 EEST 2017


On 7/31/17, Henrik Gramner <henrik at gramner.com> wrote:
> On Wed, Jul 26, 2017 at 4:56 PM, Ivan Kalvachev <ikalvachev at gmail.com>
> wrote:
>> +++ b/libavcodec/x86/opus_pvq_search.asm
>
> Generic minor stuff:
>
> Use rN instead of rNq for numbered registers (q suffix is used for
> named args only due to preprocessor limitations).

Done.

Is this documented?

> Use the same "standard" vertical alignment rules as most existing
> code, e.g. instructions indented by 4 spaces and operands aligned on
> the first comma.

What do you mean by operands aligned on the first comma?
The longest operand I had is "xmm0"; counting the comma and space,
I get a 6-position alignment for operands.
Now, with "xm7" I can lower this to 5 positions. Should I do that?
(I don't have "xm15".)

Also, I've aligned the operands at 12 positions after the
instruction start, because this is the length of the longest
real instruction.

As for why I start instructions at the 8th position:
I've allocated the field at positions 4-7 for instruction
prefixes, and "EMU_" is 4 positions.

Now I have:
12345678____________12345_12345_12345_
    EMU_broadcastss ym13, xm10
    EMU_blendvps    xm1,  xm2,  m3
       vblendvps
        blendvps

I can ditch the prefix and shorten the operands, e.g.:
1234_____________1234_1234_1234_
    VBROADCASTSS ym1, xm1
    BLENDVPS     m1,  m2,  m3

Is that acceptable? Or maybe you mean:

1234_____________1234_1234_123
    VBROADCASTSS ym1, xm1
    BLENDVPS      m1, m2, m3

However, I would prefer to use:
1234_____________1234_1234_123
    VBROADCASTSS ym1, xm1
    BLENDVPS      m1,  m2,  m3
    PBLENDVD     xm1, xm2, xm3

(You need a fixed-width font to see that correctly.)

I'll wait for a reply before doing that.

> Use xmN instead of xmmN (only really makes a difference when SWAP:s
> are used, but might as well do it "correctly").

Done.

> Use 32-bit operands, e.g. rNd, when 64-bit math isn't required.

Done.

> Unless aligning every single loop entry helps a lot I'd avoid it since
> it does waste a bit of icache.

I'll redo my benchmarks, but I placed these aligns for a reason.
At some point, removing debug instructions started making my code
slower, so I placed the aligns to avoid such random slowdowns.

> Explicitly using the carry flag as a branch condition is a bit weird.
> Generally using jg/jl/jge/jle tends to be clearer.

I use it to take advantage of so-called macro-fusion.
It allows a cmp and the following jcc to be fused, as long as
the branch checks only one flag, so all signed branches are to be
avoided (as stated by the Intel optimization manual).
The newer the CPU, the more opcodes (add/sub) can be fused
and the fewer the limitations.
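
A minimal illustration (hypothetical loop, registers chosen just for the example):

        cmp         r4d,  r5d
        jnc         %%loop              ; unsigned condition, tests CF only: can macro-fuse with cmp

        cmp         r4d,  r5d
        jl          %%loop              ; signed condition, tests SF and OF: may not fuse on older cores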

>> +%ifdef __NASM_VER__
>> +%use "smartalign"
>> +ALIGNMODE p6
>> +%endif
>
> Assembler-specific ifdeffery should be avoided in individual files.
> Something equivalent already exists in x86inc actually but I don't
> remember if it was merged to FFmpeg from upstream (x264) yet.

Definitely not merged.

I hear a lot about the improved x86inc;
maybe it is time for you to merge it into FFmpeg?


>> +const_int32_offsets:
>> +                        %rep 8
>> +                                dd $-const_int32_offsets
>> +                        %endrep
>
> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?

Yes.
Do you know a way that works with "times 8"?
I've done it this way to make it easy to change the size of the constant.

Do you want me to change it?
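
(For reference, an %assign-based variant that keeps the element size in one
obvious place, without relying on "$", would be something like:)

%assign i 0
%rep 8
        dd i*4              ; 4 = size of one table element
  %assign i i+1
%endrep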

>> +SECTION .text
>> +
>> +
>> +
>> +
>
> Reduce some of the excessive whitespace.

Will do with the other cosmetics.

>> +%macro HSUMPS 2 ; dst/src, tmp
>> +%if cpuflag(avx)
>> +  %if sizeof%1>=32  ; avx
>> +       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
>> +       vaddps       %1,   %2
>> +  %endif
>> +       vshufps      %2,   %1,   %1,   q1032
>> +       vaddps       %1,   %2
>> +       vshufps      %2,   %1,   %1,   q0321
>> +       vaddps       %1,   %2
>> +
>> +%else  ; this form is a bit faster than the short avx-like emulation.
>> +        movaps      %2,   %1            ;    [d,       c,       b,       a       ]
>> +        shufps      %1,   %1,   q1032   ; %2=[b,       a,       d,       c       ]
>> +        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a     ]
>> +        movaps      %2,   %1
>> +        shufps      %1,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
>> +        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
>> +        ; all %1 members should be equal for as long as float a+b==b+a
>> +%endif
>> +%endmacro
>
> Is reordering moves for the non-AVX path worth the additional
> complexity? Microoptimizations that only affect legacy hardware are
> IMO a bit questionable.

Yes, I'm on "legacy" hardware and the improvement is reliably measurable.

>> +%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
>> +%if cpuflag(avx)
>> +  %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
>> +        %error AVX1 does not support integer 256 bit ymm operations
>> +  %endif
>> +
>> +       vpblendvb    %1,   %1,   %2,   %3
>> +;-------------------------
>> +%elif cpuflag(sse4)
>> +    %ifnidn %3,xmm0
>> +      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
>> +    %endif
>> +        pblendvb    %1,   %2,   %3
>> +;-------------------------
>> +%else
>> +        pxor        %2,   %1
>> +        pand        %2,   %3
>> +        pxor        %1,   %2
>> +%endif
>> +%endmacro
>
> The AVX1 integer warning is kind of silly and doesn't really serve any
> purpose.

As I've said in this thread before,
if you use AVX1 ymm with this macro
you won't get any warning that an AVX2 opcode has been generated,
because x86inc does not overload AVX2.

I've tested it.

(And again, please merge the improved AVX2-aware x86inc.)

>> +%macro EMU_broadcastss 2 ; dst, src
>> +%if cpuflag(avx2)
>> +       vbroadcastss %1,   %2            ; ymm, xmm
>> +;-------------------------
>> +%elif cpuflag(avx)
>> +  %ifnum sizeof%2       ; avx1 register
>> +       vpermilps    xmm%1, xmm%2, q0000 ; xmm, xmm, imm || ymm, ymm, imm
>> +    %if sizeof%1 >= 32  ; mmsize>=32
>> +       vinsertf128  %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im
>> +    %endif
>> +  %else                 ; avx1 memory
>> +       vbroadcastss %1,   %2            ; ymm, mm32 || xmm, mm32
>> +  %endif
>> +;-------------------------
>> +%else
>> +  %ifnum sizeof%2       ; sse register
>> +        shufps      %1,   %2,   %2,   q0000
>> +  %else                 ; sse memory
>> +        movss       %1,   %2            ; this zeroes the other 3 elements
>> +        shufps      %1,   %1,   0
>> +  %endif
>> +%endif
>> +%endmacro
>
> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
> something.

The x86util VBROADCASTSS macro implements only
the AVX1 form, i.e. memory to register.
My variant emulates the AVX2 instruction,
which also does register to register.

My variant also avoids the write dependency of "movss reg, reg".
("movss reg, mem" clears the other elements, but
"movss reg, reg" preserves them, which creates a dependency on the
old values of the register.)
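
(A small illustration of the difference; the source register/address here is arbitrary:)

        movss       xm1,  [r2]          ; element 0 loaded, elements 1-3 zeroed:
                                        ; no dependency on the old value of xm1
        movss       xm1,  xm2           ; only element 0 written, elements 1-3 of xm1
                                        ; are preserved, so the result depends on the
                                        ; previous value of xm1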

And yes, I did ask whether I should separate
these macros into another file and whether
I should try to merge them into x86util.asm first.

What do you think on the matter?

> +%macro EMU_pbroadcastd 2 ; dst, src
> +%if cpuflag(avx2)
> +       vpbroadcastd %1,   %2
> +%elif cpuflag(avx) && mmsize >= 32
> +        %error AVX1 does not support integer 256 bit ymm operations
> +%else
> +  %ifnum sizeof%2       ; sse2 register
> +        pshufd      %1,   %2,   q0000
> +  %else                 ; sse memory
> +        movd        %1,   %2            ; movd zero extends
> +        pshufd      %1,   %1,   0
> +  %endif
> +%endif
> +%endmacro
>
> Same as above, but using the SPLATD macro.

Again, SPLATD does not handle avx2.
Also, I do think it is better to use the name of an existing
instruction instead of a made-up one.


>> +; Merge parallel maximums final round (1 vs 1)
>> +        movaps      xmm0, xmm3          ; m0 = m3[0] = p[0]
>> +        shufps      xmm3, xmm3, q1111   ; m3 = m3[1] = p[1]
>> +        cmpless     xmm0, xmm3
>
> Use 3-arg instead of reg-reg movaps.

Do you have this in mind:

        shufps      xm0,  xm3,  xm3,  q1111
        cmpnless    xm0,  xm3

(It seems to work fine.)
Done.

> Use the actual cmpss instruction instead of pseudo-instructions.
> x86inc doesn't support automatic VEX handling for every possible
> pseudo-instruction (mainly because there's so many of them) which will
> result in AVX state transitions if INIT_YMM is used.

Not according to the Intel Architecture Code Analyzer
and the benchmarks I've got.
The benchmarks showed that using cmpps is about 5% slower on Ryzen
and exactly the same speed on Skylake.
That's with AVX2 ymm.

Still, I'll change it.
Done.
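
(For reference, the explicit-immediate form of the pseudo-op above would be
something like:)

        cmpss       xm0,  xm3,  6       ; 6 = NLE predicate, i.e. the cmpnless pseudo-op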


>> +%macro SET_PIC_BASE 3; reg, const_label
>> +%ifdef PIC
>> +        %{1}        %2,   [%3]    ; lea r5q, [rip+const]
>> +        %define     pic_base_%3 %2
>> +%else
>> +        %define     pic_base_%3 %3
>> +%endif
>> +%endmacro
> [...]
>> +        SET_PIC_BASE lea, r5q, const_align_abs_edge    ; rip+const
>> +        movups      m3,   [pic_base_const_align_abs_edge + r4q - mmsize]
>
> This macro is only ever used once, remove and inline it to reduce
> complexity.

Can I move it to x86util too?

Unified handling of PIC mangling is something to be welcomed.
I do think this is the right way. ;)
I think that at least one other file had something very similar.

I find explicit ifdef/else/endif checks a lot uglier
and harder to follow.

>> +%if num_mmregs > 8
>> +%define num_hireg_used 3
>> +%endif
>
> IIRC the number of used mmregs is ignored in 32-bit mode so I believe
> you can unconditionally specify 11.

Is that documented and guaranteed not to change in the future?
I had it at 11 unconditionally, but then decided to play it safe.

>> +        lea         r4q,  [Nq - mmsize] ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
>> +        movups      m2,   [inXq + r4q]
>> +        andps       m2,   m3
>> +        movaps      [tmpX + r4q], m2
>> +        movaps      m1,   m2            ; Sx
>
> Redundant reg-reg movaps, use m1 directly instead of using m2 at all.

Nice catch.

Done.

>> +  %if PRESEARCH_ROUNDING == 0
>> +        cvttps2dq   m2,   m2            ; yt   = (int)truncf( res*X[i] )
>> +  %else
>> +        cvtps2dq    m2,   m2            ; yt   = (int)lrintf( res*X[i] )
>> +  %endif
>> +        paddd       m5,   m2            ; Sy  += yt
>> +        cvtdq2ps    m2,   m2            ; yt   = (float)yt
>
> Would using roundps instead of doing float-int-float conversion be
> beneficial?

Not on any CPU I got benchmarks from (and that includes Skylake and Ryzen).

When I rant about Intel making instructions that are slower than
their multi-op equivalents... this is one of my prime examples.
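
(For reference, the roundps variant under discussion would look roughly like the
following; it needs SSE4.1, and Sy would then have to be accumulated as floats
with addps instead of paddd:)

  %if PRESEARCH_ROUNDING == 0
        roundps     m2,   m2,   3       ; truncate toward zero, like (int)truncf()
  %else
        roundps     m2,   m2,   0       ; round to nearest even, like lrintf()
  %endif
        addps       m5,   m2            ; Sy += yt, kept as float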


>> +        mulps       m1,   m2            ; m1   = X[i]*yt
>> +        movaps      [tmpY + r4q], m2    ; y[i] = m2
>> +        addps       m7,   m1            ; Sxy += m1;
>> +        mulps       m2,   m2            ; m2   = yt*yt
>> +        addps       m6,   m2            ; Syy += m2
>
> Potential use case for 2x FMA instead of 2x mul+add.

Hmm...

    FMULADD_PS  m7, m1, m2, m7, m1
which without FMA expands to:
    mulps       m1, m1, m2
    addps       m7, m7, m1

    FMULADD_PS  m6, m2, m2, m6, m2
which without FMA expands to:
    mulps       m2, m2, m2
    addps       m6, m6, m2

FMA is newer than AVX1. To actually use the opcodes
I'd need a 4th version just for it, or I'd have to use avx2.

Without the possibility of using FMA, the macro usage
would just be obfuscation.

Do you insist on it?
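
(For reference, on an FMA-capable target the two mul+add pairs would collapse to
something like:)

        vfmadd231ps m7,   m1,   m2      ; Sxy += X[i]*yt
        vfmadd231ps m6,   m2,   m2      ; Syy += yt*yt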

>> +%%restore_sign_loop:
>> +        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
>> +        movups      m1,   [inXq + r4q]  ; m1 = X[i]
>> +        andps       m1,   m2            ; m1 = sign(X[i])
>> +        orps        m0,   m1            ; m0 = Y[i]*sign
>> +        cvtps2dq    m3,   m0            ; m3 = (int)m0
>> +        movaps      [outYq + r4q], m3
>
> The orps instruction could be done using a memory operand instead of
> movaps+orps.

I think I had tried that, but it was slower on my CPU.

I redid the benchmarks; here are the results:

- synthetic K=3, N=176:
old : 13170 13302 13340 13458 13562
new : 13215 13218 13273 13344 13805

- sample 1 (40min) default
old : 2293 2300 2328 2331 2350
new : 2305 2307 2325 2333 2334

- sample 2 (2h) default
old : 2462 2468 2497 2519 2519
new : 2481 2486 2514 2529 2525

Do you insist on doing the change?
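
(For reference, the memory-operand variant in question would be roughly the
following; it relies on tmpY staying mmsize-aligned:)

        movups      m1,   [inXq + r4q]  ; m1 = X[i]
        andps       m1,   m2            ; m1 = sign(X[i])
        orps        m1,   [tmpY + r4q]  ; m1 = Y[i]*sign
        cvtps2dq    m3,   m1            ; m3 = (int)m1
        movaps      [outYq + r4q], m3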


> AVX supports unaligned memory operands so the same could conditionally
> be done for the andps as well but that's a bit of a nit.
>
>> +%if ARCH_X86_64 == 0    ;sbrdsp
>> +        movss       r0m,  xmm6      ; return (float)Syy_norm
>> +        fld         dword r0m
>> +%else
>> +        movaps      m0,   m6        ; return (float)Syy_norm
>> +%endif
>> +        RET
>
> The reg-reg movaps on x64 could be optimized away by reordering
> registers so that the return value ends up in m0 instead of m6 from
> the start.

No, it cannot.
I need m0 for the blendv mask on sse4.
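
(The non-VEX SSE4.1 blend instructions hard-code xmm0 as the mask register, e.g.:)

        blendvps    xm1,  xm2           ; per-element select by the sign bits of the implicit xmm0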

>> +INIT_XMM avx
>
> The code has a bunch of if conditions for ymm regs but ymm regs are
> never actually used?

I was asked to remove the INIT_YMM avx2 version.

The benchmarks showed that the resulting code is
slightly slower in the default case, so I had the code disabled.
But a lot of people cannot stand having disabled code around. :|

If you want to help, you might try to find out why the code is slower.
My main theory is that 8 floats at once might just be too much,
since 90% of the searches are with N<32.

I don't have an AVX2-capable CPU to play with.

Best Regards

