[FFmpeg-devel] [WIP][PATCH]v2 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Sat Jul 1 02:52:46 EEST 2017

On 6/26/17, Henrik Gramner <henrik at gramner.com> wrote:
> On Sat, Jun 24, 2017 at 10:39 PM, Ivan Kalvachev <ikalvachev at gmail.com>
> wrote:
>> +%define HADDPS_IS_FAST 0
>> +%define PHADDD_IS_FAST 0
> [...]
>> +        haddps      %1,   %1
>> +        haddps      %1,   %1
> [...]
>> +       phaddd       xmm%1,xmm%1
>> +       phaddd       xmm%1,xmm%1
>
> You can safely assume that those instructions are always slow and that
> this is virtually never the correct way to use them, so just use the
> shuffle + add method.

Well, there is always hope that they would do a direct implementation.
Or maybe AMD would?
Well, I guess I'll just remove it in the final version.

> You can unconditionally use non-destructive 3-arg instructions
> (without v-prefix) in non AVX-code to reduce ifdeffery. The x86inc
> abstraction layer will automatically insert register-register moves as
> needed.

I'm not sure what code do you have in mind for this comment.
If it is about the HSUMPS , I have to note that both variants
would generate different code (on sse).
The short avx form, would generate "movaps" and "shufps"
that have RW dependency. The sse form avoids that and
allows "movaps" and "shufps" to be executed in parallel.

I do use v prefix on non-destructive avx ops.
Should I prefer the non-prefixed version in such cases?
Should I v-prefix all, including vmovaps ?
(afair vmovaps and movaps differ in preserving/zeroing
of the high part of ymm register, when using xmm.)
I guess I should...

BTW, talking about overload macros.

"cmpps" is the only version of this opcode overloaded by x86asm.
Compilers also define "cmpleps" for "cmp less ps"
where the "less" part is been hardcoded as immediate operand.

Aka, "cmpleps m0, m1" is compiled as
"cmpps xmm0, xmm1, 1".

What complicates matter more is that for 256bit avx,
yasm refuses to accept the short form.
I have to use:
"cmpps m0, m0, m1, 1"
because
"cmpps m0, m1, 1" is turned into
"vcmpps ymm0, ymm1, 1" (line from *.dbg.asm)
and  YASM fails with:
"error: invalid combination of opcode and operands"
(nasm 2.13.1 seems ok).

> I'm a bit doubtful if it's worth the complexity to emulate 256-bit
> integer math using floating-point instruction hacks, especially since
> that's only relevant on two 5+ year old Intel µarchs (SNB & IVB). It's
> probably fine to simply require AVX2 if you need 256-bit integer SIMD.

Well, if 256bit code is 2x the speed of the 128bit one,
then it might make sense to try the emulation.
Unfortunately, even native avx2 seems slower.

> Be aware that most SSE SIMD instructions are actually implemented as
> x86inc macros and redefining them can have unexpected consequences and
> is therefore discouraged.

The manual says that redefinition is recursive
(aka expanded multiple times in the place of usage)
so the hacked integer code would use
the float overloaded macros (and it does).

Don't worry, I don't intend of leaving this in the final version.
But I'm interested to see how much faster/slower it is on real CPU.

Also, I've noticed that some kinds of int/float registry mixing
could have dramatic effect on speed, so trying the hacked
all float version is also a way to check if I have such bad mix.
(e.g. even pxor m0,m0 ; movaps m0,[const] might be problematic)

Best Regards.