# [FFmpeg-devel] [WIP][PATCH]v2 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Ivan Kalvachev ikalvachev at gmail.com
Sat Jun 24 23:39:03 EEST 2017

```This is the second version of my work.

Nobody posted any benchmarks, so
the old code remains for this round too.

The proper PIC handling code is included.

Small cosmetics, e.g. using tmpY,
to separate (semantically) from the output outY.

Now the tmpX buffer is fixed at 256*sizeof(float) size
and allocated by "cglobal" entry code.

The biggest changes are in the way distortion is computed.
The old code preferred the rsqrt() approximation.
However I discovered that sometimes pulses are assigned
at positions where X[i]==0.0 . The padding is set to 0.0 precisely
to avoid assigning pulses in it, yet the code sometimes
assigned pulses there anyway.

The new code solves this problem in 2 different ways:

1. checking for X[i]==0.0 and zeroing the numerator,
ensuring that it would never be bigger than p_max.
It's done by 3 more instructions in the "pulses add" inner loop.
(The "pulses sub" loop already has a similar check that is sufficient).
This code is enabled for "USE_APPROXIMATION 2" case.

2. Improving precision.
Since input X[] is normalized,
the distortion is calculated by the formula:
d=2*(1-Sxy/sqrt(Syy))
where Sxy = Sum X[i]*Y[i]
and    Syy = Sum Y[i]*Y[i]

The old code (including the C version from opus) calculates it
partially, with p=Sxy/sqrt(Syy) or p^2=Sxy*Sxy/Syy .
(It avoids the division by replacing it with 2 multiplications
in a cross, aka a/b < c/d => a*d < c*b)
So the p values approach 1; if we get p=1 we have a perfect match.
The problem is that the closer Sxy^2 gets to Syy,
the less precision we get when (approximating) the division.
To avoid that I tried to turn the formula into:
(sqrt(Syy)-Sxy)/sqrt(Syy)
in order to bring the numerator toward zero.
(As floats are normalized, this improves precision).

After some manual experimentation,
I came up with the hacked formula:
(Syy-Sxy*Sxy)*approx(1/Syy)
that gives the best results.

The final code uses the sign inverted formula (Sxy*Sxy-Syy)/Syy,
in order to preserve the comparison direction of the code.
Since "USE_APPROXIMATION 1" already uses a similar formula,
the new hack consists of placing a single subtraction in the inner loop.
Also, since the formula is inverted, its range starts at -1 and goes up,
so p_max should start with a big enough negative value.

Approximation method #1 was slower than #2 and used to give the same results.
However the improved variant gives results that I thought to be binary
identical to the C version. (Well, I found at least a single case where they are not).

I'm inclined to pick "USE_APPROXIMATION 1" as the default method,
even if #2 might still be faster, just because it provides a better trade-off
between precision and speed.

In my benchmarks pvq_search_sse42 is about 2x the speed of
the current C implementation. The v1 was closer to 2.5x .

I'd be glad to see some benchmarks,
preferably with different defines enabled and disabled,
so I can tune the code for different CPUs.

Best Regards
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-SIMD-opus-pvq_search-implementation-v2.patch
Type: text/x-patch
Size: 27415 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20170624/065b3c0d/attachment.bin>
```
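
The formulas in the message can be illustrated with a small scalar sketch. This is not the code from the patch (which is x86 SIMD asm). It is a plain Python model written under assumptions: the function names are hypothetical, X is assumed to be unit-norm with signs already stripped (all values >= 0), and the greedy "add one pulse" step is a simplified stand-in for the "pulses add" inner loop. It shows the distortion d=2*(1-Sxy/sqrt(Syy)), the division-free cross-multiplied comparison, and the sign-inverted score (Sxy*Sxy-Syy)/Syy with p_max starting at a large negative value and positions with X[i]==0.0 skipped.

```python
import math


def distortion(x, y):
    """d = 2*(1 - Sxy/sqrt(Syy)) for a unit-norm x and a nonzero pulse vector y."""
    sxy = sum(a * b for a, b in zip(x, y))
    syy = sum(b * b for b in y)
    return 2.0 * (1.0 - sxy / math.sqrt(syy))


def inverted_metric(sxy, syy):
    """Sign-inverted score (Sxy*Sxy - Syy)/Syy.

    For unit-norm x it lies in [-1, 0]; larger means a better match.
    """
    return (sxy * sxy - syy) / syy


def better_by_cross_multiply(sxy_a, syy_a, sxy_b, syy_b):
    """True if candidate a has a larger Sxy^2/Syy than candidate b.

    Uses a/b > c/d  <=>  a*d > c*b, which holds because Syy > 0,
    so no division is needed.
    """
    return sxy_a * sxy_a * syy_b > sxy_b * sxy_b * syy_a


def add_one_pulse(x, y):
    """Return the index where adding one pulse maximizes the inverted score.

    Assumes x[i] >= 0. Positions with x[i] == 0.0 (e.g. padding) are
    skipped, so they can never receive a pulse. Returns -1 if all are zero.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    syy = sum(b * b for b in y)
    best_i = -1
    p_max = -math.inf  # the score starts at -1, so begin below that
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue
        new_sxy = sxy + xi                 # Sxy after y[i] += 1
        new_syy = syy + 2.0 * y[i] + 1.0   # Syy after y[i] += 1
        p = inverted_metric(new_sxy, new_syy)
        if p > p_max:
            p_max, best_i = p, i
    return best_i
```

For example, with `x = [0.6, 0.8, 0.0, 0.0]` and an empty `y`, the first pulse goes to index 1 and the second to index 0, and the zero-valued padding positions are never selected.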