[FFmpeg-devel] [WIP][PATCH] Opus Piramid Vector Quantization Search in x86 SIMD asm

Fri Jun 9 13:08:48 EEST 2017

On Fri, Jun 09, 2017 at 01:36:07AM +0300, Ivan Kalvachev wrote:
> On request by Rostislav Pehlivanov (atomnuker), I've been working on
> SSE/AVX accelerated version of pvq_search().
> 
> The attached patch is my work so far.
> 
> At the moment, for me, at the default bitrate
> the function is 2.5 times faster than the current C version.
> (My cpu is Intel Westmere class.)
> The total encoding time is about 5% faster.
> 
> I'd like some more benchmarks on different CPUs
> and maybe some advice on how to improve it.
> (I've left some alternative methods in the code
> that can easily be switched with defines, as they
> may be faster on other CPUs.)
> (I'm also quite worried about how fast it would run on pre-Ryzen AMD CPUs.)
> 
> 
> The code generates 4 variants: SSE2, SSE4.2, AVX and 256bit AVX2.
> I haven't tested the AVX variants myself on a real CPU;
> I used Intel SDE to develop and test them.
> Rostislav (atomnuker) reported some crashes with the 256bit AVX2,
> which however might be related to clang tools.
> 
> 
> 
> 
> Below are some broad descriptions:
> 
> The typical use of the function for the default bitrate (96kbps) is
> K<=36 and N<=32.
> N is the size of the input array (called vector), K is the number of
> pulses that should be present in the output (the sum of the absolute
> values of the output elements is K).
> 
> In synthetic tests, the SIMD function could be 8-12 times faster with
> the maximum N=176.
> I've been told that bigger sizes are more common at low bitrate encodes and
> will be more common with the upcoming RDO improvements.
> 
> A short description of how the function works:
> 1. Loop that calculates the sum (Sx) of the absolute values of the input
> elements (inX[]). The loop also fills a stack-allocated temp buffer (tmpX)
> that is aligned to mmsize and contains the absolute values of inX[i].
> 
> 2. Pre-Search loop. It uses K/Sx as an approximation for the vector gain
> and fills the output vector outY[] based on it. The output is in integers,
> but we use outY[] to temporarily store the doubled Y values as floats.
> (We need 2*Y for calculations). This loop also calculates a few
> parameters that are needed for the distortion calculations later (Syy =
> Sum of outY[i]^2 ; Sxy = Sum of inX[i]*outY[i]).
> 
> 3. Adding missing pulses or eliminating extra ones.
> The C function uses the variable "phase" to signal whether pulses should be
> added or removed; I've split this into separate cases. The code is
> shared through a macro PULSES_SEARCH .
> Each case is formed by 2 loops. The outer loop is executed until we
> have K pulses in the output.
> The inner is calculating the distortion parameter for each element and
> picking the best one.
> (parallel search, combination of parallel results, update of variables).
> 
> 4. When we are done we do one more loop, to convert outY[] from doubled
> floats back to integers and to restore the signs (by using the original inX[]).
> 
> 5. There is a special case when Sx==0, which happens if all elements of
> the input are zeroes (in theory the input should be normalized, which
> means Sum of X[i]^2 == 1.0). In this case we return zero output and 1.0
> as gain.
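
The five steps above can be pictured with a scalar C sketch. This is not the code from the patch or from opus_pvq.c: the function name is hypothetical, it keeps outY[] as plain integers instead of doubled floats, it uses the truncating pre-search so only the "add pulses" case is needed, and it leaves out the gain return value. The distortion parameter Sxy^2/Syy is an assumption based on the Sxy and Syy sums named in step 2.

```c
#include <math.h>

/* Hypothetical scalar sketch of the steps described above.
 * Fills outY[0..N-1] with signed integers whose absolute values sum to K. */
static void pvq_search_sketch(const float *inX, int *outY, int K, int N)
{
    float Sx = 0.0f, Sxy = 0.0f, Syy = 0.0f;
    int pulses = 0;

    /* Step 1: sum of the absolute input values. */
    for (int i = 0; i < N; i++)
        Sx += fabsf(inX[i]);

    /* Step 5: all-zero input gives all-zero output. */
    if (Sx == 0.0f) {
        for (int i = 0; i < N; i++)
            outY[i] = 0;
        return;
    }

    /* Step 2: pre-search guess based on K/Sx, truncated toward zero. */
    for (int i = 0; i < N; i++) {
        float ax = fabsf(inX[i]);
        int y = (int)(ax * K / Sx);
        outY[i] = y;
        pulses += y;
        Sxy += ax * y;
        Syy += (float)y * y;
    }

    /* Step 3: add the missing pulses one by one, each at the position
     * that maximizes Sxy^2 / Syy after the update. */
    while (pulses < K) {
        int best = 0;
        float best_num = -1.0f, best_den = 1.0f;
        for (int i = 0; i < N; i++) {
            float xy  = Sxy + fabsf(inX[i]);
            float yy  = Syy + 2.0f * outY[i] + 1.0f; /* (y+1)^2 - y^2 = 2y + 1 */
            float num = xy * xy;
            /* num/yy > best_num/best_den, compared without a division */
            if (num * best_den > best_num * yy) {
                best = i;
                best_num = num;
                best_den = yy;
            }
        }
        Sxy += fabsf(inX[best]);
        Syy += 2.0f * outY[best] + 1.0f;
        outY[best]++;
        pulses++;
    }

    /* Step 4: restore the signs from the original input. */
    for (int i = 0; i < N; i++)
        if (inX[i] < 0.0f)
            outY[i] = -outY[i];
}
```

The 2*outY[i] term in the inner loop is where the doubled Y values mentioned in step 2 would be used.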
> 
> ---
> Now, I've left some defines that change the generated code.
> 
> HADDPS_IS_FAST
> PHADDD_IS_FAST
> I've implemented my own horizontal sum macros, and while doing it, I
> have discovered that on my CPU (Westmere class) the use of the "new"
> SSE4.2 instructions is not faster than using my own code for doing
> the same.
> It's not speed critical, since horizontal sums are used 3-4 times per
> function call.
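
For readers unfamiliar with the trade-off, here is a minimal sketch of a horizontal sum built only from shuffles and adds, written with C intrinsics. It is not the HSUMPS macro from the patch; the function name is made up, and this is just one common way to avoid 'haddps'.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum the four float lanes of v using shuffles and adds only (no haddps). */
static float hsum_ps_sketch(__m128 v)
{
    __m128 hi  = _mm_movehl_ps(v, v);           /* lanes: v2 v3 v2 v3 */
    __m128 s2  = _mm_add_ps(v, hi);             /* lane 0: v0+v2, lane 1: v1+v3 */
    __m128 odd = _mm_shuffle_ps(s2, s2, 0x55);  /* broadcast lane 1 */
    __m128 s1  = _mm_add_ss(s2, odd);           /* lane 0: v0+v1+v2+v3 */
    return _mm_cvtss_f32(s1);
}
```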
> 
> BLENDVPS_IS_FAST
> PBLENDVB_IS_FAST
> I think that blend on my CPU is faster than the alternative version
> that I've implemented. However, I'm not sure this is true for all
> CPUs, since a number of modern CPUs have latency=3 and
> inv_throughput=2 (that's 2 clocks until another blend could start).
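
As an illustration of what a blend replacement can look like, below is a sketch of the usual and/andnot/or select written with C intrinsics. Whether the patch's alternative path does exactly this is an assumption, and the function name is hypothetical. Unlike 'blendvps', which looks only at the sign bit of each mask lane, this form needs every mask lane to be all ones or all zeros, which is what the compare instructions produce.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Per-lane select: take b where the mask lane is all ones, a where it is
 * all zeros. Three logic ops stand in for a single blend instruction. */
static __m128 select_ps_sketch(__m128 a, __m128 b, __m128 mask)
{
    return _mm_or_ps(_mm_and_ps(mask, b),      /* mask & b  */
                     _mm_andnot_ps(mask, a));  /* ~mask & a */
}
```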
> 
> CONST_IN_X64_REG_IS_FASTER
> The function is implemented so only 8 registers are used. With this
> define constants used during PULSES_SEARCH are loaded in the high
> registers available on X64. I could not determine if it is faster to
> do so... it should be, but sometimes I got the opposite result.
> I'd probably enable it in the final version.
> 
> STALL_WRITE_FORWARDING
> After the inner search finds the maximum, we add/remove a pulse in
> outY[i]. Writing a single element (sizeof(float)=4) however could block
> the long load done in the inner loop (mmsize=16). This hurts a lot
> more on small vector sizes.
> On Skylake the penalty is only 11 cycles, while Ryzen should have no
> penalty at all. Older CPUs can have a penalty of up to 200 cycles.
> 
> SHORT_SYY_UPDATE
> This define matters only when STALL_WRITE_FORWARDING is 0 (that is, when
> the longer code that avoids stalls is used).
> It saves a few instructions by loading the old outY[] value with a scalar
> load, instead of using HSUMPS and some 'haddps' to calculate it.
> So far it looks like the short update is always faster, but I've left
> it just in case...
> 
> USE_APPROXIMATION
> This controls the method used for calculation of the distortion parameter.
> "0" means using 1 multiplication and 1 division, that could be a lot
> slower (14;14 cycles on my CPU, 11;7 on Skylake)
> "1" uses 2 multiplications and 1 reciprocal op that is a lot faster
> than real division, but gives half precision.
> "2" uses 1 multiplication and 1 reciprocal square root op, that is
> literally 1 cycle, but again gives half precision.
> 
> PRESEARCH_ROUNDING
> This controls the rounding of the gain used for the guess.
> "0" means using truncf() that makes sure that the pulses would never
> be more than K.
> It gives results identical to the original celt_* functions
> "1" means using lrintf(), this is basically the improvement of the
> current C code over the celt_ one.
> 
> 
> ALL_FLOAT_PRESEARCH
> The presearch filling of outY[] could be done entirely with float ops
> (using SSE4.2 'roundps' instead of two cvt*).  It is mostly useful if
> you want to try YMM on AVX1 (AVX1 lacks 256-bit integer ops).
> For some reason enabling this makes the whole function 4 times slower
> on my CPU. ^_^
> 
> I've left some commented out code. I'll remove it for sure in the final version.
> 
> I just hope I haven't made some lame mistake at the last minute...

>  opus_pvq.c              |    9 
>  opus_pvq.h              |    5 
>  x86/Makefile            |    1 
>  x86/opus_dsp_init.c     |   47 +++
>  x86/opus_pvq_search.asm |  597 ++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 657 insertions(+), 2 deletions(-)
> 3b9648bea3f01dad2cf159382f0ffc2d992c84b2  0001-SIMD-opus-pvq_search-implementation.patch
> From 06dc798c302e90aa5b45bec5d8fbcd64ba4af076 Mon Sep 17 00:00:00 2001
> From: Ivan Kalvachev <ikalvachev at gmail.com>
> Date: Thu, 8 Jun 2017 22:24:33 +0300
> Subject: [PATCH 1/3] SIMD opus pvq_search implementation.

This seems to break the build with mingw64. I didn't investigate, but it
fails with these errors:

libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
make: *** [ffmpeg_g.exe] Error 1
make: *** Waiting for unfinished jobs....
make: *** [ffprobe_g.exe] Error 1

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Democracy is the form of government in which you can choose your dictator