[FFmpeg-devel] [PATCH]v5 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Sat Jul 22 14:18:30 EEST 2017

This patch is ready for review and inclusion.

An explanation of what it does and how it works
can be found in the previous WIP threads:
[v1] http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212146.html
[v2] http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212816.html
[v3] http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213030.html
[v4] http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213436.html

The changes compared to WIP v4 are small:
 - Use r4d ops to clear the high bits on int32 arguments.
 - Correctly map the cglobal register usage.
 - Use SSE4 instead of SSE42, since blend is only SSE4.1.
 - Fix building with --disable-x86asm.

 - Remove testing defines.
Loading constants into registers is (now) always the same speed or faster.
Avoiding the write-forwarding stall is faster on all CPUs except Ryzen.
On Ryzen the alternative is about 7 cycles faster, which is why
I've left that code in, disabled, but without a define.
I've also kept the two other defines, as they are useful
for debugging and for producing binary-identical results
to other algorithms.

 - Disable use of the 256-bit AVX2 variant.
I'm leaving the code in the assembly, disabled,
in case it is useful in the future.

---

I'm including some of the benchmarks.
Some data has been removed, since it was used to test different methods.
The benchmarks were run at the default setting (96kbps),
but with different samples. All samples are over 1h long.

In summary, the function is about 2-3x faster
than the improved FFmpeg C version.

===========================================================
 K10  AMD Phenom(tm) II X4 945 Processor
//v4
      706   706   706   706   706            // NULL
     4146  4161  4169  4184  4188  4328 4379 // SSE2
     4988  5015  5016  5030  5185            // USE_APPROXIMATION  0
    13860 13828 13846 13846 13831            // C

===========================================================
 Pentium Dual Core E5800
//v4
    3006 3012 3019 3023 3025 // SSE2
    9066 9071 9074 9077 9081 // C

//===========================================================
 Ryzen 1800X
//v3
     357                     // NULL
    1999 2001 2004           // AVX1 GCC
    2010 2029                // SSE4 MSVC
    2012 2026 2027           // AVX1 MSVC
    2166 2170 2171           // AVX2 & STALL_WRITE_FORWARDING 1
    2176 2179 2180 2180 2189 // AVX2
    2226 2230 2234           // AVX2 & USE_APPROXIMATION 0
    6216 6162 6162           // C only GCC
   61909 61545               // C only MSVC
//v4
    1931 1933 1935           // v4 AVX1
    2096 2097 2098           // v4 AVX2 & STALL_WRITE_FORWARDING 1
    2103 2110 2112           // v4 AVX2

//===========================================================
 Intel(R) Core(TM) i7-3930K CPU
//v3
     272            // NULL
    1755 1756 1764  // AVX1
    1847 1855 1866  // SSE4
    2003 2009       // USE_APPROXIMATION  0
    2103 2110 2112  // AVX2
    4855 4856       // C only

//===========================================================
 SkyLake i7 6700HQ
//v2
     264                        // NULL
    1764 1765 1772 1773 1780    // SSE4
    1782 1782 1787 1795 1796    // AVX1
    1805 1807 1807 1811 1815    // AVX1 & USE_APPROXIMATION 0
    1826 1827 1828 1833 1833    // SSE2
    1850 1853 1857 1857 1868    // AVX2
    6878 6934 6879 6921 6899    // C

-b:a 48kbps, 96kbps, 510kbps
sse4:  2049,   1826,     955
sse2:  2065,   1874,     943
avx:   2106,   1868,     950
c:     9202,   7080,    1392
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-SIMD-opus-pvq_search-implementation.patch
Type: text/x-patch
Size: 24414 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20170722/c9ec5d19/attachment.bin>