[FFmpeg-devel] [PATCH 0/6] x86 SIMD for dirac 10-bit wavelet transforms

Wed Jul 25 15:21:30 EEST 2018

On 2018-07-19 17:23, Rostislav Pehlivanov wrote:
> Could you provide standard overall transform results using START/STOP_TIMER
> rather than overall decoding speed?

Ask and ye shall receive.

> haar horizontal compose
>     sse2: 3.67x faster (45248±108.1 vs. 12328±21.1 decicycles) compared with none
>     avx:  3.74x faster (45248±108.1 vs. 12091±11.0 decicycles) compared with none
>     avx2: 5.14x faster (45248±108.1 vs. 8805±15.6 decicycles) compared with none
> haar vertical compose
>     sse2: 1.57x faster (31771±459.9 vs. 20179±786.2 decicycles) compared with none
>     avx:  1.62x faster (31771±459.9 vs. 19572±253.1 decicycles) compared with none
>     avx2: 1.73x faster (31771±459.9 vs. 18337±827.9 decicycles) compared with none
> 
> legall vertical hi
>     sse2: 3.68x faster (20506±46.2 vs. 5574±29.7 decicycles) compared with none
>     avx2: 5.96x faster (20506±46.2 vs. 3442±32.7 decicycles) compared with none
> legall vertical lo
>     sse2: 1.52x faster (28360±178.6 vs. 18603±114.8 decicycles) compared with none
>     avx2: 1.64x faster (28360±178.6 vs. 17255±372.3 decicycles) compared with none
> 
> dd97 vertical hi
>     sse2: 2.76x faster (31975±103.0 vs. 11570±247.5 decicycles) compared with none
>     avx:  2.82x faster (31975±103.0 vs. 11346±179.0 decicycles) compared with none
>     avx2: 3.83x faster (31975±103.0 vs. 8357±219.6 decicycles) compared with none
> dd97 vertical lo
>     sse2: 1.52x faster (29476±335.8 vs. 19429±518.7 decicycles) compared with none
>     avx2: 1.62x faster (29476±335.8 vs. 18246±559.8 decicycles) compared with none

Here "none" refers to the C functions, i.e. the results obtained with the
"-cpuflags none" option.
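
For anyone reading the numbers: each "Nx faster" figure is consistent with
the C cycle count divided by the SIMD cycle count. A minimal Python sketch
of that arithmetic, using the haar horizontal compose values quoted above
(the helper name `speedup` is made up for illustration and is not part of
any FFmpeg tool):

```python
# Cycle counts (decicycles) copied from the haar horizontal compose results.
C_REFERENCE = 45248
SIMD = {"sse2": 12328, "avx": 12091, "avx2": 8805}


def speedup(reference, optimized):
    """Return how many times faster the optimized version is than the reference."""
    return reference / optimized


for name, cycles in SIMD.items():
    print(f"{name}: {speedup(C_REFERENCE, cycles):.2f}x faster")
```

The same division reproduces the other rows when fed their cycle counts.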

I also have results with the C wrappers removed from these functions
(all except dd97).  They aren't that much better.

> haar horizontal compose
>     sse2: 3.68x faster (45143±36.4 vs. 12279±16.4 decicycles) compared with none
>     avx:  3.68x faster (45143±36.4 vs. 12275±9.2 decicycles) compared with none
>     avx2: 5.16x faster (45143±36.4 vs. 8742±12.3 decicycles) compared with none
> haar vertical compose
>     sse2: 1.64x faster (31792±367.5 vs. 19377±271.7 decicycles) compared with none
>     avx:  1.58x faster (31792±367.5 vs. 20090±593.9 decicycles) compared with none
>     avx2: 1.66x faster (31792±367.5 vs. 19157±1352.4 decicycles) compared with none
> 
> legall vertical hi
>     sse2: 3.86x faster (20201±26.5 vs. 5231±39.0 decicycles) compared with none
>     avx2: 6.70x faster (20201±26.5 vs. 3014±39.1 decicycles) compared with none
> legall vertical lo
>     sse2: 1.50x faster (28345±206.6 vs. 18908±440.3 decicycles) compared with none
>     avx2: 1.63x faster (28345±206.6 vs. 17361±637.9 decicycles) compared with none
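
To put a number on the difference between the two runs, here is a rough
Python sketch that compares the avx2 cycle counts with and without the C
wrappers. The values are copied from the two result sets above; the choice
of avx2 and the helper name `percent_saved` are only for illustration, and
the sketch ignores the reported measurement noise.

```python
# avx2 cycle counts (decicycles): (with C wrappers, without C wrappers)
AVX2 = {
    "haar horizontal compose": (8805, 8742),
    "haar vertical compose": (18337, 19157),
    "legall vertical hi": (3442, 3014),
    "legall vertical lo": (17255, 17361),
}


def percent_saved(with_wrapper, without_wrapper):
    """Percentage of cycles saved by dropping the wrapper (negative = slower)."""
    return 100.0 * (with_wrapper - without_wrapper) / with_wrapper


for name, (before, after) in AVX2.items():
    print(f"{name}: {percent_saved(before, after):+.1f}%")
```

With these inputs the largest change is in legall vertical hi (roughly 12%
fewer cycles); the other three move by less than 5% in either direction.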

I will squash the patches, update the commit messages, and send a new patch thread.