[FFmpeg-devel] [PATCH] lpc: rewrite lpc_compute_autocorr in external asm

James Almer jamrial at gmail.com
Sun May 26 01:45:05 EEST 2024


On 5/25/2024 7:31 PM, James Almer wrote:
> On 5/25/2024 5:57 PM, Lynne via ffmpeg-devel wrote:
>> The inline asm function had issues running under checkasm.
>> So I came to finish what I started, and wrote the last part
>> of LPC computation in assembly.
>>
>> autocorr_10_c: 135525.8
>> autocorr_10_sse2: 50729.8
>> autocorr_10_fma3: 19007.8
>> autocorr_30_c: 390100.8
>> autocorr_30_sse2: 142478.8
>> autocorr_30_fma3: 50559.8
>> autocorr_32_c: 407058.3
>> autocorr_32_sse2: 151633.3
>> autocorr_32_fma3: 50517.3
>> ---
>>   libavcodec/x86/lpc.asm    | 91 +++++++++++++++++++++++++++++++++++++++
>>   libavcodec/x86/lpc_init.c | 87 ++++---------------------------------
>>   2 files changed, 100 insertions(+), 78 deletions(-)
>>
>> diff --git a/libavcodec/x86/lpc.asm b/libavcodec/x86/lpc.asm
>> index a585c17ef5..790841b7f4 100644
>> --- a/libavcodec/x86/lpc.asm
>> +++ b/libavcodec/x86/lpc.asm
>> @@ -32,6 +32,8 @@ dec_tab_sse2: times 2 dq -2.0
>>   dec_tab_scalar: times 2 dq -1.0
>>   seq_tab_sse2: dq 1.0, 0.0
>> +autoc_init_tab: times 4 dq 1.0
>> +
>>   SECTION .text
>>   %macro APPLY_WELCH_FN 0
>> @@ -261,3 +263,92 @@ APPLY_WELCH_FN
>>   INIT_YMM avx2
>>   APPLY_WELCH_FN
>>   %endif
>> +
>> +%macro COMPUTE_AUTOCORR_FN 0
>> +cglobal lpc_compute_autocorr, 4, 7, 8, data, len, lag, autoc, lag_p, data_l, len_p
> 
> Already mentioned, but it should be 3 not 8.
> 
>> +
>> +    shl lagd, 3
>> +    shl lenq, 3
>> +    xor lag_pq, lag_pq
>> +
>> +.lag_l:
>> +    movaps m8, [autoc_init_tab]
> 
> m2
> 
>> +
>> +    mov len_pq, lag_pq
>> +
>> +    lea data_lq, [lag_pq + mmsize - 8]
>> +    neg data_lq                     ; -j - mmsize
>> +    add data_lq, dataq              ; data[-j - mmsize]
>> +.len_l:
>> +    ; We waste the upper value here on SSE2,
>> +    ; but we use it on AVX.
>> +    movupd xm0, [dataq + len_pq]    ; data[i]
> 
> movsd
> 
>> +    movupd m1, [data_lq + len_pq]   ; data[i - j]
>> +
>> +%if cpuflag(avx)
> 
> %if mmsize == 32 here and everywhere else.
> 
>> +    vbroadcastsd m0, xm0
> 
> This is AVX2. AVX only has the memory-operand form. So use that and save 
> the movsd from above for the FMA3 version.
> 
>> +    vperm2f128 m1, m1, m1, 0x01
> 
> Aren't you loading 16 extra bytes for no reason if you're just going to 
> use the upper 16 bytes from the load above?

Never mind, this is swapping lanes.
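
For readers following along, the lane swap done by "vperm2f128 m1, m1, m1, 0x01" can be modelled in plain C as below. This is only an illustrative sketch: swap_lanes_4d is a hypothetical helper name, not something in the patch, and it treats the ymm register as an array of four doubles.

```c
#include <string.h>

/* Hypothetical helper (not part of the patch): models
 * "vperm2f128 dst, src, src, 0x01" on a register viewed as four doubles. */
static void swap_lanes_4d(double dst[4], const double src[4])
{
    double tmp[4];

    /* imm8 bits [1:0] = 1: the low 128 bits of dst take the HIGH lane of the first source */
    tmp[0] = src[2];
    tmp[1] = src[3];

    /* imm8 bits [5:4] = 0: the high 128 bits of dst take the LOW lane of the first source */
    tmp[2] = src[0];
    tmp[3] = src[1];

    /* Copy through a temporary so dst and src may be the same array,
     * as they are in the asm (m1 is both source and destination). */
    memcpy(dst, tmp, sizeof(tmp));
}
```

So an input of {d0, d1, d2, d3} comes out as {d2, d3, d0, d1}; the two halves trade places and the order inside each half is unchanged.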

That aside, in all my tests on win64 with GCC these versions are barely 
better than the C version, and with certain seeds they are sometimes worse.
For example, seed 4022958484 gives me:

autocorr_10_c: 21345.6
autocorr_10_sse2: 16434.6
autocorr_10_fma3: 24154.6
autocorr_30_c: 59239.1
autocorr_30_sse2: 46114.6
autocorr_30_fma3: 64147.1
autocorr_32_c: 63022.1
autocorr_32_sse2: 50040.1
autocorr_32_fma3: 66594.1

But seed 2236774811 gives me:

autocorr_10_c: 37135.3
autocorr_10_sse2: 26492.3
autocorr_10_fma3: 32943.3
autocorr_30_c: 102266.8
autocorr_30_sse2: 72933.3
autocorr_30_fma3: 85808.3
autocorr_32_c: 106537.8
autocorr_32_sse2: 77623.3
autocorr_32_fma3: 85844.3

But if I force len to always be 4999 instead of letting it vary with 
the seed, I consistently get results like:

autocorr_10_c: 40447.3
autocorr_10_sse2: 39526.8
autocorr_10_fma3: 42955.3
autocorr_30_c: 111362.3
autocorr_30_sse2: 111408.3
autocorr_30_fma3: 116781.8
autocorr_32_c: 122388.3
autocorr_32_sse2: 119125.3
autocorr_32_fma3: 114239.3

It would help if someone else could confirm this, but overall I don't 
see any worthwhile gain here. The old inline version, for those seeds 
where it worked, was somewhat faster.
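
For anyone who wants to cross-check the numbers, a plain scalar loop of the kind being benchmarked might look like the sketch below. It is not the code from libavcodec: the name autocorr_ref is hypothetical, and writing lag + 1 outputs (autoc[0] through autoc[lag]) is an assumption about the expected output size. The 1.0 starting value mirrors autoc_init_tab in the quoted patch.

```c
#include <stddef.h>

/* Hypothetical scalar reference (not the libavcodec implementation).
 * Assumes autoc has room for lag + 1 doubles and that len > lag. */
static void autocorr_ref(const double *data, ptrdiff_t len, int lag,
                         double *autoc)
{
    for (int j = 0; j <= lag; j++) {
        /* Each accumulator starts at 1.0, as autoc_init_tab does in the patch. */
        double sum = 1.0;

        /* Start at i = j so data[i - j] never indexes before data[0]. */
        for (ptrdiff_t i = j; i < len; i++)
            sum += data[i] * data[i - j];

        autoc[j] = sum;
    }
}
```

With data = {1, 2, 3, 4}, len = 4 and lag = 2 this yields 31, 21 and 12. The inner loop runs len - j times per lag, so the total work grows roughly with len * (lag + 1), which is why a seed-dependent len shifts the absolute timings between runs.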

