[FFmpeg-devel] [WIP] add sse4 flac lpc encoder
jamrial at gmail.com
Tue Feb 4 06:48:45 CET 2014
On 02/02/14 10:18 PM, James Darnley wrote:
> A rather hacked together patch adding an sse4 version of the flac lpc
> encoder for 16-bit samples, flac_lpc_encode_c_16(). But it works correctly.
> I have been using gprof to measure the time taken in functions.
>> Each sample counts as 0.01 seconds.
>>   %   cumulative   self              self     total
>>  time   seconds   seconds    calls  ms/call  ms/call  name
> Original code:
>> 43.94 19.45 19.45 flac_lpc_encode_c_16
> This patch:
>> 25.74 17.10 8.54 ff_flac_enc_lpc_16_sse4
> The fraction of total time is down from nearly half to just over a
> quarter. The time reported by `time` is also lower, by about 12 seconds.
> Original: 0m52.318s
> Patch: 0m40.198s
> These tests were done with compression level 8 which does skew the time
> spent in these functions to be in my favour.
> I already see that I can use 4 more xmm regs to unroll the loop more.
I tested just now, and the code is crashing for me.
> +INIT_XMM sse4
> +cglobal flac_enc_lpc_16, 3, 5, 4, 0, res, smp, coefs ; len, order, shift
You're calling the function with six arguments but this is only expecting
three. You're also reserving five general purpose registers instead of six.
> + ; r0 r1 r2 r3 r4 r5
> +%define posj r3
> +%define negj r4
> +movd m3, r5m ; shift
> + pxor m0, m0
> + xor posj, posj
> + xor negj, negj
You're losing the len and order values before using their registers as posj and negj.
You could do
cglobal flac_enc_lpc_16, 6, 8, 4, res, smp, coefs, len, order, shift, pos, neg
above, and rename things accordingly. Though you'd be using eight general
purpose registers, making the code unsuitable for 32-bit x86.
> + loop_order:
> + movd m2, [coefsq+posj*4] ; c = coefs[j]
> + SPLATD m2
> + movu m1, [smpq+negj*4-4] ; s = smp[i-j-1]
> + pmulld m1, m2
> + paddd m0, m1 ; p += c * s
> + add posj, 1
> + sub negj, 1
> + cmp posj, r4m
> + jne loop_order
> + psrad m0, m3 ; p >>= shift
> + movu m1, [smpq]
> + psubd m1, m0 ; smp[i] - p
> + movu [resq], m1 ; res[i] = smp[i] - (p >> shift)
> + add resq, mmsize
> + add smpq, mmsize
> + sub DWORD r3m, mmsize/4
> +jg loop_len
After changing what I mentioned above, the code worked for me, though the speed
gains in my tests weren't as good as what you reported. (I used the default
compression level, however.)
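
For reference, the per-sample computation described by the comments in the
quoted loop (p += c * s, p >>= shift, res[i] = smp[i] - (p >> shift)) can be
sketched in scalar C as below. This is only a minimal sketch, not the actual
flac_lpc_encode_c_16() from the tree: the name lpc_encode_ref is hypothetical,
and copying the first `order` samples through unchanged is an assumption about
how the warm-up samples are handled.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the quoted SIMD loop.
 * For i >= order: res[i] = smp[i] - (p >> shift),
 * where p = sum over j of coefs[j] * smp[i-j-1]. */
static void lpc_encode_ref(int32_t *res, const int32_t *smp, int len,
                           int order, const int32_t *coefs, int shift)
{
    int i, j;

    /* Assumption: warm-up samples have no full history, so pass them through. */
    for (i = 0; i < order && i < len; i++)
        res[i] = smp[i];

    for (i = order; i < len; i++) {
        int32_t p = 0;
        for (j = 0; j < order; j++)
            p += coefs[j] * smp[i - j - 1];   /* p += c * s */
        /* Right-shifting a negative value is implementation-defined in C;
         * this relies on the usual arithmetic shift, like psrad. */
        res[i] = smp[i] - (p >> shift);
    }
}
```

Something like this could be used to compare the output of the sse4 version
against a known-good result for a given len, order and shift.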