[FFmpeg-devel] [PATCH 04/10] lavc/flacenc: add sse4 version of the lpc encoder

James Darnley james.darnley at gmail.com
Wed Feb 12 16:11:54 CET 2014


On 2014-02-12 12:41, Christophe Gisquet wrote:
> Hi,
> 
> 2014-02-12 0:11 GMT+01:00 James Darnley <james.darnley at gmail.com>:
> 
>> +%if ARCH_X86_64
>> +    cglobal flac_enc_lpc_16, 6, 8, 4, 0, res, smp, len, order, coefs, shift
>> +    %define posj r6
>> +    %define negj r7
>> +%else
>> +    cglobal flac_enc_lpc_16, 6, 6, 4, 0, res, smp, len, order, coefs, shift
>> +    %define posj r2
>> +    %define negj r5
>> +%endif
> [...]
>> +movd m3, shiftmp
> 
> If I'm not mistaken, and x264asm isn't already brighter than me, you're
> forcing shift to be loaded into a gpr, while you never really have to.
> This 6th argument will always be on the stack, so you need one less gpr
> in all cases.

As I understand it, nix64 has it in a register.  I think that is what
libavutil/x86/x86inc.asm:501 says anyway.

I just ended up with all the args loaded because, when I tried it on
Win64, I got a "cmp r9, r9" at one point despite thinking I had a
register and a memory location.

> I'm not sure, but is it possible to leave order or len wherever they
> are for x86, so as to save another gpr? That may require loading the
> args manually.

I will look again, more closely, to see if I can reduce the number of
registers used.  I think an easy way to do this is to re-order the
arguments so the pointers can all go at the beginning.
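
Very roughly, something like this (untested sketch; the re-ordered
prototype is hypothetical and the C declaration and callers would have
to change to match):

    %if ARCH_X86_64
        ; unchanged: everything fits in registers anyway
        cglobal flac_enc_lpc_16, 6, 8, 4, 0, res, smp, coefs, len, order, shift
        %define posj r6
        %define negj r7
    %else
        ; x86-32: load only the pointers; len, order and shift stay in
        ; their stack slots and can be read as lenm/orderm/shiftm, much
        ; like the loop already does with lenmp and ordermp
        cglobal flac_enc_lpc_16, 3, 5, 4, 0, res, smp, coefs, len, order, shift
        %define posj r3
        %define negj r4
    %endif

That would get x86-32 down to 5 gprs instead of 6.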

>> +.looplen:
>> +    pxor m0,  m0
>> +    xor posj, posj
>> +    xor negj, negj
>> +    .looporder:
>> +        movd   m2, [coefsq+posj*4] ; c = coefs[j]
>> +        SPLATD m2
>> +        movu   m1, [smpq+negj*4-4] ; s = smp[i-j-1]
>> +        pmulld m1,  m2
>> +        paddd  m0,  m1             ; p += c * s
>> +
>> +        add posj, 1
>> +        sub negj, 1
>> +        cmp posj, ordermp
>> +    jne .looporder
> 
> Potentially stupid question: do the add and sub get compiled to
> inc/dec? Is there a benefit compared to adding/subtracting 4? (I
> guess there is.)
> Also, though maybe not worthwhile: coefsq could be incremented by
> orderq*4 and posj set to -orderq, and then you would do:
> dec negj
> inc posj
> jl/jnz .looporder


No, they don't get reduced to inc and dec.

In my first, non-public attempt at this I did loop over decreasing
order, but my code produced completely wrong results.  I could look at
doing this again now that I have working code and have "decoded" the
algorithm from the C code.
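
If I do try it, I think the suggestion amounts to something like this
(untested sketch; assumes coefsq can be advanced once before the outer
loop and that orderq still holds the order):

    lea coefsq, [coefsq+orderq*4]  ; point just past coefs[order-1]
    .looplen:
        pxor m0,  m0
        mov  posj, orderq
        neg  posj                  ; posj = -order, counts up to 0
        xor  negj, negj
        .looporder:
            movd   m2, [coefsq+posj*4] ; c = coefs[order+posj] = coefs[j]
            SPLATD m2
            movu   m1, [smpq+negj*4-4] ; s = smp[i-j-1]
            pmulld m1,  m2
            paddd  m0,  m1             ; p += c * s

            dec negj
            inc posj                   ; flags from inc drive the branch
        jl .looporder                  ; loop while posj < 0

That drops the cmp against ordermp; note negj has to be decremented
before posj is incremented so that jl tests the flags from the inc.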

>> +    movu  [resq], m1               ; res[i] = smp[i] - (p >> shift)
>> +
>> +    add resq, mmsize
>> +    add smpq, mmsize
>> +    sub lenmp, mmsize/4
>> +jg .looplen
> 
> Equivalent trick here if len is in a reg: add 4*lenq to resq, neg
> lenq, then:
> movu  [resq+4*lenq], m1
> add smpq, mmsize
> add lenq, mmsize/4
> jl .looplen
> There are probably errors in what I gave, but this should be
> sufficient to give you the idea.

Yes, I think so.
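
Written out, I think that comes to something like this (untested
sketch; assumes len has been kept in a spare gpr, lenq, rather than in
lenmp):

    lea  resq, [resq+lenq*4]       ; point resq just past res[len-1]
    neg  lenq                      ; lenq = -len, counts up to 0
    .looplen:
        ...                        ; inner loop as before
        movu [resq+lenq*4], m1     ; res[i] = smp[i] - (p >> shift)

        add smpq, mmsize
        add lenq, mmsize/4         ; one add advances index and counter
    jl .looplen                    ; loop while lenq < 0

That trades the read-modify-write of lenmp and the separate add to
resq for a single register add.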

