[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc

Loren Merritt lorenm
Sun Jan 16 20:52:28 CET 2011

On Sun, 16 Jan 2011, Justin Ruggles wrote:

>>>> +    sub     offset1q, 256
>>>> +    cmp     offset1q, offsetq
>>> It is usually possible to arrange your pointers such that a loop ends with
>>> an offset of 0, and then you can take the flags from the add/sub instead of
>>> a separate cmp.
>> Or check for underflow.  ie jns
>>  sub     offset1q, 256
>>  js       next
>> top:
>>  ...
>>  sub     offset1q, 256
>>  jns      top
>> next:
> I don't think it's as simple as that for the inner loop in this case.
> It doesn't decrement to 0, it decrements to the first block.  If I make
> offset1 lower by 256 and decrement to 0 it works, but then I have to add
> 256 when loading from memory, and it ends up being slower than the way I
> have it currently.

The first iteration that doesn't run is when offset1q goes negative. 
That's good enough. Just remove the cmp and change jne to jae.
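
A minimal standalone sketch of that loop shape, outside x86inc (plain NASM,
x86-64 SysV). The name min16_rows and its arguments are made up for
illustration and are not from the patch; it assumes nrows >= 1 and rows
256 bytes apart. sub sets CF when the offset borrows below zero, and jae is
"jump if CF clear", so the row at offset 0 is still processed and the loop
falls through on the following sub:

     ; void min16_rows(uint8_t dst[16], const uint8_t *base, long nrows)
     ; rdi = dst, rsi = base, rdx = nrows (hypothetical interface)
     section .text
     global min16_rows
     min16_rows:
         shl     rdx, 8            ; rdx = nrows*256
         sub     rdx, 256          ; byte offset of the last row
         movdqu  xmm0, [rsi+rdx]   ; start from the last row
         sub     rdx, 256
         jb      .done             ; only one row: nothing left to fold
     .loop:
         movdqu  xmm1, [rsi+rdx]
         pminub  xmm0, xmm1        ; per-byte unsigned minimum
         sub     rdx, 256          ; CF=1 once the offset drops below 0
         jae     .loop             ; no separate cmp needed
     .done:
         movdqu  [rdi], xmm0
         ret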

Or for the general case, don't undo the munging in the inner loop, munge 
the base pointer. Applying that to this function produces

cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, expn
     cmp  reuse_blksd, 0
     je .end
     sal  reuse_blksd, 8
     mov        expnd, reuse_blksd
.nextexp:
     mov      offsetd, reuse_blksd
     mova          m0, [expq]
.nextblk:
%ifidn %1, mmx
     PMINUB_MMX    m0, [expq+offsetq], m1
%else ; mmxext/sse2
     pminub        m0, [expq+offsetq]
%endif
     sub      offsetd, 256
     jae .nextblk
     mova      [expq], m0
     add         expq, mmsize
     sub        expnd, mmsize
     jae .nextexp
.end:
     REP_RET

... which is 6x slower on Conroe x86_64, so I must have done something wrong.
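
For comparison, the "loop ends with an offset of 0" arrangement in
isolation, again as a plain NASM sketch with hypothetical names (min16_fold
is not part of ac3enc). It assumes n > 0 and n a multiple of 16. The base
pointer is biased to the end of the buffer once, the index runs from -n up
to 0, and the branch takes its flags from the add:

     ; void min16_fold(uint8_t dst[16], const uint8_t *src, long n)
     ; rdi = dst, rsi = src, rdx = n (hypothetical interface)
     section .text
     global min16_fold
     min16_fold:
         add     rsi, rdx          ; rsi = src + n (one past the end)
         neg     rdx               ; rdx = -n, counts up toward 0
         pcmpeqb xmm0, xmm0        ; all 0xFF: identity for unsigned min
     .loop:
         movdqu  xmm1, [rsi+rdx]
         pminub  xmm0, xmm1
         add     rdx, 16           ; ZF=1 when the index reaches 0
         jnz     .loop
         movdqu  [rdi], xmm0
         ret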

--Loren Merritt
