[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc

Justin Ruggles justin.ruggles
Mon Jan 17 01:14:48 CET 2011

On 01/16/2011 02:52 PM, Loren Merritt wrote:

> On Sun, 16 Jan 2011, Justin Ruggles wrote:
>>>>> +    sub     offset1q, 256
>>>>> +    cmp     offset1q, offsetq
>>>> It is usually possible to arrange your pointers such that a loop ends with
>>>> an offset of 0, and then you can take the flags from the add/sub instead of
>>>> a separate cmp.
>>> Or check for underflow.  ie jns
>>>  sub     offset1q, 256
>>>  js       next
>>> top:
>>>  ...
>>>  sub     offset1q, 256
>>>  jns      top
>>> next:
>> I don't think it's as simple as that for the inner loop in this case.
>> It doesn't decrement to 0, it decrements to the first block.  If I make
>> offset1 lower by 256 and decrement to 0 it works, but then I have to add
>> 256 when loading from memory, and it ends up being slower than the way I
>> have it currently.
> The first iteration that doesn't run is when offset1q goes negative. 
> That's good enough. Just remove the cmp and change jne to jae.

The first iteration that doesn't run is the one where offset1q == offsetq,
and offsetq ranges from 0 to [80..256]-mmsize, so it is not generally 0.

> Or for the general case, don't undo the munging in the inner loop, munge 
> the base pointer. Applying that to this function produces
> %macro AC3_EXPONENT_MIN 1
> cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, expn
>      cmp  reuse_blksd, 0
>      je .end
>      sal  reuse_blksd, 8
>      mov        expnd, reuse_blksd
> .nextexp:
>      mov      offsetd, reuse_blksd
>      mova          m0, [expq]
> .nextblk:
> %ifidn %1, mmx
>      PMINUB_MMX    m0, [expq+offsetq], m1
> %else ; mmxext/sse2
>      pminub        m0, [expq+offsetq]
> %endif
>      sub      offsetd, 256
>      jae .nextblk
>      mova      [expq], m0
>      add         expq, mmsize
>      sub        expnd, mmsize
>      jae .nextexp
> .end:
>      REP_RET
> %endmacro
> ... which is 6x slower on Conroe x86_64, so I must have done something wrong.

Yeah, it's wrong in several ways.  The outer loop is supposed to run
offset/mmsize times (offset is 80 to 256), in steps of mmsize.  The inner
loop is supposed to run reuse_blks times, in steps of 256, for each
outer-loop iteration.
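
For reference, the intended loop structure can be sketched in scalar C.
This is only an illustrative model: the function name, the signature, and
the nb_coefs parameter (standing in for "offset", 80 to 256) are
assumptions, not code from the patch.  The SIMD version would process
mmsize coefficients per outer iteration instead of one.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar model of the exponent-min operation (name and
 * signature are assumptions).  Blocks are laid out 256 bytes apart. */
static void exponent_min_ref(uint8_t *exp, int reuse_blks, int nb_coefs)
{
    if (reuse_blks <= 0)
        return;                      /* nothing to merge */

    /* Outer loop: one coefficient per pass (mmsize at a time in SIMD). */
    for (int i = 0; i < nb_coefs; i++) {
        uint8_t min_exp = exp[i];

        /* Inner loop: runs reuse_blks times, stepping 256 bytes per block. */
        for (int blk = 1; blk <= reuse_blks; blk++) {
            uint8_t e = exp[i + 256 * blk];
            if (e < min_exp)
                min_exp = e;
        }
        exp[i] = min_exp;            /* store the minimum into the first block */
    }
}
```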

Reversing the outer loop seems unrelated to what you've mentioned.  I
don't see how it helps.  Is it actually faster to have an extra add
instead of an offset in the load and store?

I think I get what you mean about adjusting base pointer though.  I'll
try it.
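
One possible reading of the base-pointer adjustment, sketched in scalar C:
fold the outer offset into the pointer once per outer iteration, so the
inner offset can simply count down to zero.  The function name and
signature are assumptions for illustration, and a C compiler will not
necessarily emit the sub/branch pairing discussed above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch (name and signature are assumptions): same result
 * as a plain nested loop, but the inner offset ends at zero. */
static void exponent_min_biased(uint8_t *exp, int reuse_blks, int nb_coefs)
{
    if (reuse_blks <= 0)
        return;

    for (int i = 0; i < nb_coefs; i++) {
        /* Fold the outer offset into the base pointer once. */
        uint8_t *col = exp + i;
        uint8_t min_exp = col[0];

        /* Start at the last block and step down toward zero.  The loop
         * ends when the offset reaches 0, so in assembly the flags from
         * the subtraction could drive the branch without a separate cmp. */
        for (ptrdiff_t off = (ptrdiff_t)256 * reuse_blks; off > 0; off -= 256) {
            if (col[off] < min_exp)
                min_exp = col[off];
        }
        col[0] = min_exp;
    }
}
```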

