[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization

Wed Jul 2 12:54:31 CEST 2008

On Wed, Jul 2, 2008 at 12:37 PM, Måns Rullgård <mans at mansr.com> wrote:
>>> 0000001c <f2>:
>>>  1c:   e92d4010        stmdb   sp!, {r4, lr}
>>>  20:   e2504001        subs    r4, r0, #1      ; 0x1
>>>  24:   38bd8010        ldmccia sp!, {r4, pc}
>>>  28:   e2444001        sub     r4, r4, #1      ; 0x1
>>>  2c:   ebfffffe        bl      0 <q>
>>>  30:   e3740001        cmn     r4, #1  ; 0x1
>>>  34:   1afffffb        bne     28 <q+0x28>
>>>  38:   e8bd8010        ldmia   sp!, {r4, pc}
>>>
>>> I'm curious, what is the output of your compiler?
>>
>> CSL 2007q3 and 2008q1 both generate this:
>>
>> 00000000 <f2>:
>>    0:   e92d4070        push    {r4, r5, r6, lr}
>>    4:   e2505000        subs    r5, r0, #0      ; 0x0
>>    8:   08bd8070        popeq   {r4, r5, r6, pc}
>>    c:   e3a04000        mov     r4, #0  ; 0x0
>>   10:   e2844001        add     r4, r4, #1      ; 0x1
>>   14:   ebfffffe        bl      0 <q>
>>   18:   e1540005        cmp     r4, r5
>>   1c:   1afffffb        bne     10 <f2+0x10>
>>   20:   e8bd8070        pop     {r4, r5, r6, pc}
>>
>> 00000024 <f1>:
>>   24:   e3500001        cmp     r0, #1  ; 0x1
>>   28:   e92d4070        push    {r4, r5, r6, lr}
>>   2c:   e1a05000        mov     r5, r0
>>   30:   48bd8070        popmi   {r4, r5, r6, pc}
>>   34:   e3a04000        mov     r4, #0  ; 0x0
>>   38:   e2844001        add     r4, r4, #1      ; 0x1
>>   3c:   ebfffffe        bl      0 <q>
>>   40:   e1540005        cmp     r4, r5
>>   44:   1afffffb        bne     38 <q+0x38>
>>   48:   e8bd8070        pop     {r4, r5, r6, pc}
>
> That's exactly what I got too.  It's curious that it saves r6, even
> though it is never used.  Perhaps it does this to keep the stack
> 8-byte aligned.

Certainly, since it's an EABI requirement.  It's probably
faster to do it this way than to correct the SP with an
explicit sub instruction.
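
To make the arithmetic concrete, here is a minimal sketch in C.  It
assumes 4 bytes per pushed core register and an 8-byte SP alignment
at calls; the helper names push_bytes and padding_regs are made up
for illustration and are not part of any toolchain.

```c
#include <stddef.h>

/* Assumptions for this sketch: each pushed core register takes
 * 4 bytes, and SP must be a multiple of 8 at a call (EABI). */
enum { REG_BYTES = 4, STACK_ALIGN = 8 };

/* Bytes moved by a push of nregs registers. */
static size_t push_bytes(size_t nregs)
{
    return nregs * REG_BYTES;
}

/* Number of extra (otherwise unused) registers to add to the push
 * list so that the pushed size is a multiple of STACK_ALIGN. */
static size_t padding_regs(size_t nregs)
{
    size_t rem = push_bytes(nregs) % STACK_ALIGN;
    return rem ? (STACK_ALIGN - rem) / REG_BYTES : 0;
}
```

With {r4, r5, lr} the push is 12 bytes, so padding_regs(3) is 1 and
adding r6 brings it to 16.  The {r4, lr} push in the older output is
already 8 bytes and needs no padding.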

It should also be noted that conditionally executing
ld/st instructions is not a very good idea in general:
if your processor is heavily speculating, you might
stall the pipeline at that point or at the next ld/st.
In this case it should be OK since there is no
other ld/st close after the popmi, and the popmi has
been scheduled a few instructions after the cmp.

> Also curious is why r4 and r5 are used, rather than
> the callee-saved r1 and r2.  What a waste of 4 bytes stack space.

That's what I thought too.  I am wondering if it's a property
of gcc 4.2 itself or of the ARM back-end.
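
For reference, loops of roughly the following shape would be
consistent with the disassembly quoted above.  This is only a guess:
the original C source is not shown in this message, and the body of
q here is a hypothetical stand-in that counts calls.  Note that the
counter and the bound are both live across the call to q, and under
the ARM procedure call standard r0-r3 may be clobbered by a callee,
which may be why the compiler keeps them in r4 and r5.

```c
/* Hypothetical stand-in for the external function q(); the real
 * one is not shown in the thread. */
static int q_calls;

void q(void)
{
    q_calls++;
}

/* Assumed shape of f1: signed counter, which would match the early
 * return taken when n < 1 (cmp r0, #1 / popmi). */
void f1(int n)
{
    for (int i = 0; i < n; i++)
        q();
}

/* Assumed shape of f2: unsigned counter, which would match the
 * early return taken only when n == 0 (subs / popeq). */
void f2(unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        q();
}
```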

Laurent