[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization

Siarhei Siamashka siarhei.siamashka
Wed Jul 2 11:45:37 CEST 2008


On Wed, Jul 2, 2008 at 1:00 AM, Måns Rullgård <mans at mansr.com> wrote:
> "Siarhei Siamashka" <siarhei.siamashka at gmail.com> writes:
[...]
>> Typically pre-decrement is always preferred in code optimized for
>> performance as it is generally faster. Something like this would be
>> better (also it is closer to the old code):
>> while( --coeff_count >= 0 ) {
>> ...
>> }
>>
>> You can try to compile this sample with the best possible
>> optimizations, look at the assembly output and check where the
>> generated code is better and why:
>>
>> /**********************/
>> int q();
>>
>> void f1(int n)
>> {
>>     while (--n >= 0) {
>>         q();
>>     }
>> }
>>
>> void f2(int n)
>> {
>>     while (n--) {
>>         q();
>>     }
>> }
>> /**********************/
>
> Any half-decent compiler should generate the same code for those two
> functions.

That's not true, simply because these two functions are not equivalent.
Hint: what happens if you pass -1 or any other negative value to these
functions?

> GCC for ARM generates a slightly different, but equivalent, setup sequence, and the loops are exactly the same.

In my case, gcc 3.4.4 (using '-march=armv6 -O3 -c' options) generated
the following assembly output, which is definitely better for 'f1' (3
instructions in the inner loop instead of 4):

00000000 <f1>:
   0:   e92d4010        stmdb   sp!, {r4, lr}
   4:   e2504001        subs    r4, r0, #1      ; 0x1
   8:   48bd8010        ldmmiia sp!, {r4, pc}
   c:   ebfffffe        bl      0 <q>
  10:   e2544001        subs    r4, r4, #1      ; 0x1
  14:   5afffffc        bpl     c <f1+0xc>
  18:   e8bd8010        ldmia   sp!, {r4, pc}

0000001c <f2>:
  1c:   e92d4010        stmdb   sp!, {r4, lr}
  20:   e2504001        subs    r4, r0, #1      ; 0x1
  24:   38bd8010        ldmccia sp!, {r4, pc}
  28:   e2444001        sub     r4, r4, #1      ; 0x1
  2c:   ebfffffe        bl      0 <q>
  30:   e3740001        cmn     r4, #1  ; 0x1
  34:   1afffffb        bne     28 <q+0x28>
  38:   e8bd8010        ldmia   sp!, {r4, pc}

I'm curious, what is the output of your compiler?

> I can't be bothered to check x86.

But I can. For this particular case, here is how the generated code
for the two variants differs in 'decode_cabac_residual':
"while( --coeff_count >= 0 ) { ... }"

...
    3022:   66 89 04 4a             mov    %ax,(%edx,%ecx,2)
    3026:   83 6c 24 1c 04          subl   $0x4,0x1c(%esp)
    302b:   83 6c 24 0c 01          subl   $0x1,0xc(%esp)
    3030:   0f 89 06 fe ff ff       jns    2e3c <decode_cabac_residual+0x42d>
    3036:   e9 d3 01 00 00          jmp    320e <decode_cabac_residual+0x7ff>
    303b:   8b 54 24 08             mov    0x8(%esp),%edx
    303f:   81 c2 bc 1d 02 00       add    $0x21dbc,%edx
...

"while( coeff_count-- ) { ... }"

...
    3022:   66 89 04 4a             mov    %ax,(%edx,%ecx,2)
    3026:   83 6c 24 1c 04          subl   $0x4,0x1c(%esp)
    302b:   83 6c 24 0c 01          subl   $0x1,0xc(%esp)
>    3030:   83 7c 24 0c ff          cmpl   $0xffffffff,0xc(%esp)
    3035:   0f 85 01 fe ff ff       jne    2e3c <decode_cabac_residual+0x42d>
    303b:   e9 d3 01 00 00          jmp    3213 <decode_cabac_residual+0x804>
    3040:   8b 54 24 08             mov    0x8(%esp),%edx
    3044:   81 c2 bc 1d 02 00       add    $0x21dbc,%edx
...

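The expression 'while( coeff_count-- )' has one extra instruction
inside the loop in 'decode_cabac_residual' (the 'cmpl' marked with '>'
above), also increasing the size of the function by 5 bytes. The
compiler seems to internally convert it into 'while( --coeff_count !=
-1 )', which is less efficient: the comparison with -1 needs an
explicit compare instruction, while the '>= 0' test can branch
directly on the sign flag already set by the decrement.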

I compiled FFmpeg on a Pentium-M with gcc 4.2.3 using just
'./configure && make'. Let me know if you get different results with
other versions of gcc or other optimization options.

Of course, benchmarking with 'decicycles' can hardly detect a
difference of just 1 instruction reliably. Also, gcc may generate
different code for other parts of the source as a side effect, but
such changes are unrelated to the "while( coeff_count-- ) { ... }" vs.
"while( --coeff_count >= 0 ) { ... }" case.


