[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization

Alexander Strange astrange
Fri Jul 4 07:32:47 CEST 2008


On Jul 1, 2008, at 5:23 PM, Alexander Strange wrote:

>
> On Jul 1, 2008, at 2:44 PM, Michael Niedermayer wrote:
>
>> On Tue, Jul 01, 2008 at 01:14:56PM -0400, Alexander Strange wrote:
>>>
>>> 1 - Don't check for encoding in hl_decode_mb, since there's no  
>>> encoder.
>>> 2 - Turn off cloning hl_decode_mb under CONFIG_SMALL.
>>> 3 - Clone decode_cabac_residual depending on cat. This gets rid of  
>>> several
>>> cat checks and the if (qmul) for every coeff.
>>>
>>> On Core 2:
>>> Before-
>>> 24205 dezicycles in decode_cabac_residual, 1048505 runs, 71 skips
>>> 23079 dezicycles in decode_cabac_residual, 2096992 runs, 160 skips
>>> 22186 dezicycles in decode_cabac_residual, 4193933 runs, 371 skips
>>> 22444 dezicycles in decode_cabac_residual, 8387862 runs, 746 skips
>>>
>>> After-
>>> 24035 dezicycles in decode_cabac_residual, 1048502 runs, 74 skips
>>> 22922 dezicycles in decode_cabac_residual, 2096981 runs, 171 skips
>>> 22037 dezicycles in decode_cabac_residual, 4193955 runs, 349 skips
>>> 22293 dezicycles in decode_cabac_residual, 8387927 runs, 681 skips
>>>
>>> cabac_residual's time is definitely not normally distributed, so I  
>>> wouldn't
>>> trust the cycle counts for anything, but at least it went down.
>>>
>>> 4 - reindention
>>> 5 - Simplify "for( coeff_count--; coeff_count >= 0;  
>>> coeff_count-- )".
>>> 6 - Reorder the cat checks based on frequency, mostly from  
>>> counting block
>>> types in x264 output. gcc statically predicts if == to be not  
>>> taken, so the
>>> more common one should be in the else block. Moving cat == 5 to  
>>> the top of
>>> the if/elses also makes it more obvious that it can be merged into  
>>> the
>>> significance map.
>>> 7 - Use get_cabac_bypass_sign in both branches of the loop.
>>>
>>> Before-
>>> 24043 dezicycles in decode_cabac_residual, 524159 runs, 129 skips
>>> 24462 dezicycles in decode_cabac_residual, 1048335 runs, 241 skips
>>> 23323 dezicycles in decode_cabac_residual, 2096708 runs, 444 skips
>>> 22416 dezicycles in decode_cabac_residual, 4193483 runs, 821 skips
>>>
>>> After-
>>> 23318 dezicycles in decode_cabac_residual, 524112 runs, 176 skips
>>> 23712 dezicycles in decode_cabac_residual, 1048286 runs, 290 skips
>>> 22608 dezicycles in decode_cabac_residual, 2096607 runs, 545 skips
>>> 21731 dezicycles in decode_cabac_residual, 4193412 runs, 892 skips
>>>
>>> There are some interesting things to look at after these-
>>> - gcc pointlessly unrolls the "while( coeff_abs < 15 &&  
>>> get_cabac( CC, ctx
>>> ) )" loop into taking up about half of the compiled function when  
>>> x86 asm
>>> is used, so it could be rewritten in asm to fix that.
>>> - get_cabac_bypass* could use cmov, and refill() is so small it  
>>> might take
>>> less time than a branch mispredict penalty.
>>> - The asm operands for get_cabac itself could be relaxed (not  
>>> requiring
>>> output to eax, using +m/+g, that kind of thing). I tried this  
>>> earlier and
>>> got bad results, but maybe it'll work better after this.
>>> - refill* should use AV_RB16.
>>> - I'm not sure what CABAC_ON_STACK is used for.
>>>
>>
>>> But I don't want to spend too much time not doing SoC - someone  
>>> else can
>>> try those if they want.
>>
>> fine, whoever wants to work on this should post
>> proper benchmarks, not 1 run after half a dozen changes
>
> BTW, that benchmark was only for the last patch.
>
>>> diff -ru --exclude='*svn*' ffmpeg-/libavcodec/h264.c ffmpeg/ 
>>> libavcodec/h264.c
>>> --- ffmpeg-/libavcodec/h264.c	2008-06-30 14:47:21.000000000 -0400
>>> +++ ffmpeg/libavcodec/h264.c	2008-06-30 14:47:36.000000000 -0400
>>> @@ -2726,10 +2726,7 @@
>>>    MpegEncContext * const s = &h->s;
>>>    const int mb_xy= h->mb_xy;
>>>    const int mb_type= s->current_picture.mb_type[mb_xy];
>>> -    int is_complex = FRAME_MBAFF || MB_FIELD ||  
>>> IS_INTRA_PCM(mb_type) || s->codec_id != CODEC_ID_H264 ||  
>>> (ENABLE_GRAY && (s->flags&CODEC_FLAG_GRAY)) || s->encoding;
>>> -
>>> -    if(!s->decode)
>>> -        return;
>>> +    int is_complex = FRAME_MBAFF || MB_FIELD ||  
>>> IS_INTRA_PCM(mb_type) || s->codec_id != CODEC_ID_H264 ||  
>>> (ENABLE_GRAY && (s->flags&CODEC_FLAG_GRAY));
>>
>> the if(!decode) should be under a #ifdef h264_encoder
>
> attached, and split the very long line
>
>> diff -ru --exclude='*svn*' ffmpeg-/libavcodec/h264.c ffmpeg/ 
>> libavcodec/h264.c
>>>
>>> --- ffmpeg-/libavcodec/h264.c	2008-06-30 14:47:53.000000000 -0400
>>> +++ ffmpeg/libavcodec/h264.c	2008-06-30 14:47:59.000000000 -0400
>>> @@ -5517,7 +5517,7 @@
>>>        }
>>>    }
>>>
>>> -    for( coeff_count--; coeff_count >= 0; coeff_count-- ) {
>>> +    while( coeff_count-- ) {
>>>        uint8_t *ctx = coeff_abs_level1_ctx[node_ctx] +  
>>> abs_level_m1_ctx_base;
>>>
>>>        int j= scantable[index[coeff_count]];
>>
>> ok if faster or same speed
>
> Same speed with two compilers here, slightly smaller code. Since the  
> compiler practically randomly rewrites a simple loop condition like  
> this, I don't want to claim it's any faster.
>
>> [...]
>>> diff -ru --exclude='*svn*' ffmpeg-/libavcodec/h264.c ffmpeg/ 
>>> libavcodec/h264.c
>>> --- ffmpeg-/libavcodec/h264.c	2008-06-30 14:48:05.000000000 -0400
>>> +++ ffmpeg/libavcodec/h264.c	2008-06-30 14:48:11.000000000 -0400
>>> @@ -5552,11 +5552,9 @@
>>>            }
>>>
>>>            if( is_dc ) {
>>> -                if( get_cabac_bypass( CC ) ) block[j] = -coeff_abs;
>>> -                else                                block[j] =   
>>> coeff_abs;
>>> +                block[j] = get_cabac_bypass_sign( CC, -coeff_abs );
>>>            }else{
>>> -                if( get_cabac_bypass( CC ) ) block[j] = (- 
>>> coeff_abs * qmul[j] + 32) >> 6;
>>> -                else                                block[j] =  
>>> ( coeff_abs * qmul[j] + 32) >> 6;
>>> +                block[j] = (get_cabac_bypass_sign( CC, - 
>>> coeff_abs ) * qmul[j] + 32) >> 6;
>>>            }
>>
>> ok if faster or same speed
>
> Slightly faster, small code size reduction.

Here are more specific benchmarks. I was having trouble with the skip  
detection ignoring valid times for complex residuals, so I increased  
the skip threshold and then ran it in single-user mode to make sure  
that wouldn't introduce any problems.

Lossless sample (40mbit) -
Before-
23655 dezicycles in decode_cabac_residual, 524288 runs, 0 skips
24073 dezicycles in decode_cabac_residual, 1048576 runs, 0 skips
22958 dezicycles in decode_cabac_residual, 2097152 runs, 0 skips
22075 dezicycles in decode_cabac_residual, 4194304 runs, 0 skips

Use a while loop-
23570 dezicycles in decode_cabac_residual, 524288 runs, 0 skips
23991 dezicycles in decode_cabac_residual, 1048576 runs, 0 skips
22881 dezicycles in decode_cabac_residual, 2097152 runs, 0 skips
22006 dezicycles in decode_cabac_residual, 4194304 runs, 0 skips

Reorder if/else-
23567 dezicycles in decode_cabac_residual, 524288 runs, 0 skips
23988 dezicycles in decode_cabac_residual, 1048576 runs, 0 skips
22875 dezicycles in decode_cabac_residual, 2097152 runs, 0 skips
21994 dezicycles in decode_cabac_residual, 4194304 runs, 0 skips

Use bypass_sign-
22829 dezicycles in decode_cabac_residual, 524288 runs, 0 skips
23220 dezicycles in decode_cabac_residual, 1048576 runs, 0 skips
22162 dezicycles in decode_cabac_residual, 2097152 runs, 0 skips
21322 dezicycles in decode_cabac_residual, 4194304 runs, 0 skips

1080p x264 (~4mbit)-
Before-
2798 dezicycles in decode_cabac_residual, 131069 runs, 3 skips
3568 dezicycles in decode_cabac_residual, 262137 runs, 7 skips
3459 dezicycles in decode_cabac_residual, 524275 runs, 13 skips
3610 dezicycles in decode_cabac_residual, 1048555 runs, 21 skips

Use a while loop-
2761 dezicycles in decode_cabac_residual, 131071 runs, 1 skips
3540 dezicycles in decode_cabac_residual, 262134 runs, 10 skips
3403 dezicycles in decode_cabac_residual, 524275 runs, 13 skips
3541 dezicycles in decode_cabac_residual, 1048554 runs, 22 skips

Reorder if/else-
2710 dezicycles in decode_cabac_residual, 131072 runs, 0 skips
3489 dezicycles in decode_cabac_residual, 262141 runs, 3 skips
3358 dezicycles in decode_cabac_residual, 524279 runs, 9 skips
3495 dezicycles in decode_cabac_residual, 1048562 runs, 14 skips

Use bypass_sign-
2700 dezicycles in decode_cabac_residual, 131071 runs, 1 skips
3449 dezicycles in decode_cabac_residual, 262140 runs, 4 skips
3321 dezicycles in decode_cabac_residual, 524281 runs, 7 skips
3456 dezicycles in decode_cabac_residual, 1048564 runs, 12 skips





More information about the ffmpeg-devel mailing list