[FFmpeg-devel] [PATCH] Use av_clip_uint8 in swscale.

Måns Rullgård mans
Tue Aug 18 01:51:33 CEST 2009


Frank Barchard <fbarchard at google.com> writes:

> 2009/8/17 M?ns Rullg?rd <mans at mansr.com>
>
>> And you might know it's unbounded.
>>
> yes.  So a general purpose function has to do it the slow way.  But this
> function is applied within swscaler on YUV data with a know range.

This function is meant to be generic, not only for use in libswscale.

>> > If you combined 3 bytes, its 768 values.
>>
>> "combined"?
>
> Image operations are usually a function of 2 images, or 4 channels, which
> puts a limit on how far out of range they can be.

Yes, *USUALLY*.  Do you know the difference between something that
*usually* works and something that *always* works?

>> > I know if statements are increasingly efficient, and memory less
>> > efficient, but the original code had 4 to 6 instructions and
>> > potentially 2 branches taken per clipped value.  av_clip_uint8()
>> > can be optimized to a single instruction on most CPU's
>>
>> Yes, on those with dedicated clip instructions.  Others will need
>> several instructions to support the full 32-bit range.  Even if the
>> range is known to be smaller, a table lookup can be slower than a few
>> compares and conditional instructions, and it poisons the cache
>> needlessly.
>
> Here's a benchmark on my code that is very similar.  This version, including
> YUV conversion, runs in 2.97ms
>
> static inline uint32 clip(int32 value) {
>
>   if (value < 0)    return 0u;   if (value > 65535)     return 255u;
> return static_cast<uint32>(value >> 8);}
>
> *This code runs in 2.11ms*
>
> static inline uint32 clip(int32 value) {  return
> static_cast<uint32>(g_rgb_clip_table[((value) >> 8) +
> kClipOverflow]);}

That means only that your compiler sucks.

> The table is read only, so the cache lines are not dirty, and image
> data tends to be coherent and only use a portion of table.  The tables
> for simple YUV clipping are 832 bytes.

That's a significant portion of a 16k cache.

>> >> > On x86, there is cmov, but in the above code it would take cmp,
>> >> > cmov, cmp, cmov to do each value, whereas the table method takes
>> >> > one mov instruction.
>> >>
>> >> You're forgetting the address calculation.
>> >
>> > movzx eax,cliptbl[eax*4]
>>
>> Now you're back at the 4GB table.  And where did the value of
>> "cliptbl" come from?  It would have to be loaded from somewhere.
>
> cliptbl is an array.  You can index directly off arrays on x86.

But not on x86-64 or most other modern CPUs.

-- 
M?ns Rullg?rd
mans at mansr.com



More information about the ffmpeg-devel mailing list