[FFmpeg-devel] [PATCH] Use av_clip_uint8 in swscale.

Sun Aug 16 17:18:02 CEST 2009

Michael Niedermayer <michaelni at gmx.at> writes:

> On Sun, Aug 16, 2009 at 01:19:39AM +0100, Måns Rullgård wrote:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>> 
>> > On Sat, Aug 15, 2009 at 05:53:49PM +0100, Måns Rullgård wrote:
>> >> Reimar Döffinger <Reimar.Doeffinger at gmx.de> writes:
>> >> 
>> >> > On Sat, Aug 15, 2009 at 12:27:49PM -0300, Ramiro Polla wrote:
>> >> >> diff --git a/swscale.c b/swscale.c
>> >> >> index c513066..340acfc 100644
>> >> >> --- a/swscale.c
>> >> >> +++ b/swscale.c
>> >> >> -            if ((u|v)&256){
>> >> >> -                if (u<0)        u=0;
>> >> >> -                else if (u>255) u=255;
>> >> >> -                if (v<0)        v=0;
>> >> >> -                else if (v>255) v=255;
>> >> >> -            }
>> >> >> -
>> >> >> -            uDest[i]= u;
>> >> >> -            vDest[i]= v;
>> >> >> +            uDest[i]= av_clip_uint8((chrSrc[i       ]+64)>>7);
>> >> >> +            vDest[i]= av_clip_uint8((chrSrc[i + VOFW]+64)>>7);
>> >> >
>> >> > And this needs to be benchmarked (well, or at least have a look at the
>> >> > generated code).
>> >> > If clipping is very, very rare, the original code might be faster.
>> >> 
>> >> Depends on hardware.  On processors with fast clipping instructions,
>> >> always clipping is likely to be faster.
>> >
>> > if they are fast enough, sure, but which cpu would that be?
>> 
>> ARM and AVR32 to name two.
>
> I don't really know ARM & AVR32 asm ...
> but I must admit that I am surprised that some CPU has clipping instructions
> that match a simple bitwise OR in throughput. I guess I should spend
> more time with non-x86 asm

ARM can shift and saturate in one cycle.  On AVR32 shift+sat has one
issue cycle and two cycles latency.  On either architecture, two of
those is definitely faster than some bitwise logic and a conditional
branch.
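
To make the two shapes under discussion concrete, here is a minimal
sketch in C. clip_uint8(), store_guarded() and store_always() are
hypothetical names for illustration; clip_uint8() is only a stand-in
for av_clip_uint8(), not its actual definition. The guarded form
relies on the assumption that u and v stay within [-256, 511], which
is what makes the "& 256" test sufficient.

```c
#include <stdint.h>

/* Stand-in for av_clip_uint8(); NOT FFmpeg's actual definition. */
static inline uint8_t clip_uint8(int a)
{
    if (a < 0)   return 0;
    if (a > 255) return 255;
    return (uint8_t)a;
}

/* Pre-patch shape: one cheap test guards both clips.
 * Assumption: u and v lie in [-256, 511], so bit 8 is set
 * exactly when a value falls outside [0, 255]. */
static inline void store_guarded(uint8_t *ud, uint8_t *vd, int u, int v)
{
    if ((u | v) & 256) {
        if (u < 0)        u = 0;
        else if (u > 255) u = 255;
        if (v < 0)        v = 0;
        else if (v > 255) v = 255;
    }
    *ud = (uint8_t)u;
    *vd = (uint8_t)v;
}

/* Post-patch shape: clip unconditionally, no guard. */
static inline void store_always(uint8_t *ud, uint8_t *vd, int u, int v)
{
    *ud = clip_uint8(u);
    *vd = clip_uint8(v);
}
```

Within the assumed input range both forms store the same bytes; which
one is faster depends on whether the target can saturate cheaply and
on what the compiler makes of clip_uint8().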

>> > besides which compiler would turn the pure C av_clip_uint8 into such
>> > instructions ?
>> 
>> We could write an asm version of it.
>
> yes but that brings us back to the issue of cpu specific optimizations
> in libavutil headers ...

... which we need to find an acceptable solution to.

> besides we would need more than an optimized av_clip_uint8() because on
> x86 4 ORs and 1 clip check are faster than 4 clipping checks

Shocking.
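
For reference, the batched variant Michael describes might look
roughly like the sketch below. store4_guarded() and clip_uint8() are
hypothetical names, not existing swscale or libavutil functions. The
test here uses "& ~0xFF" rather than "& 256", so it does not depend on
any assumption about the input range (only on two's complement ints).

```c
#include <stdint.h>

/* Stand-in for av_clip_uint8(); NOT FFmpeg's actual definition. */
static inline uint8_t clip_uint8(int a)
{
    if (a < 0)   return 0;
    if (a > 255) return 255;
    return (uint8_t)a;
}

/* OR four values together and test once.  Any bit set above the
 * low 8 means at least one value is outside [0, 255]; only then
 * is the per-value clip executed. */
static void store4_guarded(uint8_t dst[4], const int src[4])
{
    int i;

    if ((src[0] | src[1] | src[2] | src[3]) & ~0xFF) {
        for (i = 0; i < 4; i++)
            dst[i] = clip_uint8(src[i]);   /* rare slow path */
    } else {
        for (i = 0; i < 4; i++)
            dst[i] = (uint8_t)src[i];      /* common fast path */
    }
}
```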

-- 
Måns Rullgård
mans at mansr.com