[Ffmpeg-devel] [RFC] Addition of JIT accelerated scaler for ARM into libswscale

Tue Jan 23 01:03:10 CET 2007

On Tuesday 23 January 2007 01:12, Guillaume POIRIER wrote:

> > A natural solution for getting good scaler performance is to use JIT
> > style dynamic code generation. I spent two full days last weekend
> > and got an initial scaler implementation working (it is quite simple
> > and straightforward and uses less than 300 lines of code):
> > https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libswscale_nokia770/?root=mplayer
> >
> > Its API is quite similar to libswscale, but a bit simplified. You need to
> > initialize a scaler context by providing the source and destination
> > resolutions and a quality level setting. Code for scaling a horizontal
> > line of pixels is dynamically generated at this stage. Once the context
> > is initialized, it can be used to scale planar YUV image data and get
> > results in YUY2 format.
>
> I may sound like a rookie to ask this, but could you tell me what
> exactly dynamic code generation allows one to do that can't be done
> with "straight code"?
> Also, why can (optimized) dynamic code be faster than "straight code"?

We need a pixel line scaler function that converts N pixels to M here.

There is one important difference between dynamically generated and static
code. Static precompiled code does not know the N and M values beforehand
and has to handle all the cases at runtime by introducing some extra logic,
branching or conditionally executed code. If you have any additional
information, you can get a faster implementation. An obvious example is the
case when N == M: a special unscaled variant is a lot faster than a universal
one :)
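
To illustrate the point, here is a minimal sketch in C of a universal
nearest neighbour line scaler next to the unscaled special case. Both
function names are made up for this example and this is not code from
libswscale or from my scaler; it only shows the per-pixel work that the
universal variant cannot avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Universal variant (hypothetical helper): N and M are only known at run
 * time, so every destination pixel pays for a fixed-point position update,
 * a shift to get the source index, and a loop condition check. */
static void scale_line_generic(uint8_t *dst, size_t m,
                               const uint8_t *src, size_t n)
{
    if (m == 0 || n == 0)
        return;
    assert(n <= 0xFFFF);  /* keeps the 16.16 fixed-point position in 32 bits */

    uint32_t step = (uint32_t)(((uint64_t)n << 16) / m);  /* source step, 16.16 */
    uint32_t pos = 0;
    for (size_t i = 0; i < m; i++) {
        dst[i] = src[pos >> 16];  /* nearest (floor) source pixel */
        pos += step;
    }
}

/* Special case N == M (hypothetical helper): nothing to compute per pixel. */
static void scale_line_unscaled(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}
```

Calling scale_line_generic(dst, 400, src, 640) does the position
arithmetic 400 times per line even though the resulting source offsets are
the same for every line of the image.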

In my tests from the previous post, scaling from 640 pixels to 400 was
required. A universal function can't get anything useful from this
information. But if we need a nearest neighbour scaler, for example, generating
code for such a line scaler is simple: we just have to take bytes from some
offsets in the source buffer and put them at some offsets in the destination
buffer. So we generate a straight sequence of instructions that reads 400
bytes from and writes 400 bytes to predefined locations (no condition checks
and no offset calculations are needed). That's why dynamically generated code
is faster. Surely, if you know the source and destination image widths at
compile time, you can develop a special optimized implementation. But you
can't put all the possible variants of this function into the executable. A
dynamic code generator can be treated as a black box which provides a
(somewhat) optimized function for each particular pair of N and M values at
runtime, whenever you need it.
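
As a rough sketch of what such a generator could look like for the nearest
neighbour case, the C function below emits one ARM (A32) ldrb/strb pair per
destination pixel with the offsets baked in as immediates, followed by a
return. This is an illustration only, not the generator from the repository
mentioned above: the function name emit_nn_line_scaler is made up, the
generated routine is assumed to receive the destination pointer in r0 and
the source pointer in r1, and it writes plain bytes rather than YUY2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emit ARM (A32) machine code for a nearest neighbour line scaler that
 * converts n source bytes into m destination bytes. The generated routine
 * is assumed to be called as  void f(uint8_t *dst, const uint8_t *src),
 * i.e. r0 = dst, r1 = src, with r2 used as a scratch register.
 * Returns the number of 32-bit words written, or 0 if the buffer is too
 * small. (Hypothetical helper written for this example.) */
static size_t emit_nn_line_scaler(uint32_t *code, size_t max_words,
                                  unsigned n, unsigned m)
{
    /* ldrb/strb immediates are 12 bits wide, so offsets must be < 4096 */
    assert(n >= 1 && n <= 4096 && m >= 1 && m <= 4096);
    if (max_words < 2 * (size_t)m + 1)
        return 0;

    size_t k = 0;
    for (unsigned i = 0; i < m; i++) {
        /* source offset is computed once here, at generation time */
        unsigned src_off = (unsigned)(((uint64_t)i * n) / m);
        code[k++] = 0xE5D12000u | src_off;  /* ldrb r2, [r1, #src_off] */
        code[k++] = 0xE5C02000u | i;        /* strb r2, [r0, #i]       */
    }
    code[k++] = 0xE12FFF1Eu;                /* bx lr                   */
    return k;
}
```

For n = 640 and m = 400 this emits 801 instructions (3204 bytes) with no
branches and no address arithmetic. To actually run the result on an ARM
target, the words would still have to be placed in executable memory and
the instruction cache synchronized before the first call; that part is
platform specific and is left out here.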

> I have never written a single line of such kind of code, so I'm
> curious. Plus, modern CPUs (PPC, x86 at least) make it harder to
> program efficient dynamic code, so I heard.
> For instance, if I remember correctly, P4 flushes its trace cache
> whenever code cache is written.... pretty inefficient, isn't it?

Code is written only when the line scaler function is generated (at
initialization), so it does not matter much. When actually performing scaling,
the code is only executed, not modified. The only important requirement here
is that the scaler function should fit in the instruction cache.