[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Wed May 21 15:19:33 CEST 2008

On Wednesday 21 May 2008, Dmitry Antipov wrote:
> Siarhei Siamashka wrote:
> > Please add the following implementation of "pix_sum" function to your
> > benchmark set and post the results. I strongly suspect that it is a lot
> > faster than any of your variants.
>
> I've updated http://78.153.153.8/tmp/pix_sum.c and
> http://78.153.153.8/tmp/pix_sum.txt (BTW, it might be offline for now due
> to some issues with my internet connection).
>
> This is an extract from pix_sum.txt (PMUs - performance monitoring unit
> clock cycles, [16], [32], etc. is the pix_sum line size):

[...]

> These 0.1-0.4% are marginal, but stable - a few tens of runs give
> approximately the same percentages, and your version was never faster.
>
> As for code size, both versions contain 68 instructions.
>
> pix_sum_iwmmxt2_last() was:

[...]

This is strange. If we assume that back-to-back WLDRD instructions
introduce a 1-cycle stall and that the WLDRD result latency is 3 cycles
(as the WMMX2 optimization manual describes), "pix_sum_iwmmxt2_pipelined"
should have no stalls except for a few unavoidable ones in the
function epilogue.

Your version, in contrast, should have many additional stalls
because of back-to-back loads (22 cycles). And all the Intel
WMMX manuals clearly state that the CPU can't sustain a rate of one
64-bit load per cycle, so your code is not optimal.

This all makes me think that we don't have a clear picture of
how your CPU works. Is the WLDRD result latency actually 4 cycles on
your CPU? Or is it impossible to immediately load a new value into a
register that is still "locked" by a previous operation? Or is it
something else? Knowing what happens there will help us get the
fastest code.

Please run some practical experiments to check whether there are
stalls in the code and where they are located.

Regarding your benchmark: how many cycles is 1 PMU? Please add
compensation for the syscall overhead to your benchmarking code. Something like:

t1 = ccnt();
t2 = ccnt();
benchmark();
t3 = ccnt();

t = (t3 - t2) - (t2 - t1);
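
To make the idea above concrete, here is a minimal compilable sketch. It
rests on assumptions: ccnt() here is only a portable stand-in built on the
standard clock() call (on the real target it would read the PMU cycle
counter), and busy_work() and measure() are hypothetical names used just
for illustration.

```c
#include <time.h>

/* Stand-in for ccnt() (assumption): the real one would read the PMU
 * cycle counter. clock() is much coarser, but the arithmetic is the same. */
static long ccnt(void)
{
    return (long)clock();
}

/* Hypothetical payload state; volatile so the loop is not optimized away. */
static volatile unsigned long sink;

/* Hypothetical payload; replace with the code under test. */
static void busy_work(void)
{
    for (unsigned long i = 0; i < 5000000UL; i++)
        sink++;
}

/* Time one call of fn, with the cost of reading the timer subtracted. */
static long measure(void (*fn)(void))
{
    long t1 = ccnt();
    long t2 = ccnt();   /* t2 - t1 is the cost of one timer read */
    fn();
    long t3 = ccnt();   /* t3 - t2 is payload plus one timer read */
    return (t3 - t2) - (t2 - t1);
}
```

Calling measure(busy_work) returns the payload time in timer ticks. With a
coarse timer the correction term (t2 - t1) will often be zero, which is
exactly the case where the loop-based arrangement is more useful.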

If the PMU resolution is too low and 1 PMU is much more than 1 cycle, you
can consider the following way to arrange the test. The benchmark() function
can consist of a loop which calls some test function via a pointer. In order
to compensate for the loop and function call overhead, you can have an empty
dummy function and subtract the time of running it. Something like this:

void benchmark(void (*testfn)())
{
    int i;
    for (i = 0; i < N; i++)
        testfn();
}

void dummytestfn()
{
}

void realtestfn()
{
/* do some stuff */
}

Subtracting the time of "benchmark(dummytestfn)" from that of
"benchmark(realtestfn)" will give you the time spent executing the
"realtestfn" function body. But you need to inspect the generated code and
make sure that the compiler does not inline these functions, which would
screw up the results.
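
A fuller sketch of this arrangement, under some assumptions: the standard
clock() call stands in for the PMU counter, the values of N and WORK are
arbitrary, and net_cost() is a hypothetical helper name. Calling through a
volatile function pointer is one possible way to discourage inlining; it
does not replace inspecting the generated code.

```c
#include <time.h>

#define N    1000000L   /* calls per measurement (arbitrary choice) */
#define WORK 16         /* size of the sample payload (arbitrary choice) */

typedef void (*testfn_t)(void);

/* volatile so the payload stores cannot be optimized away */
static volatile unsigned long sink;

static void dummytestfn(void)
{
}

static void realtestfn(void)
{
    /* do some stuff */
    for (int i = 0; i < WORK; i++)
        sink++;
}

/* Call testfn N times and return the elapsed clock() ticks. */
static double benchmark(testfn_t testfn)
{
    /* The volatile pointer is reloaded on every iteration, so the
     * compiler has to emit a real indirect call each time. */
    testfn_t volatile fn = testfn;
    clock_t start = clock();
    for (long i = 0; i < N; i++)
        fn();
    return (double)(clock() - start);
}

/* Average ticks per call of the function body, with loop and call
 * overhead (measured on the empty dummy function) subtracted. */
static double net_cost(testfn_t testfn)
{
    double with_body = benchmark(testfn);
    double overhead  = benchmark(dummytestfn);
    return (with_body - overhead) / N;
}
```

net_cost(realtestfn) then gives the per-call cost in clock() ticks; dividing
by CLOCKS_PER_SEC converts it to seconds. On the real target the same
structure would apply with the PMU counter in place of clock().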

The goal is to get a cycle-precise way of measuring performance, because
there are still too many unresolved mysteries here.

-- 
Best regards,
Siarhei Siamashka