[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Thu May 22 18:38:30 CEST 2008

On Thursday 22 May 2008, Dmitry Antipov wrote:
> Siarhei Siamashka wrote:
> > The goal is to get some cycle precise way of measuring performance,
> > because there are still too many unresolved mysteries here.
>
> Good news - it's done. Due to a typo in Marvell docs (THANK YOU, Marvell),
> my PMU was programmed to count the time in the units of 64 clock cycles.
> Now it's fixed, and the following code:
>
> int b, e;
> asm volatile("mrc p14, 0, %0, c1, c1, 0\n\t" /* read CCNT */
>               "mov r0, #32\n\t"
>               "1: add r1, r1, r0\n\t"
>               "subs r0, r0, #1\n\t"
>               "bne 1b\n\t"
>               "mrc p14, 0, %1, c1, c1, 0\n\t" /* read CCNT */
>
>               : "=r"(b), "=r"(e) : : "r0", "r1", "cc");
>
> printk("%d\n", e - b);
>
> reports 117 cycles (probably 96 for loop body + 1 MOV + 2 * 10 for MRC).
> This precise benchmark was done from the kernel context to allow direct MRC
> and avoid system call as well as other possible user-space overheads.

That's really great news!
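
Just to spell out the arithmetic behind the 117 cycles, here is a small sketch.
The per-instruction costs are only the guesses quoted above (3 cycles per loop
iteration, 1 for MOV, 10 per MRC); they fit the measurement but are not
documented figures, and the names are made up for illustration:

```c
#include <assert.h>

/* Assumed costs: guesses that fit the measurement, not documented numbers. */
enum {
    LOOP_ITERATIONS = 32, /* mov r0, #32 */
    CYCLES_PER_ITER = 3,  /* add + subs + taken bne (96 / 32) */
    MOV_CYCLES      = 1,  /* loop counter setup */
    MRC_CYCLES      = 10, /* one CCNT read */
    MRC_COUNT       = 2   /* CCNT is read before and after the loop */
};

/* Expected cycle count for the quoted test under the assumptions above. */
int estimated_cycles(void)
{
    return LOOP_ITERATIONS * CYCLES_PER_ITER
         + MOV_CYCLES
         + MRC_COUNT * MRC_CYCLES;
}

/* Cycles in a measurement that the simple model does not account for. */
int unexplained_cycles(int measured)
{
    assert(measured >= 0);
    return measured - estimated_cycles();
}
```

With the reported measurement, unexplained_cycles(117) comes out as zero, so
the simple model at least does not contradict the counter.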

> Look at http://78.153.153.8/tmp/loadwmmx-linear.c vs.
> http://78.153.153.8/tmp/loadwmmx-pipelined.c. The first takes 22 cycles,
> and the second takes 23. But both of them have a magic place (commented as
> such): after inserting a NOP there, the first version speeds up to 18
> cycles, and the second to 19.
>
> This was just a warm-up - look at http://78.153.153.8/tmp/loadwmmx.c. It
> gives 23 for both. But:
>   1) inserting NOP at magic point 1 gives 19/23;
>   2) inserting NOP at magic point 2 gives 23/23 in most of the runs, but
>      23/19 sometimes;
>   3) inserting NOP at both magic points gives 23/27.

This magic is most likely related to the fact that the WSADB instruction has
additional latency when its result is used by anything other than WSAD-like
instructions, which have a special forwarding path. This unusual WSAD
behaviour is mentioned briefly in the manual. But the fact that inserting a
NOP improves performance is rather unexpected.

It is quite understandable why the pipelined version is 1 cycle slower:

A01
... - empty slot
A02
B01
A03
B02
{some more stuff}
A98
B97
A99
B98
... - empty slot
B99

Here we have one stall at the end of the function which can't be filled with
anything useful. That's the price of separating WLDRD instructions from
each other (if/when such separation is needed).

> So, the best result achieved by linear version is 18 cycles, and 19 for the
> pipelined one.
>
> I suspect that 'real' speed depends on the code size and its layout -
> function body size and its placement within the instruction cache should
> be taken into account here.

There are two (potential) issues in your "loadwmmx.c" test code.

First, it is better to do some "warming up" and call the tested functions at
least once before benchmarking, so that they are loaded into the instruction
cache and all the data they use gets loaded into the data cache. You need to
be careful with the data cache, because ARM cores may have read-allocate cache
behaviour configured. With a read-allocate cache, cache lines are not allocated
on write misses, so just initializing an array by writing to it may not be
enough to ensure that it got into the cache.
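
Something along these lines should do for the warm-up. It is only a sketch:
the 32-byte cache line size is an assumption (check the manual for your core),
and warm_data_cache(), run_warmed() and bench_fn are names made up for
illustration, not anything from your test code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32 /* assumed line size, check your core's manual */

/* Read one byte per cache line, so that even a read-allocate cache pulls
 * the buffer in. The volatile pointer keeps the loads from being optimized
 * away; the sum of the touched bytes is returned for convenience. */
unsigned warm_data_cache(const uint8_t *buf, size_t len)
{
    const volatile uint8_t *p = buf;
    unsigned sum = 0;
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i += CACHE_LINE)
        sum += p[i];
    return sum;
}

/* Hypothetical signature of a function under test. */
typedef void (*bench_fn)(const uint8_t *buf);

/* Warm both caches, then make the call that is worth timing. */
void run_warmed(bench_fn fn, const uint8_t *buf, size_t len)
{
    (void)warm_data_cache(buf, len); /* data cache, by reading */
    fn(buf);                         /* untimed call, fills the icache */
    fn(buf);                         /* wrap this call with CCNT reads */
}
```

The point is that the buffer is touched by reads, not only by the writes that
initialize it.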

Second, and more important: your test buffer is a byte array and it is not
guaranteed to be 8-byte aligned. The WLDRD instruction requires strict 8-byte
alignment, at least on WMMX1 cores (unless the documentation is completely
wrong). You can use '/proc/cpu/alignment' to check and configure how
unaligned memory reads/writes are handled on your system. Theoretically, it
could be that the CPU just ignores unaligned WLDRD reads, in which case your
benchmark does not make much sense. But it could also be that PXA3xx cores
support unaligned memory access; I don't know.
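
A minimal sketch of how the buffer could be forced to 8-byte alignment,
assuming GCC (the aligned attribute is a GCC extension); test_buf,
is_aligned8() and get_test_buf() are hypothetical names, not taken from
your code:

```c
#include <assert.h>
#include <stdint.h>

/* The GCC attribute forces the 8-byte alignment that WLDRD expects. */
static uint8_t test_buf[64] __attribute__((aligned(8)));

/* Nonzero if the pointer is a multiple of 8. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Hand out the buffer only after checking its alignment. */
const uint8_t *get_test_buf(void)
{
    assert(is_aligned8(test_buf));
    return test_buf;
}
```

Passing get_test_buf() + 1 to the tested functions would then be an easy way
to see what the core really does with a misaligned WLDRD.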

And also, please attach your code fragments to your mail messages; they should
not be that large, and you can always compress them. I doubt that you will keep
these files on your server forever :) And if someone reads the mailing list
archives later, trying to find some useful information on WMMX optimizations,
he may have a hard time figuring out what we were talking about.

> This test definitely shows that up to 8 back-to-back loads is good enough
> for XScale cores with WMMX2. 

Agreed (once you confirm the results after fixing the glitches mentioned
above). This also proves that you can't put too much trust in vendor-supplied
optimization manuals (or at least you can't always be sure that you
understand them correctly). Looks like you need to bug Intel/Marvell
to clarify their optimization docs :)

> But, to be as good as possible on older cores 
> with WMMX, I suspect it's better to avoid such loads anyway.

Yes, it is probably better not to use such loads in code optimized for
older cores. But running benchmarks on some PXA27x core would definitely help
us make the right decision.

-- 
Best regards,
Siarhei Siamashka