[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2
Thu May 22 14:41:59 CEST 2008
Siarhei Siamashka wrote:
> The goal is to get some cycle precise way of measuring performance, because
> there are still too many unresolved mysteries here.
Good news - it's done. Due to a typo in Marvell docs (THANK YOU, Marvell), my
PMU was programmed to count time in units of 64 clock cycles. Now it's
fixed, and the following code:
int b, e;
asm volatile("mrc p14, 0, %0, c1, c1, 0\n\t" /* read CCNT (start) */
"mov r0, #32\n\t" /* loop counter */
"1: add r1, r1, r0\n\t"
"subs r0, r0, #1\n\t"
"bne 1b\n\t" /* repeat until r0 reaches 0 */
"mrc p14, 0, %1, c1, c1, 0\n\t" /* read CCNT (end) */
: "=r"(b), "=r"(e) : : "r0", "r1", "cc");
printk("%d\n", e - b);
reports 117 cycles (probably 96 for the loop body + 1 for MOV + 2 * 10 for MRC). This
benchmark was run from kernel context to allow direct MRC access and to avoid system call
and other possible user-space overheads.
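
The breakdown above can be written down as a small model. This is only a sketch in portable C, not code from the benchmark: the helper names are hypothetical, the 3 cycles per iteration is an assumption derived from 96 / 32, and the div64 flag merely mirrors the "count every 64 cycles" mode that caused the earlier confusion.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed CCNT ticks between two reads. Unsigned subtraction stays
 * correct across a single 32-bit counter wraparound. */
static uint32_t ccnt_elapsed(uint32_t begin, uint32_t end)
{
    return end - begin;
}

/* Convert raw ticks to core clock cycles. div64 is a hypothetical flag
 * meaning "the counter was programmed to tick once per 64 cycles". */
static uint64_t ccnt_ticks_to_cycles(uint32_t ticks, int div64)
{
    return (uint64_t)ticks * (div64 ? 64u : 1u);
}

/* Model of the measured sequence: loop body, setup instructions, and
 * the overhead attributed to the two MRC reads. */
static unsigned expected_cycles(unsigned iterations,
                                unsigned cycles_per_iteration,
                                unsigned setup_cycles,
                                unsigned mrc_cycles)
{
    assert(iterations > 0);
    return iterations * cycles_per_iteration + setup_cycles + 2 * mrc_cycles;
}
```

With the guessed costs, expected_cycles(32, 3, 1, 10) matches the reported 117. Taking the difference as unsigned also avoids a bogus negative value if CCNT wraps between the two reads.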
Look at http://188.8.131.52/tmp/loadwmmx-linear.c vs. http://184.108.40.206/tmp/loadwmmx-pipelined.c.
The first takes 22 cycles, and the second takes 23. But both of them have a magic place (commented as
such): inserting a NOP there speeds the first version up to 18 cycles, and the second to 19.
This was just a warm-up - look at http://220.127.116.11/tmp/loadwmmx.c. It gives 23 for both.
1) inserting NOP at magic point 1 gives 19/23;
2) inserting NOP at magic point 2 gives 23/23 in most of the runs, but 23/19 sometimes;
3) inserting NOP at both magic points gives 23/27.
So, the best result achieved by the linear version is 18 cycles, and 19 by the pipelined one.
I suspect that the 'real' speed depends on the code size and its layout - function body size and
its placement within the instruction cache should be taken into account here.
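
One way to check the placement suspicion is to look at where a function starts within an instruction cache line and how many lines its body spans. The following is a rough sketch only: the 32-byte line size is an assumption about the XScale I-cache, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICACHE_LINE_BYTES 32u /* assumed line size, must be a power of two */

/* Byte offset of an address within its instruction cache line. */
static unsigned icache_line_offset(uintptr_t addr)
{
    return (unsigned)(addr & (ICACHE_LINE_BYTES - 1u));
}

/* Number of cache lines touched by a code region of 'size' bytes
 * starting at 'start'. */
static unsigned icache_lines_spanned(uintptr_t start, size_t size)
{
    uintptr_t first, last;

    assert(size > 0);
    first = start / ICACHE_LINE_BYTES;
    last = (start + size - 1u) / ICACHE_LINE_BYTES;
    return (unsigned)(last - first + 1u);
}
```

For example, a 32-byte body at 0x1000 fits in one line, while the same body at 0x101C straddles two. An inserted NOP shifts everything after it by 4 bytes, which can change this count.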
This test definitely shows that up to 8 back-to-back loads are good enough for XScale cores with
WMMX2. But, to be as good as possible on older cores with WMMX, I suspect it's better to avoid
such loads anyway.