[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Dmitry Antipov dmantipov
Thu May 22 14:41:59 CEST 2008

Siarhei Siamashka wrote:

> The goal is to get some cycle precise way of measuring performance, because
> there are still too many unresolved mysteries here.

Good news - it's done. Due to a typo in the Marvell docs (THANK YOU, Marvell), my
PMU was programmed to count time in units of 64 clock cycles. Now it's
fixed, and the following code:

unsigned int b, e;
asm volatile("mrc p14, 0, %0, c1, c1, 0\n\t" /* read CCNT (start) */
             "mov r0, #32\n\t"               /* loop counter */
             "1: add r1, r1, r0\n\t"         /* dummy work, r1 value is irrelevant */
             "subs r0, r0, #1\n\t"
             "bne 1b\n\t"
             "mrc p14, 0, %1, c1, c1, 0\n\t" /* read CCNT (end) */
             : "=r"(b), "=r"(e) : : "r0", "r1", "cc"); /* SUBS changes the flags */
printk("%u\n", e - b);

reports 117 cycles (probably 96 for the loop body, i.e. 32 iterations of 3 instructions,
+ 1 for MOV + 2 * 10 for MRC). This benchmark was run in kernel context to allow a direct
MRC and to avoid system call and other possible user-space overheads.
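
The cycle accounting above can be written out as a small sketch. The per-instruction
costs are only the guesses from this message (1 cycle per ALU/branch instruction,
10 cycles per MRC), not documented figures, and the names expected_cycles() and
ticks_with_div64() are made up for illustration. The second helper shows why the
mis-programmed PMU was useless for such short sequences:

```c
/* Assumed costs, taken from the guess in the message, not from a datasheet. */
enum {
    LOOP_ITERATIONS     = 32, /* mov r0, #32 */
    INSNS_PER_ITERATION = 3,  /* add, subs, bne */
    SETUP_INSNS         = 1,  /* the single mov */
    MRC_COUNT           = 2,  /* two CCNT reads */
    MRC_COST            = 10  /* assumed cycles per MRC */
};

/* Expected difference between the two CCNT reads under the assumptions above. */
static unsigned expected_cycles(void)
{
    return LOOP_ITERATIONS * INSNS_PER_ITERATION
         + SETUP_INSNS
         + MRC_COUNT * MRC_COST;
}

/* What a counter ticking once per 64 clock cycles would report instead. */
static unsigned ticks_with_div64(unsigned cycles)
{
    return cycles / 64;
}
```

Under these assumptions expected_cycles() gives 32 * 3 + 1 + 2 * 10 = 117, which matches
the measured value, while ticks_with_div64(117) is just 1.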

Look at vs.
The first takes 22 cycles, and the second takes 23. But both of them have a magic place (commented as
such): after inserting a NOP there, the first version speeds up to 18 cycles, and the second to 19.

This was just a warm-up - look at It gives 23 for both.
  1) inserting NOP at magic point 1 gives 19/23;
  2) inserting NOP at magic point 2 gives 23/23 in most of the runs, but 23/19 sometimes;
  3) inserting NOP at both magic points gives 23/27.

So, the best result achieved by the linear version is 18 cycles, and 19 for the pipelined one.
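
Since some runs give different numbers (23/23 mostly, 23/19 sometimes), one way to get a
stable "best" figure is to repeat the measurement, keep the minimum, and subtract the cost
of an empty measurement. This is only a sketch of that idea, not the code used for the
numbers above: min_delta() and net_cycles() are hypothetical names, and fake_read_counter()
and fake_body() are simulated stand-ins so the logic can be checked off-target. On real
hardware the counter hook would wrap the MRC read of CCNT, and the indirect calls would
add overhead of their own.

```c
#include <limits.h>

typedef unsigned (*read_counter_fn)(void);
typedef void (*body_fn)(void);

/* Simulated counter: every read costs 10 "cycles" (assumption from the message). */
static unsigned fake_now;
static unsigned fake_read_counter(void) { fake_now += 10; return fake_now; }

/* Simulated code under test: 23 cycles, with 4 extra on every third call. */
static unsigned fake_calls;
static void fake_body(void) { fake_now += 23 + ((++fake_calls % 3 == 0) ? 4 : 0); }

static void empty_body(void) { }

/* Smallest counter delta seen over 'runs' repetitions of 'body'. */
static unsigned min_delta(read_counter_fn read_counter, body_fn body, int runs)
{
    unsigned best = UINT_MAX;
    for (int i = 0; i < runs; i++) {
        unsigned begin = read_counter();
        body();
        unsigned end = read_counter();
        unsigned delta = end - begin; /* unsigned math survives counter wraparound */
        if (delta < best)
            best = delta;
    }
    return best;
}

/* Best-case cost of 'body' with the measurement overhead subtracted. */
static unsigned net_cycles(read_counter_fn read_counter, body_fn body, int runs)
{
    unsigned overhead = min_delta(read_counter, empty_body, runs);
    unsigned gross = min_delta(read_counter, body, runs);
    return gross > overhead ? gross - overhead : 0;
}
```

With the simulated hooks, net_cycles(fake_read_counter, fake_body, 8) returns 23: the
occasional slower runs are discarded by taking the minimum.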

I suspect that 'real' speed depends on the code size and its layout - function body size and
its placement within the instruction cache should be taken into account here.

This test definitely shows that up to 8 back-to-back loads are good enough for XScale cores with
WMMX2. But to do as well as possible on older cores with WMMX, I suspect it's better to avoid
such loads anyway.

