[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Sun May 18 10:18:27 CEST 2008

On Saturday 17 May 2008, Dmitry Antipov wrote:

[...]

> It looks you're right here. This version of pix_sum (1):
>
>      asm volatile("wzero wr0                 \n\t"
>                   "mov r1, #16               \n\t"
>                   "1: wldrd wr2, [%1, #8]    \n\t"
>                   "subs r1, r1, #1           \n\t" /* subs here */
>                   "wldrd wr1, [%1], %2       \n\t"
>                   "waccb wr2, wr2            \n\t"
>                   "waddw wr0, wr0, wr2       \n\t"
>                   "waccb wr1, wr1            \n\t"
>                   "waddw wr0, wr0, wr1       \n\t"
>                   "bne 1b                    \n\t"
>                   "textrmsw %0, wr0, #0      \n\t"
>
>                   : "=r"(s), "+r"(pix)
>                   : "r"(line_size)
>                   : "r1");
>
> is 6-7% faster than this (2):
>
>      asm volatile("wzero wr0                 \n\t"
>                   "mov r1, #16               \n\t"
>                   "1: wldrd wr2, [%1, #8]    \n\t"
>                   "wldrd wr1, [%1], %2       \n\t"
>                   "waccb wr2, wr2            \n\t"
>                   "waddw wr0, wr0, wr2       \n\t"
>                   "waccb wr1, wr1            \n\t"
>                   "waddw wr0, wr0, wr1       \n\t"
>                   "subs r1, r1, #1           \n\t" /* subs here */
>                   "bne 1b                    \n\t"
>                   "textrmsw %0, wr0, #0      \n\t"
>
>                   : "=r"(s), "+r"(pix)
>                   : "r"(line_size)
>                   : "r1");

BTW, this is a bit strange: a ~7% improvement from saving 1 cycle implies
approximately (1 / 0.07) = 14.3 cycles per loop iteration, yet the loop
contains only 8 instructions.
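
The same back-of-the-envelope estimate, written out as a small C sketch. The
function names are made up for illustration, and it uses the same rough
approximation as above (cycles saved divided by relative gain), so treat the
result as an order-of-magnitude check rather than a measurement:

```c
#include <assert.h>

/* Cycles per iteration implied by saving `cycles_saved` cycles for a
 * relative speedup of `relative_gain` (0.07 means 7%).
 * Hypothetical helper, not part of libavcodec. */
static double implied_cycles_per_iteration(double cycles_saved,
                                           double relative_gain)
{
    assert(cycles_saved > 0.0 && relative_gain > 0.0);
    return cycles_saved / relative_gain;
}

/* Cycles left unexplained if every instruction issued in one cycle. */
static double unexplained_cycles(double implied_cycles, int instructions)
{
    return implied_cycles - (double)instructions;
}
```

With 1 cycle saved and a 7% gain, the first function gives about 14.3 cycles,
and the second leaves roughly 6 cycles per iteration that the 8 instructions
alone do not account for.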

Please also try to benchmark:
      asm volatile("wzero wr0                 \n\t"
                   "mov r1, #16               \n\t"
                   "1: wldrd wr2, [%1, #8]    \n\t"
                   "wldrd wr1, [%1], %2       \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "subs r1, r1, #1           \n\t"
                   "waccb wr2, wr2            \n\t"
                   "waccb wr1, wr1            \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "waddw wr0, wr0, wr2       \n\t"
                   "waddw wr0, wr0, wr1       \n\t"
                   "bne 1b                    \n\t"
                   "textrmsw %0, wr0, #0      \n\t"

                   : "=r"(s), "+r"(pix)
                   : "r"(line_size)
                   : "r1");
vs
      asm volatile("wzero wr0                 \n\t"
                   "mov r1, #16               \n\t"
                   "1: wldrd wr2, [%1, #8]    \n\t"
                   "wldrd wr1, [%1], %2       \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "subs r1, r1, #1           \n\t"
                   "waccb wr2, wr2            \n\t"
                   "waddw wr0, wr0, wr2       \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "sub  r1, r1, #0           \n\t"
                   "waccb wr1, wr1            \n\t"
                   "waddw wr0, wr0, wr1       \n\t"
                   "bne 1b                    \n\t"
                   "textrmsw %0, wr0, #0      \n\t"

                   : "=r"(s), "+r"(pix)
                   : "r"(line_size)
                   : "r1");

The point is that the WACC instruction is listed with a result latency of 1
in the main table of the optimization manual (a good thing). But the manual
also has a "Data Hazards" section, which contains the following information:

"The destination register (accumulator) for certain multiplier instructions
(WMAC, WSAD, TMIA, TMIAPH, TMIAxy) can be forwarded for accumulation to the
same destination register only. If the destination register results are needed
by another instruction as source operands, there is an additional result
latency as the result is available from the regular forwarding paths, external
to the multiplier.
 - For WMAC, WACC, WMUL, WMADD, WMIAxy, WMIAWxy, WQMULM, WQMIAxy, WMULW
   and WQMULWM there is an additional result latency of 3 cycles.
 - For WSAD, TMIA, TMIAPH, and TMIAxy there is an additional result latency
   of 4 cycles."

I must admit I'm not completely sure that I understand everything it says
correctly, but there may be cases when the WACC instruction has a result
latency higher than 1 (is it 1 + 3 = 4?). If reordering the WACC and WADD
instructions improves performance in this case, we should assume that the
result latency of WACC, when its result is used by WADD, is 4 cycles.
That will make avoiding pipeline stalls a bit more difficult, but definitely
not impossible :)
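
To illustrate what the two orderings are meant to distinguish, here is a toy
stall counter in C. It is only a sketch under loud assumptions: a
single-issue, in-order core, every instruction other than WACC having a result
latency of 1, and the operands already loaded. It models only the six
instructions after the "subs". The real pipeline is more complicated, and all
the names here are invented for the example:

```c
#include <assert.h>
#include <stddef.h>

enum { REG_NONE = -1, WR0, WR1, WR2, R1, NUM_REGS };

struct insn {
    int dst;      /* destination register, or REG_NONE */
    int src[2];   /* source registers, or REG_NONE */
    int latency;  /* cycles from issue until dst can be consumed */
};

/* Toy model: an instruction issues in the first cycle, at or after the
 * cycle following its predecessor, in which all its sources are ready.
 * Returns the number of stall cycles for one pass over the sequence. */
static int count_stalls(const struct insn *code, size_t n)
{
    int ready[NUM_REGS] = {0};  /* cycle at which each register is usable */
    int cycle = 0;              /* earliest cycle the next insn may issue */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        int issue = cycle;
        for (int k = 0; k < 2; k++) {
            int s = code[i].src[k];
            if (s != REG_NONE && ready[s] > issue)
                issue = ready[s];
        }
        stalls += issue - cycle;
        if (code[i].dst != REG_NONE)
            ready[code[i].dst] = issue + code[i].latency;
        cycle = issue + 1;
    }
    return stalls;
}

/* First variant: waccb, waccb, sub, sub, waddw, waddw */
static int stalls_grouped(int wacc_latency)
{
    const struct insn body[] = {
        { WR2, { WR2, REG_NONE }, wacc_latency },  /* waccb wr2, wr2      */
        { WR1, { WR1, REG_NONE }, wacc_latency },  /* waccb wr1, wr1      */
        { R1,  { R1,  REG_NONE }, 1 },             /* sub   r1, r1, #0    */
        { R1,  { R1,  REG_NONE }, 1 },             /* sub   r1, r1, #0    */
        { WR0, { WR0, WR2 },      1 },             /* waddw wr0, wr0, wr2 */
        { WR0, { WR0, WR1 },      1 },             /* waddw wr0, wr0, wr1 */
    };
    return count_stalls(body, sizeof body / sizeof body[0]);
}

/* Second variant: waccb, waddw, sub, sub, waccb, waddw */
static int stalls_interleaved(int wacc_latency)
{
    const struct insn body[] = {
        { WR2, { WR2, REG_NONE }, wacc_latency },  /* waccb wr2, wr2      */
        { WR0, { WR0, WR2 },      1 },             /* waddw wr0, wr0, wr2 */
        { R1,  { R1,  REG_NONE }, 1 },             /* sub   r1, r1, #0    */
        { R1,  { R1,  REG_NONE }, 1 },             /* sub   r1, r1, #0    */
        { WR1, { WR1, REG_NONE }, wacc_latency },  /* waccb wr1, wr1      */
        { WR0, { WR0, WR1 },      1 },             /* waddw wr0, wr0, wr1 */
    };
    return count_stalls(body, sizeof body / sizeof body[0]);
}
```

In this model both variants are stall-free if the WACC latency is 1. If it is
4, the first variant is still stall-free while the second loses 3 cycles after
each waccb, 6 per iteration. So a measurable difference between the two
benchmarks would point at the higher latency.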

It would really help to get oprofile working on your device, making proper
use of the hardware performance counters. It could collect statistics about
pipeline stalls and help to ensure that nothing got missed or forgotten.

You can also roughly measure the number of cycles taken by each loop
iteration from the CPU clock frequency, the running time of the benchmark
and the total number of loop iterations. If the number of cycles is far off
the theoretical estimate (typically the number of instructions in the
loop), something is wrong with the instruction scheduling and you need to
dig into the optimization manual to find an explanation.
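
A minimal sketch of that calculation in C. The helper name is hypothetical,
and the result includes function call and benchmark loop overhead, so it is
only a rough upper bound on the cost of the inner loop itself:

```c
#include <assert.h>

/* Rough cycles per loop iteration from the CPU clock frequency (Hz), the
 * wall-clock time of the benchmark (seconds) and the total number of
 * inner-loop iterations executed. Hypothetical helper for illustration. */
static double measured_cycles_per_iteration(double cpu_hz, double seconds,
                                            unsigned long long iterations)
{
    assert(cpu_hz > 0.0 && seconds > 0.0 && iterations > 0);
    return cpu_hz * seconds / (double)iterations;
}
```

For example, with made-up numbers: a 400 MHz CPU running pix_sum one million
times (16 iterations each, so 16000000 iterations in total) in 0.56 seconds
would come out at about 14 cycles per iteration, to be compared against the
8 instructions in the loop.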

-- 
Best regards,
Siarhei Siamashka