[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Thu May 15 18:43:35 CEST 2008

On Thu, May 15, 2008 at 04:15:28PM +0400, Dmitry Antipov wrote:
> Hello again,
>
> here are some more efforts on IWMMXT stuff for libavcodec - all inner
> loops are rewritten in assembly and a few more functions added.
>
> The performance measurement is not so simple because:
>  1) gprof doesn't provide reliable results for small (20-30 instructions)
>     functions;
>  2) the hardware provides the fast, low-overhead clock source (similar
>     to x86 TSC), but it may be accessed from the privileged mode (i.e.
>     kernel) only;
>  3) although 2) may be done via oprofile, there is no oprofile support for
>     my hardware yet :-(.
>
> So, the only benchmark I'm using for now is the simple 'synthetic' benchmark
> which measures 'all C' vs. 'all IWMMXT' stuff with plain gettimeofday() (see
> speedrun() functions within http://78.153.153.8/tmp/dspwmmx.c; it's 24K, so
> not attached here). There are some results from it:
>
> MAX_SIZE MAX_LINE C     IWMMXT Speedup
> --------------------------------------
> 64       8        100   40     2.5
> 128      16       560   190    2.95
> 256      32       2120  660    3.21
> 512      64       6670  1970   3.39
> 1024     128      19090 5360   3.56
>
> (GCC 3.4.3, '-fomit-frame-pointer -O3', XScale Core3 at 312 MHz and 622
> BogoMIPS).
>
> According to these results, I suppose that IWMMXT functions are ~2-3 times
> faster in general, but the mileage may vary from function to function - I
> didn't perform per-function measurements yet.

You have to test per function so you know whether a change to the asm
improved performance or not; testing all at once is less accurate ...
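
A minimal sketch of what a per-function measurement could look like, using
the same gettimeofday() clock the synthetic benchmark relies on. The names
pix_fn, pix_sum_ref, now_usec and time_one are hypothetical and not taken from
the patch or from dspwmmx.c; pix_sum_ref is only a plain C stand-in for the
function under test.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical signature shared by the functions under test. */
typedef int (*pix_fn)(uint8_t *pix, int line_size);

/* Plain C stand-in: sum of a 16x16 block of pixels. */
static int pix_sum_ref(uint8_t *pix, int line_size)
{
    int s = 0;
    for (int y = 0; y < 16; y++, pix += line_size)
        for (int x = 0; x < 16; x++)
            s += pix[x];
    return s;
}

/* Wall-clock time in microseconds, taken from gettimeofday(). */
static long long now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Time `iters` calls of a single function and return the elapsed
 * microseconds. Each result is added to *sink so that the calls have a
 * visible effect and are not simply dropped by the optimizer. */
static long long time_one(pix_fn fn, uint8_t *pix, int line_size,
                          int iters, volatile int *sink)
{
    long long t0 = now_usec();
    for (int i = 0; i < iters; i++)
        *sink += fn(pix, line_size);
    return now_usec() - t0;
}
```

Calling time_one() once with the C version and once with the IWMMXT version
of the same function, on the same buffer, gives one pair of numbers per
function. With a clock as coarse as gettimeofday(), iters has to be large
enough that each run lasts well beyond the timer resolution.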

Also beware of alignment: synthetic tests could easily be aligned better or
worse than the actual use ...
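
One way to probe this, sketched under assumptions: run the same function on
the same pixel data placed at several byte offsets from an aligned base, and
check (or time) each offset separately. BLOCK, STRIDE, pix_sum_ref and
check_offsets are hypothetical names, the offsets 0..7 are an arbitrary
choice, and an IWMMXT routine may have its own alignment requirements that
rule some of them out.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK  16  /* block width and height in pixels */
#define STRIDE 32  /* hypothetical line size, larger than the block */

/* Plain C stand-in for the function under test. */
static int pix_sum_ref(uint8_t *pix, int line_size)
{
    int s = 0;
    for (int y = 0; y < BLOCK; y++, pix += line_size)
        for (int x = 0; x < BLOCK; x++)
            s += pix[x];
    return s;
}

/* Call fn on identical pixel data placed at byte offsets 0..7 from the
 * start of a uint64_t array (8-byte aligned on common ABIs). Returns 1
 * if every offset gives the same result as offset 0, otherwise 0. */
static int check_offsets(int (*fn)(uint8_t *, int))
{
    static uint64_t storage[(STRIDE * BLOCK + 8) / 8];
    uint8_t *base = (uint8_t *)storage;
    int expected = 0, ok = 1;

    for (int off = 0; off < 8; off++) {
        uint8_t *pix = base + off;

        memset(storage, 0, sizeof(storage));
        for (int y = 0; y < BLOCK; y++)
            for (int x = 0; x < BLOCK; x++)
                pix[y * STRIDE + x] = (uint8_t)(x + y);

        int s = fn(pix, STRIDE);
        if (off == 0)
            expected = s;
        else if (s != expected)
            ok = 0;
    }
    return ok;
}
```

Wrapping the fn() call in a timing loop would show whether speed, and not
only the result, depends on the offset.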

>
> It would be nice if someone proposes a real video processing task which 
> loads these
> functions heavily.

[...]
> +static int pix_sum_iwmmxt(uint8_t *pix, int line_size)
> +{
> +    int s;
> +
> +    asm volatile("wzero wr0             \n\t"
> +                 "mov r1, #16           \n\t"
> +                 "1: wldrd wr1, [%1]    \n\t"
> +                 "waccb wr1, wr1        \n\t"
> +                 "waddw wr0, wr0, wr1   \n\t"
> +                 "wldrd wr1, [%1, #8]   \n\t"
> +                 "waccb wr1, wr1        \n\t"
> +                 "waddw wr0, wr0, wr1   \n\t"
> +                 "add %1, %1, %2        \n\t"
> +                 "subs r1, r1, #1       \n\t"
> +                 "bne 1b                \n\t"
> +                 "textrmsw %0, wr0, #0  \n\t"

I would suspect that reordering these instructions a little
would improve speed; this applies to all functions.
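
Which reordering is meant is not spelled out here. One possible reading,
sketched in plain C rather than IWMMXT asm: in the quoted loop every
instruction depends on the one just before it (load into wr1, accumulate wr1,
add into wr0, then reuse wr1), so handling the two halves of a row as
independent streams leaves room to overlap them. pix_sum_serial and
pix_sum_interleaved are hypothetical names; whether this helps on XScale has
to be measured.

```c
#include <stdint.h>

/* One accumulator: every addition depends on the previous one. */
static int pix_sum_serial(const uint8_t *pix, int line_size)
{
    int s = 0;
    for (int y = 0; y < 16; y++, pix += line_size)
        for (int x = 0; x < 16; x++)
            s += pix[x];
    return s;
}

/* Two accumulators: the left and right 8 bytes of each row are summed
 * independently and only combined at the end, so the two chains of
 * additions do not wait on each other. */
static int pix_sum_interleaved(const uint8_t *pix, int line_size)
{
    int lo = 0, hi = 0;
    for (int y = 0; y < 16; y++, pix += line_size) {
        for (int x = 0; x < 8; x++) {
            lo += pix[x];
            hi += pix[x + 8];
        }
    }
    return lo + hi;
}
```

The serial version doubles as a reference: any reordered variant should
return the same value for the same input.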

[...]
> +static int vsad_intra16_iwmmxt(void *c, uint8_t *pix, uint8_t *dummy, int stride, int h)
> +{
> +    int s;
> +
> +    asm volatile("mov r1, %3            \n\t"
> +                 "wzero wr0             \n\t"
> +                 "1: wldrd wr1, [%1]    \n\t"
> +                 "wldrd wr2, [%1, #8]   \n\t"
> +                 "add %1, %1, %2        \n\t"
> +                 "wldrd wr3, [%1]       \n\t"
> +                 "wldrd wr4, [%1, #8]   \n\t"
> +                 "wsadbz wr1, wr1, wr3  \n\t"
> +                 "wsadbz wr2, wr2, wr4  \n\t"
> +                 "waddw wr0, wr0, wr1   \n\t"
> +                 "waddw wr0, wr0, wr2   \n\t"
> +                 "subs r1, r1, #1       \n\t"
> +                 "bne 1b                \n\t"

Half of the loads in there are redundant: the row loaded into wr3/wr4 in one
iteration is loaded again into wr1/wr2 in the next. This also applies to a
few other functions.
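
A sketch of the idea in plain C, with small local arrays standing in for the
wr registers. The function names are hypothetical, and the loop bound (h - 1
row pairs) is an assumption, since the operand list of the quoted asm is not
shown.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reads both rows of every pair, so each inner row is read twice. */
static int vsad_intra16_naive(const uint8_t *pix, int stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {
        for (int x = 0; x < 16; x++)
            score += abs(pix[x] - pix[x + stride]);
        pix += stride;
    }
    return score;
}

/* Reads each row from memory once: the current row is kept in `prev`
 * and reused as the upper row of the next pair. Assumes h >= 1. */
static int vsad_intra16_reuse(const uint8_t *pix, int stride, int h)
{
    uint8_t prev[16], cur[16];
    int score = 0;

    memcpy(prev, pix, 16);              /* first row, loaded once */
    for (int y = 1; y < h; y++) {
        pix += stride;
        memcpy(cur, pix, 16);           /* the only load per iteration */
        for (int x = 0; x < 16; x++)
            score += abs(prev[x] - cur[x]);
        memcpy(prev, cur, 16);          /* in asm: a register move or swap */
    }
    return score;
}
```

In the asm the same effect could come from loading the first row before the
loop and keeping the lower row of each pair in registers for the next
iteration.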

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data