[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Tue May 20 20:31:59 CEST 2008

On Tuesday 20 May 2008, Dmitry Antipov wrote:
> Yes, your test_cachemiss is ~13% slower on XScale too. So doing
> pre-increment if possible should be considered as a good idea (for pix_sum,
> it's at the cost of having 1 'sub').

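You can have pre-increment without any cost: load the first row with a
plain 'wldrd', and let the writeback form step to each following row, so
no 'sub' is needed.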

> So, I'm voting for something like the following for pix_sum (it might be
> a bit unreadable without preprocessing :-):

[...]

> If 'pix' is fully cached, WMMX2 version is ~19% faster. Otherwise,
> it goes at the same speed as WMMX (it looks like loading uncached
> data is quite expensive, so an overhead introduced by 'add's is
> marginal).

Please add the following implementation of the "pix_sum" function to your
benchmark set and post the results. I strongly suspect that it is a lot
faster than any of your variants.

#define SUM1()                  \
    "wldrd wr1, [%1, %2]! \n\t" \
    "wsadb wr3, wr2, wr0  \n\t" \
    "wldrd wr2, [%1, #8]  \n\t" \
    "wsadb wr3, wr1, wr0  \n\t"

#define SUM4() \
    SUM1() \
    SUM1() \
    SUM1() \
    SUM1()

int pix_sum_iwmmxt2_pipelined(uint8_t *pix, int line_size)
{
    int s;
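    asm volatile(
        "wldrd wr1, [%1]           \n\t" /* row 0, bytes 0-7 */
        "wzero wr0                 \n\t" /* zero operand for the SADs */
        "wldrd wr2, [%1, #8]       \n\t" /* row 0, bytes 8-15 */
        "wsadbz wr3, wr1, wr0      \n\t" /* start the sum in wr3 */
        SUM1()
        SUM1()
        SUM1()
        SUM4()
        SUM4()
        SUM4()
        "wsadb wr3, wr2, wr0       \n\t" /* drain: row 15, bytes 8-15 */
        "textrmsw %0, wr3, #0      \n\t"
        : "=r"(s), "+r"(pix)
        : "r"(line_size)
        : "memory"); /* the asm reads pix[] without a memory operand */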
    return s;
}
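
For checking results while benchmarking, a scalar C model of the same
load/accumulate ordering may be useful. This is only a sketch and not part
of the patch: pix_sum_ref, sum8, pix_sum_pipelined_model and check_block
are hypothetical names, and it assumes pix_sum covers a 16x16 block, which
matches the 16 rows loaded by the asm above.

```c
#include <assert.h>
#include <stdint.h>

/* Plain reference: sum of all pixels in a 16x16 block (assumed size). */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int s = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            s += pix[x];
        pix += line_size;
    }
    return s;
}

/* Sum of 8 bytes; stands in for a WSADB against a zeroed register. */
static int sum8(const uint8_t *p)
{
    int s = 0;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}

/*
 * Scalar model of the pipelined ordering: the low half of row y is
 * fetched before the high half of row y-1 is accumulated. The pointers
 * lo and hi stand in for the contents of wr1 and wr2.
 */
static int pix_sum_pipelined_model(const uint8_t *pix, int line_size)
{
    const uint8_t *lo = pix;      /* wldrd wr1, [pix]     */
    const uint8_t *hi = pix + 8;  /* wldrd wr2, [pix, #8] */
    int s = sum8(lo);             /* wsadbz               */

    for (int y = 1; y < 16; y++) {
        pix += line_size;         /* pre-increment with writeback     */
        lo = pix;                 /* wldrd wr1, [pix, line_size]!     */
        s += sum8(hi);            /* previous row, bytes 8-15         */
        hi = pix + 8;             /* wldrd wr2, [pix, #8]             */
        s += sum8(lo);            /* current row, bytes 0-7           */
    }
    s += sum8(hi);                /* drain: last row, bytes 8-15      */
    return s;
}

/* Aborts if the two models disagree on the given block. */
static void check_block(const uint8_t *pix, int line_size)
{
    assert(pix_sum_ref(pix, line_size) ==
           pix_sum_pipelined_model(pix, line_size));
}
```

Calling check_block() on a few test blocks, and comparing the asm version
against pix_sum_ref() in the same way, should catch an off-by-one in the
number of SUM1() expansions before any timing is done.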

-- 
Best regards,
Siarhei Siamashka