[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Tue May 20 17:06:37 CEST 2008

Siarhei Siamashka wrote:

>> ... Now we can shift all the B operations relative to A down and get the
>> following: 
>>
>> A1
>> .. - empty slot
>> A2
>> B1
>> A3
>> B2
>> A4
>> B3
>> B4
>>
>> This code will take 9 cycles.
> 
> Except that will be actually 10 cycles and the following pattern, sorry for
> typo:
> 
> A1
> .. - empty slot
> A2
> B1
> A3
> B2
> A4
> B3
> .. - empty slot
> B4

That's OK. But, as for pix_sum, if we have nothing to fill the empty slot with,

A1
A2
A3
A4
B1
B2
B3
B4

is as fast as

A1
A2
A3
B1
A4
B2
B3
B4

And, as expected,

A1
A2
B1
A3
B2
A4
B3
B4

is slower.

Yes, your test_cachemiss is ~13% slower on XScale too. So using pre-increment
where possible should be considered a good idea (for pix_sum, it comes at the
cost of one extra 'sub').
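
To make the trade-off concrete, here is a minimal scalar sketch in plain C
(not IWMMXT code). The names row_sum16, pix_sum_post_add and pix_sum_pre_inc
are hypothetical and only illustrate the addressing pattern: the first
variant advances the offset with a separate 'add' after every row, the second
biases the offset back once and then advances it as part of each access,
which is what a load with writeback would fold into the load itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the 16 bytes of one row (stands in for the per-row loads and sums). */
static int row_sum16(const uint8_t *p)
{
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += p[i];
    return s;
}

/* Load first, then advance with a separate 'add' on every row. */
static int pix_sum_post_add(const uint8_t *pix, int line_size)
{
    ptrdiff_t off = 0;
    int s = 0;

    assert(!(line_size & 7));       /* same stride check as the asm version */
    for (int row = 0; row < 16; row++) {
        s += row_sum16(pix + off);
        off += line_size;           /* one 'add' per row */
    }
    return s;
}

/* Bias the offset back once, then advance as part of each access. */
static int pix_sum_pre_inc(const uint8_t *pix, int line_size)
{
    /* The single extra 'sub'. An offset is used instead of moving 'pix'
     * itself so that no out-of-bounds pointer is ever formed in C. */
    ptrdiff_t off = -(ptrdiff_t)line_size;
    int s = 0;

    assert(!(line_size & 7));
    for (int row = 0; row < 16; row++) {
        off += line_size;           /* folded into the load by writeback */
        s += row_sum16(pix + off);
    }
    return s;
}
```

Both variants visit the same 16 rows, so they must return the same sum; only
the number of separate address updates differs (16 'add's versus 1 'sub').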

So, I'm voting for something like the following for pix_sum (it might be
a bit unreadable without preprocessing :-):

#ifdef HAVE_IWMMXT2
#define LOAD16_iwmmxt2(w0,w1,r0,r1)               \
     "wldrd wr" #w1 ", [%" #r0 ", %" #r1 "]! \n\t" \
     "wldrd wr" #w0 ", [%" #r0 ", #8]        \n\t"

#define SUB_iwmmxt2(x,y) "sub %" #x ", %" #x ", %" #y "\n\t"
#endif

#define LOAD16_iwmmxt(w0,w1,r0,r1)              \
     "wldrd wr" #w0 ", [%" #r0 "]          \n\t" \
     "wldrd wr" #w1 ", [%" #r0 ", #8]      \n\t" \
     "add %" #r0 ", %" #r0 ", %" #r1 "     \n\t"

#define SUB_iwmmxt(x,y)

#define SUM(name,x,y,z,t)           \
     LOAD16_ ##name(x, y, 1, 2)      \
     LOAD16_ ##name(z, t, 1, 2)      \
     "wsadb wr0, wr" #x ", wr5 \n\t" \
     "wsadb wr0, wr" #y ", wr5 \n\t" \
     "wsadb wr0, wr" #z ", wr5 \n\t" \
     "wsadb wr0, wr" #t ", wr5 \n\t"

#define DEF_PIX_SUM(name)                               \
static int pix_sum_ ##name(uint8_t *pix, int line_size) \
{                                                       \
     int s;                                              \
                                                         \
     assert(!((unsigned long)pix & 7));                  \
     assert(!(line_size & 7));                           \
                                                         \
     asm volatile("wzero wr0                 \n\t"       \
                  "wzero wr5                 \n\t"       \
                  SUB_ ##name(1, 2)                      \
                  SUM(name, 1, 2, 3, 4)                  \
                  SUM(name, 1, 2, 3, 4)                  \
                  SUM(name, 1, 2, 3, 4)                  \
                  SUM(name, 1, 2, 3, 4)                  \
                  SUM(name, 1, 2, 3, 4)                  \
                  SUM(name, 1, 2, 3, 4)                  \
                  SUM(name, 1, 2, 3, 4)                  \
                  SUM(name, 1, 2, 3, 4)                  \
                  "textrmsw %0, wr0, #0      \n\t"       \
                  : "=r"(s), "+r"(pix)                   \
                  : "r"(line_size));                     \
     return s;                                           \
}

DEF_PIX_SUM(iwmmxt)
#ifdef HAVE_IWMMXT2
DEF_PIX_SUM(iwmmxt2)
#endif
#undef DEF_PIX_SUM
#undef SUM
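
For checking the result of the listing above on a host machine, a scalar
model can help. The sketch below assumes, as the listing relies on, that
'wsadb' adds the sum of absolute byte differences of its two source
registers to the destination; with wr5 zeroed this degenerates into a plain
byte sum. The names wsadb_model and pix_sum_model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "wsadb acc, a, b": add the sum of |a[i] - b[i]| over
 * 8 unsigned bytes to the accumulator (assumed semantics, see above). */
static uint32_t wsadb_model(uint32_t acc, const uint8_t a[8],
                            const uint8_t b[8])
{
    for (int i = 0; i < 8; i++)
        acc += (uint32_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    return acc;
}

/* 16 rows, two 8-byte halves per row, all accumulated against zero. */
static int pix_sum_model(const uint8_t *pix, int line_size)
{
    static const uint8_t zero[8] = { 0 };  /* role of wr5 after wzero */
    uint32_t acc = 0;                      /* role of wr0 after wzero */

    assert(!(line_size & 7));
    for (int row = 0; row < 16; row++) {
        const uint8_t *p = pix + (ptrdiff_t)row * line_size;
        acc = wsadb_model(acc, p, zero);       /* bytes 0..7 of the row  */
        acc = wsadb_model(acc, p + 8, zero);   /* bytes 8..15 of the row */
    }
    return (int)acc;
}
```

The largest possible result is 16 * 16 * 255 = 65280, so the accumulator
cannot overflow.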

If 'pix' is fully cached, the WMMX2 version is ~19% faster. Otherwise,
it runs at the same speed as the WMMX one (it looks like loading uncached
data is quite expensive, so the overhead introduced by the 'add's is
marginal).
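
One possible way to compare the cached and uncached cases is sketched below.
This is not the harness used for the numbers above; pix_sum_fn,
pix_sum_scalar and time_pix_sum are hypothetical names, and the scalar
function is only a portable stand-in for pix_sum_iwmmxt / pix_sum_iwmmxt2.
It assumes that a step of 0 keeps the block in the data cache, while a large
step through a buffer bigger than the cache makes most loads miss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Same signature as the pix_sum_* functions generated by DEF_PIX_SUM. */
typedef int (*pix_sum_fn)(uint8_t *pix, int line_size);

/* Portable stand-in so the harness can be tried without IWMMXT hardware. */
static int pix_sum_scalar(uint8_t *pix, int line_size)
{
    int s = 0;
    for (int row = 0; row < 16; row++)
        for (int col = 0; col < 16; col++)
            s += pix[row * line_size + col];
    return s;
}

/*
 * Call fn 'iters' times and return the elapsed CPU time in seconds.
 * step == 0 always reads the same 16x16 block; a non-zero step moves to a
 * new block on each call and wraps at the end of the buffer. For the asm
 * versions, buf, line_size and step should all be multiples of 8.
 * The running total is stored in *total so the calls have a visible effect.
 */
static double time_pix_sum(pix_sum_fn fn, uint8_t *buf, size_t buf_size,
                           int line_size, size_t step, int iters,
                           long *total)
{
    size_t block = (size_t)line_size * 15 + 16;  /* bytes spanned per call */
    size_t off = 0;
    long sum = 0;
    clock_t t0, t1;

    assert(buf_size >= block);
    t0 = clock();
    for (int i = 0; i < iters; i++) {
        sum += fn(buf + off, line_size);
        off += step;
        if (off + block > buf_size)
            off = 0;                             /* wrap around */
    }
    t1 = clock();
    *total = sum;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Running it once with step 0 and once with a step of several cache lines,
for each pix_sum variant, gives the two timings to compare.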

Dmitry