[FFmpeg-devel] Some IWMMXT functions for libavcodec

Siarhei Siamashka siarhei.siamashka
Sat May 17 16:43:01 CEST 2008

On Saturday 17 May 2008, Dmitry Antipov wrote:
> Siarhei Siamashka wrote:
> > Does Intel contradict itself? Or there is some variation between
> > different revisions of XScale cores and they have different optimization
> > rules? Can you provide a direct link to the document you are using?
> There are two generations of WMMX hardware for now - WMMX inside PXA27x
> cores and WMMX2 inside PXA3xx cores (read the PXA genealogy at
> http://en.wikipedia.org/wiki/XScale if you're not familiar with it).
> The specification at http://www.intel.com/design/intelxscale/314510.htm
> describes WMMX2, but there is another (older) specification of WMMX. I
> can't find a direct link on Intel's sites, but you can grab my copy at

OK, thanks for the link.

> I'm using the hardware based on PXA310
> (http://www.marvell.com/products/cellular/application/pxa310.jsp). But the
> PXA27x cores are not out of business - in fact, they form today's
> end-user hardware mainstream, and PXA3xx hardware will replace them in
> the near future.
> As I understand it, WMMX2 is a strict superset of WMMX in terms of
> instruction semantics - it adds new instructions, but the rest is fully
> backward compatible. But WMMX and WMMX2 differ (by how much?) at the
> hardware level, so code which is perfectly tuned for WMMX2 may not be
> so perfect on WMMX.

Yes, that's true. Actually, the difference between WLDRD on iwmmxt and iwmmxt2
seems to be that on iwmmxt the WLDRD instruction has a latency of 4 and limited
support for back-to-back loads, while iwmmxt2 reduces this latency to 3 but
stalls on any back-to-back WLDRD load.

So your code has a 2-cycle stall for each pair of WLDRD instructions on all
XScale cores (on older cores, a 2-cycle stall because of load latency; on
newer cores, a 1-cycle stall because of the back-to-back loads plus one more
cycle because of load latency).

If you want code which runs fast on all XScale cores, you should
avoid any back-to-back WLDRD instructions and assume a WLDRD latency of 4.
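
To make the stall arithmetic above concrete, here is a toy model in plain C.
It is only a sketch and not a cycle-accurate XScale simulator: it assumes a
single-issue in-order pipeline, treats "latency" as the number of cycles from
WLDRD issue until the loaded register can be consumed, charges iwmmxt2 one
extra cycle for a WLDRD that directly follows another WLDRD, and assumes that
iwmmxt handles a single back-to-back pair without penalty. All names
(count_stalls, naive_pair, spread_pair, iwmmxt_model, iwmmxt2_model) are
hypothetical, and the instruction sequences are illustrations rather than
the code under review.

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-issue, in-order model; every rule here is an assumption. */
enum op_kind { OP_WLDRD, OP_USE, OP_OTHER };

struct op {
    enum op_kind kind;
    int reg;                 /* wR register loaded or consumed; unused for OP_OTHER */
};

struct core {
    int load_latency;        /* cycles from WLDRD issue until the data is usable */
    int back_to_back_stall;  /* extra cycles when a WLDRD directly follows a WLDRD */
};

#define NUM_WREGS 16

/* Hypothetical parameters taken from the numbers quoted in this thread. */
const struct core iwmmxt_model  = { 4, 0 };
const struct core iwmmxt2_model = { 3, 1 };

/* Two loads back to back, consumed immediately afterwards. */
const struct op naive_pair[] = {
    { OP_WLDRD, 1 }, { OP_WLDRD, 2 }, { OP_USE, 1 }, { OP_USE, 2 }
};

/* Same loads, separated by independent instructions (e.g. ARM core work). */
const struct op spread_pair[] = {
    { OP_WLDRD, 1 }, { OP_OTHER, 0 }, { OP_WLDRD, 2 }, { OP_OTHER, 0 },
    { OP_USE, 1 },   { OP_OTHER, 0 }, { OP_USE, 2 }
};

/* Returns the total number of stall cycles for the sequence on the given core. */
int count_stalls(const struct op *ops, size_t n, const struct core *c)
{
    int ready[NUM_WREGS] = { 0 };  /* cycle at which each register becomes usable */
    int cycle = 0;                 /* cycle at which the next op would issue */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        int wait = 0;

        assert(ops[i].reg >= 0 && ops[i].reg < NUM_WREGS);

        /* Penalty for a load issued directly after another load. */
        if (ops[i].kind == OP_WLDRD && i > 0 && ops[i - 1].kind == OP_WLDRD)
            wait = c->back_to_back_stall;

        /* A consumer waits until the loaded data is available. */
        if (ops[i].kind == OP_USE && ready[ops[i].reg] > cycle)
            wait = ready[ops[i].reg] - cycle;

        stalls += wait;
        cycle += wait;

        if (ops[i].kind == OP_WLDRD)
            ready[ops[i].reg] = cycle + c->load_latency;

        cycle++;                   /* the op itself occupies one issue slot */
    }
    return stalls;
}
```

Under these assumptions, count_stalls() gives 2 for naive_pair with both
iwmmxt_model and iwmmxt2_model, and 0 for spread_pair with both, which is the
reason for scheduling against a latency of 4 with no adjacent loads.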

> > One more interesting issue with WLDRD instruction is that it should
> > support register offset addressing mode according to the manual. So you
> > should have been able to use:
> >     wldrd wr2, [%1, #8]
> >     wldrd wr1, [%1], %2
> > instead of
> >     wldrd wr1, [%1]
> >     wldrd wr2, [%1, #8]
> >     add %1, %1, %2
> >
> > But the toolchain I'm using (also tried gcc 4.3 and binutils 2.18) seems
> > to silently ignore register offset and generates wrong instruction here
> > (without register postincrement). Either I'm misunderstanding something,
> > or it is a bug in binutils. Could you please try to investigate it
> > further and submit a bugreport to binutils if needed?
> I'm using ancient (but proven to be stable) gcc 3.4.3 and binutils 2.15.94
> (dated 20041215). This toolchain understands constant post-increment like
>      wldrd wr0, [%0], #8
> but not register post-increment like
>      wldrd wr0, [%0], %1
> An attempt to compile the last example issues an error from as:
> test.s:36:Error: # or { expected after comma -- `wldrd wr0,[r4],r6'
> Indeed, this is strange, and I'll try to investigate it.

Support for the register offset addressing mode for WLDRD is also new in
iwmmxt2. Older XScale cores do not support it (and neither do older versions
of binutils). But newer versions of binutils seem to accept the instruction
and then generate invalid code.

Anyway, please try the attached code (completely untested). I can't guarantee
that it is correct, but it should be scheduled better (and it assumes that 'h'
is >= 4 and multiple of 2, but could be easily extended to support other

It would still be nice if you could benchmark the performance of each function
and compare the different implementations, so that we can see the progress :)

Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-iwmmxt.c
Type: text/x-csrc
Size: 2740 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080517/eaf3db2a/attachment.c>
