[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Siarhei Siamashka siarhei.siamashka
Wed May 21 17:02:28 CEST 2008

On Wednesday 21 May 2008, Dmitry Antipov wrote:
> So, you're always assuming that
> wldrd wr0, [%1]
> wldrd wr1, [%1, #8]
> _always_ has 1 cycle stall. But it looks like this is true for WMMX, but
> not for WMMX2.

That's why I'm asking you to make a cycle-precise benchmark and thoroughly
test this sequence of instructions.

Also, measuring the precise total number of CPU cycles per "pix_sum" function
call will let us know whether we have any unexpected stalls there or not.
Everything else is pure speculation for now.

You can insert some equivalent of NOP (something like "add r0, r0, #0") at the
places in the code which are suspected of having a stall and check whether
these extra instructions affect performance (if there is a stall there,
performance will not change). Your previous benchmarks also showed performance
improvements from separating WLDRD instructions (by inserting the loop
decrement instruction between them).
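
To make the interpretation of such a NOP probe concrete, here is a minimal
sketch of the arithmetic in C. The function names and the 0.5 cycle threshold
are my own assumptions for illustration, not something taken from the patch.
The inputs are total cycle counts for the same number of loop iterations,
measured once without and once with the extra instruction.

```c
/* Hypothetical helper: extra cycles per loop iteration caused by the
 * inserted NOP-like instruction. */
double extra_cycles_per_iter(unsigned long base_cycles,
                             unsigned long probe_cycles,
                             unsigned long iterations)
{
    return ((double)probe_cycles - (double)base_cycles) / (double)iterations;
}

/* Returns 1 if the NOP looks "free", which suggests that it only filled an
 * existing stall slot. Returns 0 if it cost about a full cycle, which
 * suggests there was no stall at that place.
 * The 0.5 threshold is an arbitrary assumption to absorb measurement noise. */
int nop_filled_stall(unsigned long base_cycles,
                     unsigned long probe_cycles,
                     unsigned long iterations)
{
    return extra_cycles_per_iter(base_cycles, probe_cycles, iterations) < 0.5;
}
```

For example, if 1000 iterations take 3000 cycles both with and without the
NOP, nop_filled_stall() returns 1. If the NOP variant takes 4000 cycles, the
NOP costs one cycle per iteration and the function returns 0.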

I think the Intel manual may describe a 64-bit external memory interface and
8 slots for hit-under-miss logic there. But it looks like they also say that
you can't load 64-bit data from the L1 cache on every cycle (that means it
internally loads only 32 bits from the L1 data cache per cycle, but the second
half of a WLDRD load can run "in the background" simultaneously with other
non-memory instructions). But I may be wrong...
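
A toy model of this hypothesis, written in C, shows what it would predict.
This is only a sketch under my assumptions (single issue, in order, WLDRD
keeps the L1 load path busy for 2 cycles, everything else takes 1 cycle), not
a description of the real XScale pipeline. The names are made up for
illustration.

```c
/* Hypothetical instruction classes for the toy model. */
enum insn { INSN_WLDRD, INSN_OTHER };

/* Returns the number of issue cycles needed for the sequence, assuming that
 * a WLDRD occupies the L1 load path for 2 cycles and that any other
 * instruction can issue while the second half of the load is in flight. */
unsigned model_cycles(const enum insn *seq, unsigned n)
{
    unsigned cycle = 0;     /* next free issue slot */
    unsigned port_free = 0; /* first cycle at which the load path is free */
    unsigned i;

    for (i = 0; i < n; i++) {
        if (seq[i] == INSN_WLDRD) {
            if (cycle < port_free)
                cycle = port_free;  /* stall until the load path is free */
            port_free = cycle + 2;  /* 2 x 32 bits, one half per cycle */
        }
        cycle++;
    }
    return cycle;
}
```

In this model two back-to-back WLDRD instructions take 3 cycles, and the
sequence WLDRD, other instruction, WLDRD also takes 3 cycles. So the extra
instruction between the loads would be free, which is what a cycle-precise
benchmark could confirm or refute.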

> I'm definitely interesting in obtaining 'old' WMMX core and check this,
> BTW.

Sure, it would be interesting. But I haven't seen anybody with XScale hardware
on this list for a long time. Looks like they are now extinct.

I even considered buying some kind of device with an XScale core myself, but
could not justify the purchase in the end :)

Best regards,
Siarhei Siamashka
