[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)

Wed Sep 12 10:31:02 CEST 2007

Hello all,

I have been working on improving IDCT performance on ARM for some time,
and I now think the code is more or less ready (performance-wise) to be
accepted upstream:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&view=markup

The following simple test program is used to ensure that the new IDCT
produces results that are bit-identical to the existing implementation and
that performance is better in all the typical cases and the corner cases:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/tests/test-idct.c?root=mplayer&view=markup
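
For readers who do not want to open the link, the idea of a bit-exactness
check can be sketched roughly as follows. This is not the code from
test-idct.c: 'count_mismatches' and the 'toy_*' transforms are hypothetical
names made up for this illustration, and the toy transforms merely stand in
for the reference and optimized IDCT routines.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Signature assumed for this sketch: an in-place transform of an 8x8 block. */
typedef void (*idct_fn)(int16_t *block);

/* Feed identical random blocks to both implementations and count the
 * blocks whose outputs are not bit-identical. */
static int count_mismatches(idct_fn ref, idct_fn opt, int iterations,
                            unsigned seed)
{
    int16_t a[64], b[64];
    int mismatches = 0;

    srand(seed);
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < 64; j++)
            a[j] = (int16_t)(rand() % 512 - 256);   /* range -256..255 */
        memcpy(b, a, sizeof(a));
        ref(a);
        opt(b);
        if (memcmp(a, b, sizeof(a)) != 0)
            mismatches++;
    }
    return mismatches;
}

/* Hypothetical stand-ins, NOT real IDCTs: two equivalent ways to double
 * every coefficient, plus one deliberately broken variant. */
static void toy_double_mul(int16_t *block)
{
    for (int j = 0; j < 64; j++)
        block[j] = (int16_t)(block[j] * 2);
}

static void toy_double_add(int16_t *block)
{
    for (int j = 0; j < 64; j++)
        block[j] = (int16_t)(block[j] + block[j]);
}

static void toy_off_by_one(int16_t *block)
{
    toy_double_mul(block);
    block[0] = (int16_t)(block[0] + 1);   /* injected error */
}
```

With real routines plugged in, a result of 0 from 'count_mismatches' over
many iterations is what "bit identical" means here.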

More detailed instructions about running this test and interpreting its
results can be found in this forum thread (I asked XScale users to run some
benchmarks in order to confirm the performance improvement on XScale and
avoid any possible regressions):
http://www.oesf.org/forums/index.php?showtopic=22280&st=50

The current results are that IDCT performance improved by about 20% on
ARM9E and is more than 1.5x faster on ARM11 and XScale cores. Naturally, there
is a visible performance improvement in MPEG-4 video decoding as well.
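
As a quick cross-check of these figures, the ratios can be computed from
the plain (no put/add) timings in the benchmark listings at the end of this
post. The snippet below is only an illustrative sketch: the 'sample' struct
and the 'speedup' helper are hypothetical names, and only the four numbers
per core are taken from the listings.

```c
/* Reference vs. new timings for the plain IDCT, copied from the
 * benchmark listings in this post (same units as printed there). */
struct sample {
    const char *core;
    const char *input;
    double ref_time;   /* simple_idct_armv5te_ref */
    double new_time;   /* simple_idct_armv5te     */
};

static const struct sample samples[] = {
    { "ARM9E", "zero",   1059.1,  856.4 },
    { "ARM9E", "random", 1716.4, 1391.1 },
    { "ARM11", "zero",   1092.1,  686.2 },
    { "ARM11", "random", 1872.3, 1153.5 },
};

/* Speedup factor: how many times faster the new code is. */
static double speedup(const struct sample *s)
{
    return s->ref_time / s->new_time;
}
```

The ratios come out at roughly 1.23 for both ARM9E rows and roughly 1.6 for
both ARM11 rows, which is consistent with the summary above.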

An interesting observation is that this ARMv5TE code is practically as
fast as the current ARMv6 implementation when considering only the
number-crunching part, but falls behind when storing pixel data to memory,
because ARMv6 has special instructions that crop values to the 0-255 range
much more efficiently.
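
To make the cropping step concrete, here is a minimal C sketch of what the
'put' and 'add' variants have to do for every output pixel. The function
names are hypothetical and this is not the assembly from the patch; on
ARMv6 the clamp can presumably be done with a saturating instruction such
as USAT, while ARMv5TE needs compares or a table lookup.

```c
#include <stdint.h>

/* Crop a value to the 0..255 range of an 8-bit pixel. */
static uint8_t clamp_u8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* "put": overwrite the destination with the cropped IDCT output. */
static void put_block_clamped(const int16_t *block, uint8_t *dest, int stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dest[y * stride + x] = clamp_u8(block[y * 8 + x]);
}

/* "add": add the IDCT output to what is already in the destination,
 * then crop the sum. */
static void add_block_clamped(const int16_t *block, uint8_t *dest, int stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dest[y * stride + x] =
                clamp_u8(dest[y * stride + x] + block[y * 8 + x]);
}
```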

The current speedup comes from scheduling instructions so that there are
(almost) no pipeline stalls, even on the longer-pipeline ARM11 and XScale
cores with their higher latencies. 64-bit loads are also used heavily
(ARM11 and XScale can load two 32-bit registers per cycle). The extensive
use of MAC instructions combined with fast loading of constants improves
performance even on ARM9E.

One more interesting performance-related thing to note is that these ARM
cores do not allocate a cache line on write misses, but use a write buffer
that can hold a limited number of pending write requests until they are
flushed to memory. When the target buffer for 'simple_idct_put_armv5te' is
not in cache, processing two columns at a time and storing 32 16-bit values
to memory instead of performing 64 8-bit writes prevents write buffer
overflow and the resulting pipeline stall and performance loss (the effect
can be seen in the 'simple_idct_put_armv6' benchmark at the end of this
post, where the cache=no case is much slower than cache=yes). By the way,
setting the 'readable' variable to 0 in 'MPV_decode_mb_internal' helps to
avoid this problem, improves performance and can be used as a workaround.
I plan to add some more substantial cache-related optimizations later and
can share some thoughts about how that could be done.
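
The store pattern described above can be sketched in C as follows. This is
only an illustration of the idea, not the actual assembly: the function
names are hypothetical, and the 2-byte memcpy stands in for a single 16-bit
store (compilers typically turn it into one halfword store when alignment
allows). Each function returns the number of store operations it issued.

```c
#include <stdint.h>
#include <string.h>

static uint8_t sat_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Baseline: one 8-bit store per pixel, 64 stores per 8x8 block. */
static int put_bytewise(const int16_t *block, uint8_t *dest, int stride)
{
    int stores = 0;

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            dest[y * stride + x] = sat_u8(block[y * 8 + x]);
            stores++;
        }
    return stores;
}

/* Two columns at a time: the two neighbouring pixels of each row are
 * written together as one 16-bit value, 32 stores per 8x8 block. */
static int put_pairwise(const int16_t *block, uint8_t *dest, int stride)
{
    int stores = 0;

    for (int x = 0; x < 8; x += 2)          /* one pair of columns */
        for (int y = 0; y < 8; y++) {
            uint8_t pair[2] = {
                sat_u8(block[y * 8 + x]),
                sat_u8(block[y * 8 + x + 1]),
            };
            memcpy(dest + y * stride + x, pair, sizeof(pair));
            stores++;
        }
    return stores;
}
```

Both functions produce the same pixels; the second simply issues half as
many write requests, which is what keeps the write buffer from filling up.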

So now the question is: what would be the best way to integrate this IDCT
code into ffmpeg? Should it replace the current simple_idct_armv5te.S file?
Could you please review the copyright part to check whether it is OK?

I can provide a patch that omits the experimental prefetch code, gives all
the globals an 'ff_' prefix, and uses 'ff_cropTbl' instead of keeping its
own copy of the cropping table (by the way, it would be nice to extend
MAX_NEG_CROP to 2048, as the IDCT can produce results in the +-2K range when
fed completely random input data).
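
To illustrate why the margin matters, here is a minimal sketch of a
cropping table with guard regions on both sides. 'CROP_MARGIN', 'crop_tbl',
'init_crop_tbl' and 'crop' are hypothetical names chosen for this example;
CROP_MARGIN plays the role of the proposed MAX_NEG_CROP value, and the
layout is assumed to resemble that of 'ff_cropTbl'.

```c
#include <stdint.h>

#define CROP_MARGIN 2048   /* hypothetical stand-in for MAX_NEG_CROP */

/* 256 real entries plus CROP_MARGIN guard entries on each side. */
static uint8_t crop_tbl[256 + 2 * CROP_MARGIN];

static void init_crop_tbl(void)
{
    for (int i = 0; i < 256 + 2 * CROP_MARGIN; i++) {
        int v = i - CROP_MARGIN;   /* value this entry represents */
        crop_tbl[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Table-based crop: only valid for -CROP_MARGIN <= v < 256 + CROP_MARGIN.
 * Anything outside that range would index past the table, which is why
 * the margin has to cover the full range the IDCT can produce. */
static uint8_t crop(int v)
{
    return crop_tbl[v + CROP_MARGIN];
}
```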

Here are the current synthetic test results:

=== Testing on Nokia 770 (ARM9E core, 252MHz, 16K of data cache) ===

./test-idct --freq=252
avg=-0.08, stddev=36.96, min=-168.00, max=149.00
Assuming cpu clock frequency 252MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time 
to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=856.4
simple_idct_put_armv5te  cache=no,  time=1041.3
simple_idct_put_armv5te  cache=yes, time=1058.9
simple_idct_add_armv5te  cache=no,  time=1336.5
simple_idct_add_armv5te  cache=yes, time=1215.3
simple_idct_add_pf_pld_armv5te  cache=no,  time=1340.3
simple_idct_add_pf_pld_armv5te  cache=yes,  time=1220.4
simple_idct_armv5te_ref  time=1059.1
simple_idct_put_armv5te_ref  cache=no,  time=1283.0
simple_idct_put_armv5te_ref  cache=yes, time=1275.9
simple_idct_add_armv5te_ref  cache=no,  time=1625.9
simple_idct_add_armv5te_ref  cache=yes, time=1469.0
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1391.1
simple_idct_put_armv5te  cache=no,  time=1607.3
simple_idct_put_armv5te  cache=yes, time=1602.3
simple_idct_add_armv5te  cache=no,  time=1913.9
simple_idct_add_armv5te  cache=yes, time=1800.2
simple_idct_add_pf_pld_armv5te  cache=no,  time=1966.6
simple_idct_add_pf_pld_armv5te  cache=yes,  time=1835.5
simple_idct_armv5te_ref  time=1716.4
simple_idct_put_armv5te_ref  cache=no,  time=1953.5
simple_idct_put_armv5te_ref  cache=yes, time=1918.2
simple_idct_add_armv5te_ref  cache=no,  time=2282.7
simple_idct_add_armv5te_ref  cache=yes, time=2129.8

=== Testing on Nokia N800 (ARM11 core, 330MHz, 32K of data cache) ===

./test-idct --freq=330 --enable-armv6
avg=-0.08, stddev=36.96, min=-168.00, max=149.00
Assuming cpu clock frequency 330MHz (ARMv6 enabled)
Please be patient and wait for the results, test requires quite a lot of time 
to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=686.2
simple_idct_put_armv5te  cache=no,  time=804.9
simple_idct_put_armv5te  cache=yes, time=785.3
simple_idct_add_armv5te  cache=no,  time=998.9
simple_idct_add_armv5te  cache=yes, time=871.7
simple_idct_add_pf_pld_armv5te  cache=no,  time=972.1
simple_idct_add_pf_pld_armv5te  cache=yes,  time=879.1
simple_idct_armv5te_ref  time=1092.1
simple_idct_put_armv5te_ref  cache=no,  time=1307.3
simple_idct_put_armv5te_ref  cache=yes, time=1287.1
simple_idct_add_armv5te_ref  cache=no,  time=1529.5
simple_idct_add_armv5te_ref  cache=yes, time=1405.1
simple_idct_armv6  time=760.2
simple_idct_put_armv6  cache=no,  time=1030.8
simple_idct_put_armv6  cache=yes, time=773.0
simple_idct_add_armv6  cache=no,  time=1051.6
simple_idct_add_armv6  cache=yes, time=909.4
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1153.5
simple_idct_put_armv5te  cache=no,  time=1273.6
simple_idct_put_armv5te  cache=yes, time=1257.5
simple_idct_add_armv5te  cache=no,  time=1468.6
simple_idct_add_armv5te  cache=yes, time=1345.7
simple_idct_add_pf_pld_armv5te  cache=no,  time=1402.8
simple_idct_add_pf_pld_armv5te  cache=yes,  time=1400.5
simple_idct_armv5te_ref  time=1872.3
simple_idct_put_armv5te_ref  cache=no,  time=2100.2
simple_idct_put_armv5te_ref  cache=yes, time=2079.0
simple_idct_add_armv5te_ref  cache=no,  time=2327.7
simple_idct_add_armv5te_ref  cache=yes, time=2195.8
simple_idct_armv6  time=1149.3
simple_idct_put_armv6  cache=no,  time=1411.9
simple_idct_put_armv6  cache=yes, time=1166.1
simple_idct_add_armv6  cache=no,  time=1428.3
simple_idct_add_armv6  cache=yes, time=1305.4