[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)
Siarhei Siamashka
siarhei.siamashka
Wed Sep 12 10:31:02 CEST 2007
Hello all,
I have been working on improving IDCT performance on ARM for some time,
and I now think the code is more or less ready (performance-wise) to be
accepted upstream:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&view=markup
The following simple test program is used to ensure that the new IDCT
produces results that are bit-identical to the existing implementation and
that its performance is better in both typical and corner cases:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/tests/test-idct.c?root=mplayer&view=markup
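For readers who do not want to open the test source, the bit-identical
check can be sketched roughly as follows. This is not the code from
test-idct.c: the 'idct_fn' signature, the input range and the 'demo_*'
stand-in routines are all assumptions made only so that the sketch
compiles on its own.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature: an in-place IDCT on one 8x8 block. */
typedef void (*idct_fn)(int16_t *block);

/* Returns 1 if both routines give byte-identical output for
 * `iterations` pseudo-random input blocks, 0 on the first mismatch. */
static int idct_bit_identical(idct_fn ref, idct_fn opt,
                              unsigned seed, int iterations)
{
    int16_t a[64], b[64];

    srand(seed);
    for (int n = 0; n < iterations; n++) {
        for (int i = 0; i < 64; i++)
            a[i] = (int16_t)(rand() % 513 - 256); /* arbitrary test range */
        memcpy(b, a, sizeof a);
        ref(a);
        opt(b);
        if (memcmp(a, b, sizeof a) != 0)
            return 0;
    }
    return 1;
}

/* Stand-ins for the real routines (assumptions, not real IDCTs). */
static void demo_ref(int16_t *blk)
{
    for (int i = 0; i < 64; i++)
        blk[i] = (int16_t)(blk[i] * 2);
}

static void demo_same(int16_t *blk)
{
    for (int i = 0; i < 64; i++)
        blk[i] = (int16_t)(blk[i] + blk[i]);
}

static void demo_off_by_one(int16_t *blk)
{
    demo_ref(blk);
    blk[63] ^= 1; /* flip one bit so the outputs differ */
}
```

In a real harness 'ref' and 'opt' would point at the existing and the new
IDCT; here idct_bit_identical(demo_ref, demo_same, 1, 100) returns 1 and
idct_bit_identical(demo_ref, demo_off_by_one, 1, 100) returns 0.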
More detailed instructions on running this test and interpreting its
results can be found in this forum thread (I asked XScale users to run some
benchmarks to confirm the performance improvement on XScale and to
rule out any regressions):
http://www.oesf.org/forums/index.php?showtopic=22280&st=50
The current results are that IDCT performance improved by about 20% on
ARM9E and is more than 1.5x faster on ARM11 and XScale cores. Naturally, there
is a visible performance improvement in MPEG-4 video decoding as well.
An interesting observation is that this ARMv5TE code is practically as
fast as the current ARMv6 implementation when considering only the number
crunching part, but falls behind when storing pixel data to memory, because
ARMv6 can crop to the 0-255 range much more efficiently thanks to its
special instructions for that.
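To make the cropping step concrete, here is a minimal C sketch of the
operation in question. The function name 'clip_uint8_sketch' is
hypothetical. On ARMv6 this kind of saturation is available as a single
instruction (USAT), while ARMv5TE code has to spend several instructions
or a table lookup per pixel.

```c
#include <stdint.h>

/* Clamp a signed IDCT result to the 0..255 pixel range.
 * Hypothetical helper, shown only to illustrate the operation. */
static inline uint8_t clip_uint8_sketch(int v)
{
    if (v & ~0xFF)              /* any bit outside 0..255 is set */
        return v < 0 ? 0 : 255; /* saturate towards the nearer end */
    return (uint8_t)v;
}
```

For example, clip_uint8_sketch(-7) gives 0, clip_uint8_sketch(300) gives
255, and values already in range are returned unchanged.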
The current speedup comes from scheduling instructions so that there are
(almost) no pipeline stalls, even on the longer-pipeline ARM11 and
XScale cores with their higher latencies. 64-bit loads are also heavily
used (ARM11 and XScale can load two 32-bit registers per cycle). The
extensive use of MAC instructions combined with fast constant loading
improves performance even on ARM9E. One more performance-related
thing to note is that ARM cores do not allocate a cache line on write
misses; instead they use a write buffer, which can hold a limited number
of pending writes until they are flushed to memory. When the target
buffer for 'simple_idct_put_armv5te' is not in cache, processing
two columns at a time and storing 32 16-bit values instead
of performing 64 8-bit writes prevents write buffer overflow and the
resulting pipeline stall and performance loss (this effect can be seen
in the 'simple_idct_put_armv6' benchmark at the end of this post). By the
way, setting the 'readable' variable to 0 in 'MPV_decode_mb_internal'
avoids this problem, improves performance, and can be used as a
workaround. I plan to add some more substantial cache-related
optimizations later and can share some thoughts about how that could be
done.
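The idea of halving the number of stores can be sketched in C as follows.
This only illustrates the store pattern, it is not a translation of the
assembly. The function names are hypothetical, and whether a compiler
really emits a single halfword store for the two-byte memcpy depends on
the compiler and on alignment.

```c
#include <stdint.h>
#include <string.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* One 8-bit store per pixel: 64 stores per 8x8 block. */
static void put_block_bytes(uint8_t *dst, int stride, const int16_t *blk)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = clip_u8(blk[y * 8 + x]);
}

/* Two adjacent columns at a time: each row contributes one 2-byte
 * store per column pair, so 32 stores per 8x8 block. */
static void put_block_pairs(uint8_t *dst, int stride, const int16_t *blk)
{
    for (int x = 0; x < 8; x += 2) {
        for (int y = 0; y < 8; y++) {
            uint8_t pair[2];

            pair[0] = clip_u8(blk[y * 8 + x]);
            pair[1] = clip_u8(blk[y * 8 + x + 1]);
            /* Copying through a byte pair keeps the result
             * independent of endianness. */
            memcpy(&dst[y * stride + x], pair, sizeof pair);
        }
    }
}
```

Both functions write the same bytes. Only the number and width of the
stores differ, which is what matters for the write buffer when the
destination is not in cache.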
So now the question is: what would be the best way to integrate this IDCT
code into ffmpeg? Should it replace the current simple_idct_armv5te.S file?
Could you please review the copyright part to check whether it is OK?
I can provide a patch that omits the experimental prefetch code, gives all
globals an 'ff_' prefix, and uses 'ff_cropTbl' instead of keeping its own copy
of the cropping table (by the way, it would be nice to extend MAX_NEG_CROP to
2048, as the IDCT can produce results in the +-2K range when fed with
completely random input data).
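The reason the table margin matters can be shown with a small sketch. It
assumes a crop table laid out as 256 + 2 * margin entries and indexed at
margin + value. The names below are hypothetical and the margins are
example values only.

```c
#include <stdint.h>

/* A table with the given margin is only safe to index for
 * values in [-margin, 255 + margin]. */
static int crop_index_ok(int v, int margin)
{
    return v >= -margin && v <= 255 + margin;
}

/* Fill tbl (256 + 2 * margin entries) so that
 * tbl[margin + v] == clamp(v, 0, 255). */
static void crop_tbl_init(uint8_t *tbl, int margin)
{
    for (int i = 0; i < 256 + 2 * margin; i++) {
        int v = i - margin;
        tbl[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

/* Table-based crop: a single load, but no bounds checking. */
static uint8_t crop_lookup(const uint8_t *tbl, int margin, int v)
{
    return tbl[margin + v];
}
```

With a margin of 1024, an IDCT output of 2000 would index past the end of
the table (crop_index_ok(2000, 1024) is 0), while a margin of 2048 covers
it.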
Here are the current synthetic test results:
=== Testing on Nokia 770 (ARM9E core, 252MHz, 16K of data cache) ===
./test-idct --freq=252
avg=-0.08, stddev=36.96, min=-168.00, max=149.00
Assuming cpu clock frequency 252MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time
to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=856.4
simple_idct_put_armv5te cache=no, time=1041.3
simple_idct_put_armv5te cache=yes, time=1058.9
simple_idct_add_armv5te cache=no, time=1336.5
simple_idct_add_armv5te cache=yes, time=1215.3
simple_idct_add_pf_pld_armv5te cache=no, time=1340.3
simple_idct_add_pf_pld_armv5te cache=yes, time=1220.4
simple_idct_armv5te_ref time=1059.1
simple_idct_put_armv5te_ref cache=no, time=1283.0
simple_idct_put_armv5te_ref cache=yes, time=1275.9
simple_idct_add_armv5te_ref cache=no, time=1625.9
simple_idct_add_armv5te_ref cache=yes, time=1469.0
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1391.1
simple_idct_put_armv5te cache=no, time=1607.3
simple_idct_put_armv5te cache=yes, time=1602.3
simple_idct_add_armv5te cache=no, time=1913.9
simple_idct_add_armv5te cache=yes, time=1800.2
simple_idct_add_pf_pld_armv5te cache=no, time=1966.6
simple_idct_add_pf_pld_armv5te cache=yes, time=1835.5
simple_idct_armv5te_ref time=1716.4
simple_idct_put_armv5te_ref cache=no, time=1953.5
simple_idct_put_armv5te_ref cache=yes, time=1918.2
simple_idct_add_armv5te_ref cache=no, time=2282.7
simple_idct_add_armv5te_ref cache=yes, time=2129.8
=== Testing on Nokia N800 (ARM11 core, 330MHz, 32K of data cache) ===
./test-idct --freq=330 --enable-armv6
avg=-0.08, stddev=36.96, min=-168.00, max=149.00
Assuming cpu clock frequency 330MHz (ARMv6 enabled)
Please be patient and wait for the results, test requires quite a lot of time
to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=686.2
simple_idct_put_armv5te cache=no, time=804.9
simple_idct_put_armv5te cache=yes, time=785.3
simple_idct_add_armv5te cache=no, time=998.9
simple_idct_add_armv5te cache=yes, time=871.7
simple_idct_add_pf_pld_armv5te cache=no, time=972.1
simple_idct_add_pf_pld_armv5te cache=yes, time=879.1
simple_idct_armv5te_ref time=1092.1
simple_idct_put_armv5te_ref cache=no, time=1307.3
simple_idct_put_armv5te_ref cache=yes, time=1287.1
simple_idct_add_armv5te_ref cache=no, time=1529.5
simple_idct_add_armv5te_ref cache=yes, time=1405.1
simple_idct_armv6 time=760.2
simple_idct_put_armv6 cache=no, time=1030.8
simple_idct_put_armv6 cache=yes, time=773.0
simple_idct_add_armv6 cache=no, time=1051.6
simple_idct_add_armv6 cache=yes, time=909.4
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1153.5
simple_idct_put_armv5te cache=no, time=1273.6
simple_idct_put_armv5te cache=yes, time=1257.5
simple_idct_add_armv5te cache=no, time=1468.6
simple_idct_add_armv5te cache=yes, time=1345.7
simple_idct_add_pf_pld_armv5te cache=no, time=1402.8
simple_idct_add_pf_pld_armv5te cache=yes, time=1400.5
simple_idct_armv5te_ref time=1872.3
simple_idct_put_armv5te_ref cache=no, time=2100.2
simple_idct_put_armv5te_ref cache=yes, time=2079.0
simple_idct_add_armv5te_ref cache=no, time=2327.7
simple_idct_add_armv5te_ref cache=yes, time=2195.8
simple_idct_armv6 time=1149.3
simple_idct_put_armv6 cache=no, time=1411.9
simple_idct_put_armv6 cache=yes, time=1166.1
simple_idct_add_armv6 cache=no, time=1428.3
simple_idct_add_armv6 cache=yes, time=1305.4
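As a cross-check of the summary figures, the speedup of the plain IDCT
over the reference implementation can be computed from the numbers above.
The struct and function names are made up for this sketch, and it assumes
that a lower 'time' value means faster code.

```c
#include <stddef.h>

struct bench {
    const char *core;
    const char *input;
    double ref_time; /* simple_idct_armv5te_ref */
    double new_time; /* simple_idct_armv5te */
};

/* Values copied from the benchmark output above. */
static const struct bench results[] = {
    { "ARM9E", "zero",   1059.1,  856.4 },
    { "ARM9E", "random", 1716.4, 1391.1 },
    { "ARM11", "zero",   1092.1,  686.2 },
    { "ARM11", "random", 1872.3, 1153.5 },
};

static const size_t n_results = sizeof results / sizeof results[0];

/* Speedup factor of the new code relative to the reference. */
static double speedup(const struct bench *b)
{
    return b->ref_time / b->new_time;
}
```

The ratios come out at roughly 1.23x to 1.24x on the ARM9E and roughly
1.59x to 1.62x on the ARM11, which matches the "about 20%" and "more than
1.5x" figures quoted earlier.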