[Ffmpeg-devel] [PATCH] Optimization of 'dct_unquantize_h263_intra' for ARM (armv5te)

Siarhei Siamashka siarhei.siamashka
Tue Jan 2 18:13:57 CET 2007


On Tuesday 02 January 2007 04:23, you wrote:

Well, never mind the question in my previous post:
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-January/050356.html

I thought that this code could be improved by reducing the number of
instructions for loading and storing data (loading and storing values as
32-bit or even 64-bit words). But it appears that the performance improvement
is minimal (only the loads can be optimized) and not worth the extra trouble.
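
To illustrate the wide-load idea mentioned above, here is a minimal sketch in
C, not code from the patch. The helper name load_coeff_pair is hypothetical,
and the unpacking assumes a little-endian target. It fetches two adjacent
16-bit coefficients with one 32-bit load, and the extra shift and mask needed
to split them again hint at why the gain is small.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read two adjacent 16-bit coefficients with a single
 * 32-bit load instead of two 16-bit loads.
 * memcpy avoids alignment and aliasing problems; compilers usually turn it
 * into one plain load when the target allows it. */
static inline void load_coeff_pair(const int16_t *block, int16_t *a, int16_t *b)
{
    uint32_t packed;

    memcpy(&packed, block, sizeof(packed));

    /* Assumption: little-endian layout, so block[0] sits in the low half. */
    *a = (int16_t)(packed & 0xFFFF);
    *b = (int16_t)(packed >> 16);
}
```

The saved load is paid for with unpacking work, so whether this wins depends
on the rest of the loop.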

So that patch (attached to my previous post) is final, and I think it is ready
for commit now :)

I verified it with a synthetic correctness/performance test program:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/tests/?root=mplayer

# ./test-unquantize
dct_unquantize_h263_helper_c time=0.07111 usec per element, 
or 17.8 cycles (250MHz), 29.6 cycles (416MHz)
dct_unquantize_h263_helper_armv5te time=0.03072 usec per element, 
or 7.7 cycles (250MHz), 12.8 cycles (416MHz)

It was tested on a Nokia 770, so the cycle estimate for 250MHz is the valid one.
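
The cycle figures in the output above follow directly from the measured time:
a clock of 1 MHz is one cycle per microsecond, so cycles = usec * MHz. A
small sketch of that arithmetic (the function name usec_to_cycles is made up
for this example and is not taken from the test program):

```c
/* Convert a per-element time in microseconds into CPU cycles.
 * 1 MHz equals one cycle per microsecond, so the product is the cycle count. */
static double usec_to_cycles(double usec_per_element, double clock_mhz)
{
    return usec_per_element * clock_mhz;
}
```

For example, usec_to_cycles(0.07111, 250.0) reproduces the 17.8 cycles
reported for the C version, and usec_to_cycles(0.03072, 250.0) the 7.7 cycles
for the armv5te version.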

So this ARM-optimized code is more than twice as fast (about 2.3x) as the code
generated by gcc 4.1.1 (-march=armv5te -mtune=arm926ej-s -O3 -fomit-frame-pointer)
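
For readers who do not have the source at hand, the kind of loop being
optimized looks roughly like the sketch below. It is written from the general
H.263 inverse quantization scheme, not copied from libavcodec or from the
patch. The function name and parameter list are assumptions for this example,
and qmul and qadd are taken to be precomputed from the quantizer scale.

```c
#include <stdint.h>

/* Sketch of H.263-style inverse quantization over block[first..last].
 * Zero coefficients stay zero. Nonzero ones are scaled by qmul and pushed
 * away from zero by qadd. Hypothetical signature, for illustration only. */
static void unquantize_h263_sketch(int16_t *block, int first, int last,
                                   int qmul, int qadd)
{
    for (int i = first; i <= last; i++) {
        int level = block[i];

        if (level == 0)
            continue;              /* nothing to do for zero coefficients */

        if (level < 0)
            level = level * qmul - qadd;
        else
            level = level * qmul + qadd;

        block[i] = (int16_t)level;
    }
}
```

The per-element work is one load, a test, a multiply-add and a store, which
is why the load/store count matters so much for the overall speed.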

Also tested decoding (using mplayer from svn) of the Doom trailer from:
http://www.divx.com/movies/detail.php?movieID=57&cID=1

Output of gprof before patch:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 10.81      9.73     9.73  5372100     0.00     0.00  mpeg4_decode_block
  9.07     17.90     8.17                             idct_col_put_armv5te
  7.41     24.57     6.67  1497822     0.00     0.00  dct_unquantize_h263_intra_c
  6.83     30.72     6.15  1228760     0.01     0.02  ff_mpeg4_decode_mb
  6.77     36.82     6.10   911294     0.01     0.01  put_pixels16_c
  6.32     42.51     5.69  1074435     0.01     0.02  MPV_motion
  5.92     47.84     5.33                             idct_col_add_armv5te
  5.76     53.03     5.19  1228760     0.00     0.03  MPV_decode_mb
  3.75     56.41     3.38  1497822     0.00     0.00  mpeg4_pred_ac
  3.43     59.50     3.09                             put_pixels8_arm
  2.65     61.89     2.39                             idct_row_armv5te

Output of gprof after patch:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 11.07      9.52     9.52  5372100     0.00     0.00  mpeg4_decode_block
 10.20     18.29     8.77                             idct_col_put_armv5te
  7.22     24.50     6.21   911294     0.01     0.01  put_pixels16_c
  6.71     30.27     5.77  1074435     0.01     0.02  MPV_motion
  6.64     35.98     5.71  1228760     0.00     0.02  ff_mpeg4_decode_mb
  6.35     41.44     5.46                             idct_col_add_armv5te
  5.83     46.45     5.01  1228760     0.00     0.03  MPV_decode_mb
  4.41     50.24     3.79  1497822     0.00     0.00  dct_unquantize_h263_intra_armv5te
  3.54     53.28     3.04  1497822     0.00     0.00  mpeg4_pred_ac
  3.48     56.27     2.99                             put_pixels8_arm
  3.09     58.93     2.66                             idct_row_armv5te

Also tested running mplayer with '-vo md5sum', results are identical.

It is a pity that ARMv5TE does not have SIMD instructions; SIMD would suit
this code much better. Still, this ARM-optimized function is a lot faster
than the gcc-generated code and gives about a 3% overall improvement on this
video file.
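
The overall figure can be cross-checked from the two gprof listings. A rough
sketch of the arithmetic follows (the helper names are invented for this
example). The total profiled time is inferred from one row as self seconds
divided by its percentage, and the saving is the drop in self seconds of the
unquantize function relative to that total.

```c
/* Total profiled time implied by one gprof row: self = percent% of total. */
static double implied_total_seconds(double self_seconds, double percent)
{
    return self_seconds * 100.0 / percent;
}

/* Share of the total run time saved when one function's self time drops. */
static double saving_percent(double self_before, double self_after,
                             double total_seconds)
{
    return (self_before - self_after) * 100.0 / total_seconds;
}
```

With the numbers above, implied_total_seconds(6.67, 7.41) gives about 90
seconds, and the drop from 6.67 to 3.79 seconds of self time works out to
roughly 3.2% of that. The self times of other functions also move a little
between the two runs, so this is only an estimate.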

PS. I have started a thread about ffmpeg optimizations for ARM on the oesf.org
forum: http://www.oesf.org/forums/index.php?showtopic=22280
Maybe it will at least be possible to find some people willing to try
compiling and testing ffmpeg/mplayer on XScale processors :)



