[Ffmpeg-devel] [PATCH] fix mpegaudiodec on ARM and benchmark

Tue Sep 12 21:15:51 CEST 2006

On Thursday 24 August 2006 01:08, Aurelien Jacobs wrote:

...

> > and another idea, try to set -mcpu -march -mtune correctly for the cpu
>
> When setting -march=armv4 or armv4t or armv5 or armv5t it don't even
> compile:
>
> arm-linux-gnu-gcc -DHAVE_AV_CONFIG_H -I.. -I../libavutil
> -Wdeclaration-after-statement -march=armv5t -D_REENTRANT -I/usr/include
> -I/usr/src/DVB/ost/include -I/usr/include/dxr2 -I/usr/local/include/cdda
> -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_ISOC9X_SOURCE    -c -o
> armv4l/dsputil_arm_s.o armv4l/dsputil_arm_s.S armv4l/dsputil_arm_s.S:
> Assembler messages:
> armv4l/dsputil_arm_s.S:77: Error: selected processor does not support `pld
> [r1]' armv4l/dsputil_arm_s.S:88: Error: selected processor does not support
> `pld [r1]' [...]

Hmm, that's interesting. The use of the 'pld' instruction means that ffmpeg
already unconditionally requires armv5te instruction set support.

It may also mean that there are no armv4 users left, since nobody has
complained. Is it ok to assume that armv5te code can be added without any
special configuration option?
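
To illustrate the alternative, here is a minimal sketch of a prefetch hint that
does not hardcode 'pld'. It is not code from ffmpeg: the PREFETCH macro and the
sum_bytes function are made-up names, and it assumes gcc, whose
__builtin_prefetch should (as far as I know) turn into 'pld' when the selected
-march supports it and into nothing otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an ffmpeg macro. With gcc the builtin is only
 * a hint; on targets without a prefetch instruction it emits no code. */
#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Example loop: sum a buffer, hinting one 32-byte line ahead.
 * The 32-byte distance is an assumption, not a tuned value. */
static uint32_t sum_bytes(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        /* Only form the pointer while it stays inside the buffer. */
        if ((i & 31) == 0 && i + 32 < len)
            PREFETCH(buf + i + 32);
        sum += buf[i];
    }
    return sum;
}
```

The hand-written assembly in dsputil_arm_s.S would still need its own guard;
this only shows the idea on the C side.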

Also, though this is not quite related to ffmpeg: mplayer can't be configured
to use HAVE_IWMMXT, so it does not use the full optimizations on an xscale cpu
such as yours. Maybe it is a good idea to fix that?

> Setting -march=armv5te (which is exactly what my Xscale is) is quite
> slower, I don't understand why:
> BENCHMARKs: VC:   0.000s VO:   0.000s A: 206.553s Sys:   0.438s =  206.991s
> BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.7882% Sys:  0.2118% = 100.0000%

...

If I get it right, armv5te is an instruction set and xscale is a cpu design
which implements this instruction set.

For example my cpu also supports armv5te and its description is here (along
with the information about instruction timings):
http://www.arm.com/pdfs/DVI0035B_926_PO.pdf

Your cpu is xscale and its description is probably here:
http://www.intel.com/design/intelxscale/273473.htm

As you see, optimization strategies may be a bit different: the intel cpu has
a longer pipeline (7 stages vs. 5) and has some extra performance penalties if
instructions are not properly ordered.

As for benchmarking, I only noticed a small performance improvement when using
the low quality configure option, but your numbers show a huge performance
boost. Maybe xscale is just much worse at doing long multiplies
(32bit*32bit->64bit)? It is probably a good idea to find some mp3 file freely
available for download and run benchmarks using it, so that the performance of
arm926ej-s and xscale could be directly compared on the ffmp3 decoder.
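
To make it clear which operation I mean, here is a rough sketch of the two
kinds of fixed-point multiply. The names and the FRAC_BITS values are
assumptions picked for illustration, not copied from mpegaudiodec, and the
16-bit variant is only my guess at what a lower precision path looks like.

```c
#include <stdint.h>

/* Assumed fractional precisions, for illustration only. */
#define FRAC_BITS   23
#define FRAC_BITS16 15

/* Long multiply: 32bit*32bit->64bit, then scale back down.
 * On ARM this should map to a single smull plus shifts.
 * Right-shifting a negative value is implementation-defined in C;
 * gcc uses an arithmetic shift. */
static inline int32_t mul_fixed(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Keep only the high 32 bits of the 64-bit product. */
static inline int32_t mul_high(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Lower precision variant: 16bit*16bit->32bit, no long multiply needed. */
static inline int32_t mul_fixed16(int16_t a, int16_t b)
{
    return ((int32_t)a * (int32_t)b) >> FRAC_BITS16;
}
```

If long multiplies are the bottleneck on xscale, a benchmark looping over
mul_fixed versus mul_fixed16 should show a much bigger gap there than on
arm926ej-s.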

By the way, there seems to be a huge performance regression on x86 for the
ffmp3 decoder when switching from gcc 3.4.6 to gcc 4.1.1 (time for decoding
grows from roughly 3.5 seconds to about 5 seconds in my tests).