[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)

Siarhei Siamashka siarhei.siamashka
Sun Apr 20 22:33:43 CEST 2008

On Sunday 20 April 2008, matthieu castet wrote:
> Siarhei Siamashka wrote:
> > Also my additional test program 'test-vfp.c' from
> > https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/test
> >s/?root=mplayer runs successfully and verifies performance, correctness
> > and absence of any incorrect memory accesses outside memory buffers.
> BTW do you plan to resubmit your armv5te optimisations [1] ?
> [1] http://article.gmane.org/gmane.comp.video.ffmpeg.devel/56672

Yes, my intention is to sumbit all the ARM optimizations to FFmpeg
and MPlayer eventually. It may take some time though.

Regarding this ARMv5TE optimized IDCT. It is a refined version of ARMv5TE 
IDCT from FFmpeg, initially implemented by M?ns Rullg?rd. Now almost all the
code was modified (so making diff is useless, though the general structure is
the same). The changes include:
1. Introduced the use of negative constants. ARMv5TE only supports fused
multiply&accumulate instructions, but not multiply&subtract. Older 
code used more instructions to do the same job (separate multiply and 
subtract instructions), new code uses multiply&accumulate with negative
2. Instructions scheduling optimized for longer pipeline (ARM11, XScale),
better registers allocation. The same code is still optimal for ARM9E that 
has shorter pipeline.
3. Dual loads using LDRD instructions (reading two 32-bit registers per cycle)
for ARM11 and XScale. ARM9E also supports LDRD instruction, but it takes 2
cycles, so this optimization does not provide any improvement for ARM9E cores.
4. Saturation using table lookups (it works faster)
5. Better shortcuts for zero coefficients and a tweak for nonoptimal code
chunk identified by Michael Niedermayer when reviewing this code (the same
problem is still unfixed in current FFMpeg ARMv5TE IDCT). 

This all together provides a major speedup, especially when run on ARM11 and
XScale cores. All the benchmarks have been posted long ago.

Now about the problems why this code is still not in FFmpeg SVN. The most
important problem that stalled everything is 64-bit alignment. In order to
perform dual loads, data needs to be 64-bit (8 byte) aligned. This applies
to constants loading and accessing temporary data on stack. Constants loading
requires some kind of '.align' directive in the code. The problem is that
'.align' is nonportable and '.balign' or '.p2align' are not supported by 
broken Apple toolchain on x86 and apparently on ARM too (actually it has a 
lot more issues in addition to this one). One more problem is related to
stack, because legacy systems and toolchains (those that use OABI) support 
only 4 byte stack alignment. There was a tweak in the code that performs 
stack realignment when IDCT code is built for these broken legacy systems 
with absolutely no overhead on modern systems (this code does not get compiled
in for modern EABI systems that already provide 64-bit stack alignment), but
it was rejected by Michael Niedermayer. I strongly disagree with this decision
and breaking support for legacy systems for no reason is not a valid option
for me.

For this 64-bit alignment stuff, it is just possible to get rid of LDRD
instructions replacing them by pairs of normal loads and the problem will be
solved. That will have no impact on ARM9E. Technically, ARM11 needs ARMv6
optimizations anyway (and it has them) and XScale needs IWMMXT optimizations
(except for PXA255 cores that do not support IWMMXT), so it does not look like
a big issue. But http://www.angstrom-distribution.org/ currently uses this 
IDCT for their MPlayer package and it does provide some really visible
improvement over current IDCT. Also I initially planned to do some refinements
for current ARMv6 IDCT from FFMpeg (I expect that it can be also improved a
lot), but these optimizations would also require dual loads and it brings
us to the same alignment problem. Anticipating one more round of
flames/discussions in the list makes me feel uncomfortable.

The last (actually minor) problem is related to crop table that is defined in
C code and used from assembly code. If anybody would ever change MAX_NEG_CROP
constant in C code for whatever reason, assembly implementation of ARMv5TE
IDCT will be silently broken (it will compile, but will break at runtime). I
would prefer to have compilation failure here. A good example is lowres
decoding support that is now broken for many less popular architectures
(lowres decoding was added at some time in the past, but nobody tweaked sparc
or alpha code to properly support it). Another option (which I initially used)
is just to have a copy of crop table stored in assembly source file, it will
take extra space, but will make code reliable and not dependent on any other
stuff from outside.

I really would like to hear some constructive feedback instead of "this is
unacceptable" with no alternatives suggested :)

To sum it up, I would like to ask you to allow me to keep '.balign' directives
(broken toolchains will just have compilation failure, that's not so bad, at
least we will have a chance to see if it breaks in practice for anyone and 
what toolchain they use).

I also still see absolutely no problem with stack realignment code, it
takes only a few lines in the source and DOES NOT expand to anything on 
modern systems which use EABI. So all the talks about the extra code in the
inner loop do not make any sense. If you still want to screw up ARM legacy
systems which can't upgrade their broken toolchains, why do you still care
about gcc 2.95 support? I don't see the logic...

Regarding the crop table, any solution is fine for me, even if I don't like it
myself :) If you confirm your old decision (duplicate MAX_NEG_CROP define in C
and assembly sources, adding comments that one needs to change these constants
simultaneously), it's ok.

Best regards,
Siarhei Siamashka

More information about the ffmpeg-devel mailing list