[FFmpeg-devel] [PATCH] Fix mm_flags, mm_support for ARM

Mon Jun 30 14:57:49 CEST 2008

On Saturday 28 June 2008, M?ns Rullg?rd wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> > On Saturday 28 June 2008, M?ns Rullg?rd wrote:
> >> Michael Niedermayer <michaelni at gmx.at> writes:
> >> > Do we have someone who has a arm cpu and can look into the above
> >> > issue?
> >>
> >> I know exactly why it's different.  In simple_idct.c, the column
> >> transform contains these lines:
> >>
> >>         /* XXX: I did that only to give same values as previous code */
> >>         a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
> >>
> >> It's simpler to code that as a0 = W4 * col[0] + (1 << (COL_SHIFT-1)).
> >> Thinking about it, it only takes one more instruction on NEON, and
> >> I've fixed that in my tree.  With a little luck, the extra instruction
> >> can be dual-issued with something else.
> >
> > This part does not have any extra overhead in my finetuned version
> > of ARMv5TE IDCT:
> >
> >   ldr    v1, xxx         /* v1 = (((1<<(COL_SHIFT-1))/W4)*W4) */
> >   [some unrelated instructions to hide load latency]
> >   smlatt v2, a2, v4, v1  /* A0t = W4 * (col_t[0] +
> > ((1<<(COL_SHIFT-1))/W4)) */
> >
> > There is no reason why ARMv6 or NEON should have overhead too. So getting
> > bit-identical results to C simple_idct is possible without sacrificing
> > performance.
>
> ((1<<(COL_SHIFT-1))/W4)*W4 doesn't fit in 16 bits, so that method
> can't easily be used when everything else is in 16-bit vectors.

But (1 << (COL_SHIFT-1)) does not fit it either. I don't see any significant
difference between:

a0 = W4 * col[0] + (1 << (COL_SHIFT-1));

and

a0 = W4 * col[0] + ((1<<(COL_SHIFT-1))/W4)*W4;

Sure, the first one can be encoded as a constant immediate operand in ARM
instruction and the second one can't, but that's not a big deal (it can be 
loaded from memory) and is unlikely to be a problem for your NEON code.

In any case, you can't get away using only 16-bit values, multiplication
results should be still accumulated somewhere. The difference between
these two implementations is just an initial accumulator value. But of
course, I may be missing something. It would be nice to have a look at your
NEON code.

> Is your armv5te idct a total rewrite of what's in svn, or can the
> changes be broken into sensible steps? If the latter, please send a 
> patch series. 

Do you remember our chat in IRC after you added your implementation of 
ARMv5TE IDCT to SVN? I said that it was a nice improvement over all the
IDCT available at that point for ARM, but still far from optimal. You
said that further improvements would be welcome. So here we are.

I ended up mostly keeping all the code structure (I actually tried
to modify code to process 1 column per iteration, but processing 2
columns still was faster). But all the code had to be changed for
better instructions scheduling and registers allocation.

It can be artificially split into several patches, replacing all parts 
of your IDCT one at a time, but are these steps really sensible?

This further optimization of ARMv5TE IDCT is a result of work approved 
and encouraged by you, so you should have some responsibility too ;)

I wish we had a friendly community of ARM developers here, who would help
each other. After all, am I the only one here who is interested in getting
much faster ARMv5TE IDCT for FFmpeg?

Could you or anybody else having compatible ARM device just do some
benchmarking to confirm my results (I posted benchmarks here multiple 
times already). It would be a really good help. Because I feel that
some people here still doubt that it provides a major performance
improvement.

In order to do this, you need to:

1. Download this source file and drop it over current 'simple_idct_armv5te.S':
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&view=markup
2. Change MAX_NEG_CROP in 'simple_idct_armv5te.S' to 1024
3. Compile FFmpeg and do some benchmarks

Once/if the performance improvement is confirmed, a help with integration
would be really needed. That's not a joke, I really fail to see any problems 
with the "balign/ASMALIGN/stack alignment" stuff, so I can't fix them. A good
example of a solution (a working patch) is very much welcome. 

Does NEON optimized code require more than 32-bit alignment for data and
stack? If yes, I could just wait a bit and use your NEON code as an example 
how do this stuff in FFmpeg-approved way.

> >> > Ideally would be the authors who claimed the code to be identical to
> >> > the C code ...
> >>
> >> I wrote the ARMv6 version, but I never made any such claim.  In fact,
> >> I believe I mentioned at the time that there was a slight difference.
> >>
> >> > If we have noone then we will likely have to disable these IDCTs. I do
> >> > not want to create files that turn green and pink unless they are
> >> > played on an ARM cpu ...
> >>
> >> I don't think the ARM CPUs where these apply will be used mostly for
> >> playback, not encoding, and on those machines every cycle counts.
> >
> > Yes, that was one of the reasons why I did not strongly insist on
> > disabling j_rev_dct_ARM that time (people could get a severe performance
> > regressions and complain about it) :)
> >
> > In any case, ARMv6 idct still needs heavy optimizations, it is not very
> > fast (on its target devices with ARM11 CPUs of course).
>
> Well, it's considerably faster than the C IDCT, but I'm not denying it
> could be improved.  Are you talking about sparse data handling, or
> something else?

It has quite a number of performance problems:
1. it's not using 64-bit load instructions (and dual loads are almost as
useful as dual multiplies for IDCT)
2. heavy functions call/ret overhead (ordinary loops are a lot faster)
3. sparse data handling

It can be clearly seen if you try to benchmark it against optimized 
ARMv5TE IDCT on ARM11, and ARMv6 IDCT fails to provide any significant
performance improvement (actually performance is roughly the same and
I can't say which one is faster). Also ARMv5TE optimized IDCT can be 
easily modified to use ARMv6 saturation instructions instead of table 
lookups. That should save at least 64 cycles per _add/_put function 
and I suspect that this hacked ARMv5TE IDCT would easily leap ahead 
and outperform your ARMv6 IDCT.

We have the following hardware features list:
"dual loads", "dual multiplies", "saturation"

Your IDCT uses only "dual multiplies" and "saturation"

ARMv5TE IDCT uses only "dual loads", but can be trivially modified to use
"saturation" too

Optimal IDCT implementation should use them all, we don't have spare CPU
cycles on embedded devices.

> Did you have any patches? 

I just know how to fix it :) But we are not finished with ARMv5TE yet...

> Remind me again what ARM devices you have.  I recently got my hands on
> a Cortex-A8 (TI OMAP3530), and it's more fun to play around with than
> the Nokia tablets.

Right now I only have ARM926EJ-S core in TI OMAP1710 (Nokia 770) and
ARM1136JF-S in TI OMAP2420 (Nokia N800). But I hope to get this one soon
(well, once they become available for general public): http://beagleboard.org/ 

Anybody else with ARM devices here (usable for FFmpeg development of course)? 

-- 
Best regards,
Siarhei Siamashka