[FFmpeg-devel] [RFC] An improved implementation?of?ARMv5TE?IDCT (simple_idct_armv5te.S)

Michael Niedermayer michaelni
Sat Sep 15 13:37:16 CEST 2007


Hi

On Sat, Sep 15, 2007 at 11:35:43AM +0300, Siarhei Siamashka wrote:
> On 15 September 2007, Michael Niedermayer wrote:
> > Hi
> >
> > On Fri, Sep 14, 2007 at 07:37:01PM +0300, Siarhei Siamashka wrote:
> > [...]
> >
> > > > > +/*
> > > > > + * a local pool with 64-bit constants for 'idct_rows_armv5te'
> > > > > function, + * we align it at 16 byte boundary in order to ensure that
> > > > > it does not cross + * cache line boundary (occupies only a single
> > > > > cache line) + */
> > > > > +        .balign 16
> > > >
> > > > ASMALIGN()
> > >
> > > Does it mean I should define it as a macro at top of
> > > simple_idct_armv5te.S and use it in the rest of the source file?
> >
> > no i mean you should use the ASMALIGN macro as *align is not portable
> > different assemblers interpret it differently or dont support some forms
> > at all
> > i dont know if that is true for assemblers supporting ARM too but ...
> 
> I'm sorry, but please answer my original question. This ASMALIGN() macro
> should be defined somewhere first, otherwise such statement will just result
> in a syntax error. You already rejected a common header file for assembly
> sources, so the only option to define this macro is at the top of
> simple_idct_armv5te.S source. Is that correct? I just wanted to confirm it.
> I'm fine with any solution as long as it compiles and works right.

its defined by configure/config.h
(though i see now its a string which maybe cant be used like that in .S
if so just add a ASMALIGN_S() to configure which can be used in .S files


> 
> By the way, you can check 'info as' for the information about '.align'
> directive. You will notice that '.balign' is a portable way to specify
> alignment across all platforms, as long as gnu assembler is used.
> 
> Do you want to be able to compile the same *.S source with a different
> assembler? In this case this source still has lots of other portability
> issues, namely gnu-specific macro definitions and the use of C-style
> preprocessor.

i just know that .align did cause problems with some assemblers, i dont
remember the details but as a macro exists for it it should be used, its
more consistent and more flexible


> 
> > > > > +        smulbt a2, a4, v5              /* b2 = W5*row[1] */
> > > > > +        smultt v2, a4, v5              /* b3 = W7*row[1] */
> > > > > +        smlatt a2, v1, v6, a2          /* b2 -= W1*row[3] */
> > > > > +        smlatt v2, a3, v7, v2          /* b3 += W3*row[5] */
> > > > > +        smlatt a2, a4, v7, a2          /* b2 += W7*row[5] */
> > > > > +        smlatt v2, v1, v8, v2          /* b3 -= W1*row[7] */
> > > > > +        smlatt a2, a3, v8, a2          /* b2 += W3*row[7] */
> > > > > +        smlabt v2, v1, v6, v2          /* b3 -= W5*row[3] */
> > > >
> > > > somehow i suspect that some speed could be gained by checking  row
> > > > 3/5/7 for being zero?
> > >
> > > Checking values for zero and branching seems to be quite expensive when
> > > inserted in the middle of code, multiplying is very fast. A zero check
> > > and a conditional branch (50%/50% probability) would be only useful on
> > > ARM9E if it lets to skip over 7 or more multiply instructions.
> >
> > columns have all coefficients 0 except the first in ~90% of the cases
> 
> This special case is already handled. Just check the main loop in
> 'idct_rows_armv5te' function.

no, i talk about the column code


[...]
> > > > > +/*
> > > > > + * Enforce 8 byte stack alignment if it is not provided by ABI. Used
> > > > > at the beginning + * of global functions. If stack is not properly
> > > > > aligned, real return address is + * pushed to stack (thus fixing
> > > > > stack alignment) and lr register is set to a thunk + * function
> > > > > 'unaligned_return_thunk_armv5te' which is responsible for providing +
> > > > > * correct return from the function in this case.
> > > > > + */
> > > > > +        .macro idct_stackalign_armv5te
> > > > > +#ifndef DWORD_ALIGNED_STACK
> > > > > +        tst    sp, #4
> > > > > +        strne  lr, [sp, #-4]!
> > > > > +        adrne  lr, unaligned_return_thunk_armv5te
> > > > >  #endif
> > > > > +        .endm
> > > >
> > > > the compiler has to maintain an properly aligned stack and if needed
> > > > has to align it on entry to libavcodec (avcodec_decode_video() and
> > > > such) its not acceptable to realign the stack in the inner loop calling
> > > > the idct
> > >
> > > The compiler has to maintain ABI specified stack alignment (either OABI
> > > or EABI for ARM). If ABI specifies 4-byte alignment (OABI case), we have
> > > to either insert some code to manually align stack, or not use this
> > > function at all. Stack alignment at 8-byte boundary can be performed
> > > really fast with 3 instructions (3 cycles) on entry and one instruction
> > > at exit (5 cycles) in the case when stack alignment was needed. The
> > > overhead of manual stack alignment for OABI is 3+5/2 on average and is
> > > equal to 5.5 cycles (ARM9E). That's quite cheap.
> >
> > as ive said, this code does NOT belong in the inner loop, 
> 
> It depends on what you consider an inner loop. A 600+ cycles function does 
> not look like an inner loop to me. And 3 cycles is a minor overhead (if the
> stack is already 64-bit aligned), which is only compiled in for legacy ABI,
> with no impact on EABI targeted builds at all. 

its not just the idct which could benefit from 64 bit loads, theres alot more
code which could and which is nowhere near as time consuming as the idct,
just look at the motion compensation code for example
you dont seriously suggest that we realign the stack on entry to every
"optimized" function?

and yes the qpel MC code does use the stack as temporary storage


> 
> > its the compilers 
> > job to maintain the user specified alignment (see
> > -mpreferred-stack-boundary in the case of gcc)
> 
> I know about the options supported by gcc, including this one in particular.
> 
> But could you please point me to an exact place where such compilation option
> is specified for building ffmpeg? And for other projects using ffmpeg as part
> of their source tree (mplayer for example)?

if you add code needing 64bit aligned stack, so you will have to add it to
ffmpegs configure


> 
> I have tracked this mailing list for a long time already and have seen quite a
> number of posts about windows related improper stack alignment breakages. I
> must say, I would not like to see something like this happening to ARM as
> well.

gcc, IIRC 4.2 has an attribute for functions to realign the stack
also for gcc prior to that your realign code which you put into the innermost
loop simply can be put into a seperate function which could realign the stack
on calls to avcodec_encode_video() and similar


> 
> In addition, you seem to care about portability. A random compiler X does not
> have to provide an option to support arbitrary stack alignment specified by
> user. There is a reason why such a thing as ABI was invented.

i thought you just said you wrote the asm in a totally non portable way?
so how is the alignment going to make a difference for a different tool chain
which wont support the asm anyway?

and a compiler which supports our asm has to support maintaining the needed
alignment
the arm idct is rather special that a simple realign in the optimized asm code
corrects the problem, with inline asm and arrays defined in C this just wont
work


> 
> > stack realignment due to the ABI providing less is only needed on functions
> > called from the outside (main(), avcodec_encode_video() ...)
> >
> > if gcc doesn maintain the alignment requested by the user its buggy
> 
> Anyway, I'm fine with removing this check as long as a disclaimer is added
> that I'm not taking any responsibility for the stack alignment related
> breakages, and the removal of the stack alignment fixup was one of the
> requirements for this code to be accepted to ffmpeg.

feel free to add a disclaimer saying that moving the stack realignment
code from inner loops to libavcodec entry was a requirement

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Dictatorship naturally arises out of democracy, and the most aggravated
form of tyranny and slavery out of the most extreme liberty. -- Plato
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070915/80867ec3/attachment.pgp>



More information about the ffmpeg-devel mailing list