[FFmpeg-devel] [RFC] An improved implementation of?ARMv5TE?IDCT (simple_idct_armv5te.S)
Michael Niedermayer
michaelni
Sat Sep 15 00:47:56 CEST 2007
Hi
On Fri, Sep 14, 2007 at 07:37:01PM +0300, Siarhei Siamashka wrote:
[...]
> > > +/*
> > > + * a local pool with 64-bit constants for 'idct_rows_armv5te' function,
> > > + * we align it at 16 byte boundary in order to ensure that it does not
> > > cross + * cache line boundary (occupies only a single cache line)
> > > + */
> > > + .balign 16
> >
> > ASMALIGN()
>
> Does it mean I should define it as a macro at top of simple_idct_armv5te.S
> and use it in the rest of the source file?
no i mean you should use the ASMALIGN macro as *align is not portable
different assemblers interpret it differently or dont support some forms
at all
i dont know if that is true for assemblers supporting ARM too but ...
[...]
>
> > or you could check that the rows are non zero before multiplying and adding
> > them
> > same applies to the column code
>
> Do you suggest something like I tried in one of the older revisions
> ('idct_row_full' and 'idct_row_partial' labels)?:
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&rev=82&view=markup
>
> That's a somewhat gray area and the decision is not clear. As most
> videos contain a lot of empty rows (only row[0] is nonzero), code
> >from 'idct_row_partial' would be executed in only about 10% of cases.
> That branch saves us 16 cycles total. On the other hand, the conditional
> branch instruction to 'idct_row_partial' will be always executed and take
> at least 1 cycle. So while 16 cycles saving is still more than 10 times per
> 1 cycle, one more concern is code size. As you could see in my previous post
> with -O2 vs. -O3 benchmarks, mpeg4 decoder actually might be short on
> instructions cache size and increasing code size without a really good
> reason might be not a very good idea.
i dont think a single branch instruction to branch over the code will
significantly increase the code size
[...]
> >
> >
> > [...]
> >
> > > + smulbt a2, a3, v5 /* b0 = W1*row[1] */
> > > + smultt ip, a3, v5 /* tmp = W3*row[1] */
> > > + smultt lr, a4, v6 /* -b1 = W7*row[3] */
> > > + smlatt a2, a3, v6, a2 /* b0 += W3*row[3] */
> > > + smlabt lr, a3, v7, lr /* -b1 += W1*row[5] */
> > > + smlabt a2, a4, v7, a2 /* b0 += W5*row[5] */
> > > + smlabt lr, a4, v8, lr /* -b1 += W5*row[7] */
> > > + smlatt a2, a4, v8, a2 /* b0 += W7*row[7] */
> > > + sub lr, ip, lr /* b1 = -b1 - tmp */
> > >
> > > - ldr pc, [sp], #4
> > > + /* B0 is now calculated (a2), B1 is now calculated (lr) */
> >
> > [---]
> >
> > > + add ip, v1, a2 /* ip = (A0 + B0) */
> > > + sub a2, v1, a2 /* a2 = (A0 - B0) */
> > > + mov ip, ip, asr #ROW_SHIFT
> > > + mov a2, a2, asr #ROW_SHIFT
> > > + strh ip, [a1, #0] /* row[0] = (A0 + B0) >>
> > > ROW_SHIFT */ + strh a2, [a1, #14] /* row[7] = (A0 -
> > > B0) >> ROW_SHIFT */
> > >
> > > - ldr pc, [sp], #4
> > > + ldr v1, m51 /* v1 = ((-W5 & 0xFFFF) | ((-W1 &
> > > 0xFFFF) << 16)) */ +
> > > + add ip, v2, lr /* ip = (A1 + B1) */
> > > + sub a2, v2, lr /* ip = (A1 - B1) */
> > > + mov ip, ip, asr #ROW_SHIFT
> > > + mov a2, a2, asr #ROW_SHIFT
> > > + strh ip, [a1, #2] /* row[1] = (A1 + B1) >>
> > > ROW_SHIFT */ + strh a2, [a1, #12] /* row[6] = (A1 -
> > > B1) >> ROW_SHIFT */ +
> > > + smulbt a2, a4, v5 /* b2 = W5*row[1] */
> > > + smultt v2, a4, v5 /* b3 = W7*row[1] */
> > > + smlatt a2, v1, v6, a2 /* b2 -= W1*row[3] */
> > > + smlatt v2, a3, v7, v2 /* b3 += W3*row[5] */
> > > + smlatt a2, a4, v7, a2 /* b2 += W7*row[5] */
> > > + smlatt v2, v1, v8, v2 /* b3 -= W1*row[7] */
> > > + smlatt a2, a3, v8, a2 /* b2 += W3*row[7] */
> > > + smlabt v2, v1, v6, v2 /* b3 -= W5*row[3] */
> >
> > somehow i suspect that some speed could be gained by checking row 3/5/7
> > for being zero?
>
> Checking values for zero and branching seems to be quite expensive when
> inserted in the middle of code, multiplying is very fast. A zero check and a
> conditional branch (50%/50% probability) would be only useful on ARM9E if it
> lets to skip over 7 or more multiply instructions.
columns have all coefficients 0 except the first in ~90% of the cases
[...]
> > [...]
> >
> > > +/*
> > > + * Enforce 8 byte stack alignment if it is not provided by ABI. Used at
> > > the beginning + * of global functions. If stack is not properly aligned,
> > > real return address is + * pushed to stack (thus fixing stack alignment)
> > > and lr register is set to a thunk + * function
> > > 'unaligned_return_thunk_armv5te' which is responsible for providing + *
> > > correct return from the function in this case.
> > > + */
> > > + .macro idct_stackalign_armv5te
> > > +#ifndef DWORD_ALIGNED_STACK
> > > + tst sp, #4
> > > + strne lr, [sp, #-4]!
> > > + adrne lr, unaligned_return_thunk_armv5te
> > > #endif
> > > + .endm
> >
> > the compiler has to maintain an properly aligned stack and if needed has
> > to align it on entry to libavcodec (avcodec_decode_video() and such)
> > its not acceptable to realign the stack in the inner loop calling the
> > idct
>
> The compiler has to maintain ABI specified stack alignment (either OABI or
> EABI for ARM). If ABI specifies 4-byte alignment (OABI case), we have to
> either insert some code to manually align stack, or not use this function at
> all. Stack alignment at 8-byte boundary can be performed really fast with 3
> instructions (3 cycles) on entry and one instruction at exit (5 cycles) in the
> case when stack alignment was needed. The overhead of manual stack alignment
> for OABI is 3+5/2 on average and is equal to 5.5 cycles (ARM9E). That's quite
> cheap.
as ive said, this code does NOT belong in the inner loop, its the compilers
job to maintain the user specified alignment (see -mpreferred-stack-boundary
in the case of gcc)
stack realignment due to the ABI providing less is only needed on functions
called from the outside (main(), avcodec_encode_video() ...)
if gcc doesn maintain the alignment requested by the user its buggy
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Its not that you shouldnt use gotos but rather that you should write
readable code and code with gotos often but not always is less readable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070915/1f5a8378/attachment.pgp>
More information about the ffmpeg-devel
mailing list