[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE?IDCT (simple_idct_armv5te.S)
Michael Niedermayer
michaelni
Fri Sep 14 15:48:00 CEST 2007
Hi
On Thu, Sep 13, 2007 at 07:34:06PM +0300, Siarhei Siamashka wrote:
> On 12 September 2007, Michael Niedermayer wrote:
> > On Wed, Sep 12, 2007 at 11:31:02AM +0300, Siarhei Siamashka wrote:
> > > I have been working on improving IDCT performance for ARM for some time
> > > already and now think that it is more or less ready (performance wise) to
> > > get accepted upstream:
> > > https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv
> > >4l/simple_idct_armv5te.S?root=mplayer&view=markup
> >
> > [...]
> >
> > > So now the question is: how it would be best to integrate this idct code
> > > into ffmpeg? Should it replace the current simple_idct_armv5te.S file?
> >
> > yes if its always faster
>
> OK, I just thought about adding support for ARMv6 SIMD instructions later and
> other cpu specific optimizations to the same source file, generating optimized
> implementation for each of the instruction sets from a common template via
> macro expansion. In this sense, this file would not be '_armv5te' anymore :)
> But I guess, renaming or moving source files is not a problem and can be done
> any time in the future.
>
> A positive effect is the reuse of some common blocks of code, eliminating code
> duplication. A negative effect is a somewhat impaired readability. Having a
> proper regression tests suite is a requirement to keep such code maintainable.
>
> For example, there are three ways to perform pixel data cropping:
> 1. Similar to old armv5te code (3 instructions per pixel, 192 cycles overhead)
> 2. Use table lookups like in this patch (1 cycle per pixel, one register
> occupied to store address of cropping table, harder code scheduling because of
> high data load latency, potential data cache misses when performing lookups).
> Overall, it seems to work better than 1. in practice.
> 3. Use ARMv6 instructions (no disadvantages)
>
> Addition of ARMv6 instructions use for pixel data cropping to 0-255 range
> should be not difficult.
>
> > > Could you please review the copyright part to check if it is ok?
> >
> > is there anything special on it? it looks like normal LGPL
> >
> > > I can provide a patch with omitted experimental prefetch code, all the
> > > globals getting 'ff_' prefix, use of 'ff_cropTbl' instead of keeping its
> > > own copy of cropping table
> >
> > patch welcome
> >
> > > (by the way, it would be nice to extend MAX_NEX_CROP to 2048
> > > as idct can produce results in +-2K range when feeded with completely
> > > random data on input).
> >
> > patch welcome
>
> Patch attached. In order not to break badly on possible MAX_NEG_CROP
> changes in the future, header file 'dsputil.h' was split into two
> parts: 'dsputil.h' and 'dsputil_asm.h'. The latter is supposed to contain
just add a commet to the #define saying that the ARM code needs to be updated
if the value is changed
[...]
> Index: libavcodec/armv4l/simple_idct_armv5te.S
> ===================================================================
> --- libavcodec/armv4l/simple_idct_armv5te.S (revision 10483)
> +++ libavcodec/armv4l/simple_idct_armv5te.S (working copy)
if the files have nothing in common besides the license headers then please
dont diff them against each other
[...]
> +/*
> + * a local pool with 64-bit constants for 'idct_rows_armv5te' function,
> + * we align it at 16 byte boundary in order to ensure that it does not cross
> + * cache line boundary (occupies only a single cache line)
> + */
> + .balign 16
ASMALIGN()
> +w2266idct_rows_armv5te:
> + .long W22
> + .long W66
> +w1357idct_rows_armv5te:
> + .long W13
> + .long W57
> +
> +/*
> + * A rows processing function. Benchmarks on a few video files show that
> + * about 80-90% of uses of this function have all rows empty except for
> + * the row[0].
> + *
> + * On entry:
> + * a1 - row address
> + * lr - return address
> + *
> + * On exit:
> + * a1 - row address
> + *
> + * Registers usage within this function:
> + * a1 - row address
> + * a2 - temporary register
> + * v5, v6, v7, v8 - row data
> + * v1, v2, v3, v4 - A0, A1, A2 and A3 variables
> + * a3, a4 - used for loading constants
> + * ip - temporary register
> + * lr - temporary register, also holds initial row address value
> + * to check end of loop condition
> + */
> + .balign 32
why 32, didnt you just say cache lines are 16 ?
[...]
> + smlabb v4, a3, v6, v1 /* v4 = v1 - W2*row[2] */
> + smlabb v3, a4, v6, v1 /* v3 = v1 - W6*row[2] */
> + smlatb v2, a4, v6, v1 /* v2 = v1 + W6*row[2] */
> + smlatb v1, a3, v6, v1 /* v1 = v1 + W2*row[2] */
>
[---]
> + smlabb v4, a4, v8, v4 /* v4 -= W6*row[6] */
> + smlatb v3, a3, v8, v3 /* v3 += W2*row[6] */
> + smlabb v2, a3, v8, v2 /* v2 -= W2*row[6] */
> + smlatb v1, a4, v8, v1 /* v1 += W6*row[6] */
> + ldrd a3, w1357idct_rows_armv5te /* a3 = W1 | (W3 << 16) */
> + /* a4 = W5 | (W7 << 16) */
>
[---]
> + smlatb v4, a2, v7, v4 /* v4 += W4*row[4] */
> + smlabb v3, a2, v7, v3 /* v3 -= W4*row[4] */
> + smlabb v2, a2, v7, v2 /* v2 -= W4*row[4] */
> + smlatb v1, a2, v7, v1 /* v1 += W4*row[4] */
>
i think this can be implemented in fewer instructions, someting based on:
v2 = v1 - W4*row[4]
v1 = v1 + W4*row[4]
v3 = v2 - W6*row[2]
v4 = v1 - W2*row[2]
v3 += W2*row[6]
v4 -= W6*row[6]
v2 = 2*v2 - v3
v1 = 2*v1 - v4
or you could check that the rows are non zero before multiplying and adding
them
same applies to the column code
btw do you have a refereence which lists the number of cpu cycles per
instruction?
[...]
> + smulbt a2, a3, v5 /* b0 = W1*row[1] */
> + smultt ip, a3, v5 /* tmp = W3*row[1] */
> + smultt lr, a4, v6 /* -b1 = W7*row[3] */
> + smlatt a2, a3, v6, a2 /* b0 += W3*row[3] */
> + smlabt lr, a3, v7, lr /* -b1 += W1*row[5] */
> + smlabt a2, a4, v7, a2 /* b0 += W5*row[5] */
> + smlabt lr, a4, v8, lr /* -b1 += W5*row[7] */
> + smlatt a2, a4, v8, a2 /* b0 += W7*row[7] */
> + sub lr, ip, lr /* b1 = -b1 - tmp */
>
> - ldr pc, [sp], #4
> + /* B0 is now calculated (a2), B1 is now calculated (lr) */
>
[---]
> + add ip, v1, a2 /* ip = (A0 + B0) */
> + sub a2, v1, a2 /* a2 = (A0 - B0) */
> + mov ip, ip, asr #ROW_SHIFT
> + mov a2, a2, asr #ROW_SHIFT
> + strh ip, [a1, #0] /* row[0] = (A0 + B0) >> ROW_SHIFT */
> + strh a2, [a1, #14] /* row[7] = (A0 - B0) >> ROW_SHIFT */
>
> - ldr pc, [sp], #4
> + ldr v1, m51 /* v1 = ((-W5 & 0xFFFF) | ((-W1 & 0xFFFF) << 16)) */
> +
> + add ip, v2, lr /* ip = (A1 + B1) */
> + sub a2, v2, lr /* ip = (A1 - B1) */
> + mov ip, ip, asr #ROW_SHIFT
> + mov a2, a2, asr #ROW_SHIFT
> + strh ip, [a1, #2] /* row[1] = (A1 + B1) >> ROW_SHIFT */
> + strh a2, [a1, #12] /* row[6] = (A1 - B1) >> ROW_SHIFT */
> +
> + smulbt a2, a4, v5 /* b2 = W5*row[1] */
> + smultt v2, a4, v5 /* b3 = W7*row[1] */
> + smlatt a2, v1, v6, a2 /* b2 -= W1*row[3] */
> + smlatt v2, a3, v7, v2 /* b3 += W3*row[5] */
> + smlatt a2, a4, v7, a2 /* b2 += W7*row[5] */
> + smlatt v2, v1, v8, v2 /* b3 -= W1*row[7] */
> + smlatt a2, a3, v8, a2 /* b2 += W3*row[7] */
> + smlabt v2, v1, v6, v2 /* b3 -= W5*row[3] */
somehow i suspect that some speed could be gained by checking row 3/5/7
for being zero?
[...]
> +/*
> + * Enforce 8 byte stack alignment if it is not provided by ABI. Used at the beginning
> + * of global functions. If stack is not properly aligned, real return address is
> + * pushed to stack (thus fixing stack alignment) and lr register is set to a thunk
> + * function 'unaligned_return_thunk_armv5te' which is responsible for providing
> + * correct return from the function in this case.
> + */
> + .macro idct_stackalign_armv5te
> +#ifndef DWORD_ALIGNED_STACK
> + tst sp, #4
> + strne lr, [sp, #-4]!
> + adrne lr, unaligned_return_thunk_armv5te
> #endif
> + .endm
the compiler has to maintain an properly aligned stack and if needed has
to align it on entry to libavcodec (avcodec_decode_video() and such)
its not acceptable to realign the stack in the inner loop calling the
idct
[...]
> Index: libavcodec/armv4l/dsputil_arm.c
> ===================================================================
> --- libavcodec/armv4l/dsputil_arm.c (revision 10483)
> +++ libavcodec/armv4l/dsputil_arm.c (working copy)
> @@ -29,11 +29,11 @@
> extern void j_rev_dct_ARM(DCTELEM *data);
> extern void simple_idct_ARM(DCTELEM *data);
>
> -extern void simple_idct_armv5te(DCTELEM *data);
> -extern void simple_idct_put_armv5te(uint8_t *dest, int line_size,
> - DCTELEM *data);
> -extern void simple_idct_add_armv5te(uint8_t *dest, int line_size,
> - DCTELEM *data);
> +extern void ff_simple_idct_armv5te(DCTELEM *data);
> +extern void ff_simple_idct_put_armv5te(uint8_t *dest, int line_size,
> + DCTELEM *data);
> +extern void ff_simple_idct_add_armv5te(uint8_t *dest, int line_size,
> + DCTELEM *data);
this renaming belongs to a seperate patch
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
When you are offended at any man's fault, turn to yourself and study your
own failings. Then you will forget your anger. -- Epictetus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070914/f4e88d33/attachment.pgp>
More information about the ffmpeg-devel
mailing list