[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE?IDCT (simple_idct_armv5te.S)

Fri Sep 14 15:48:00 CEST 2007

Hi

On Thu, Sep 13, 2007 at 07:34:06PM +0300, Siarhei Siamashka wrote:
> On 12 September 2007, Michael Niedermayer wrote:
> > On Wed, Sep 12, 2007 at 11:31:02AM +0300, Siarhei Siamashka wrote:
> > > I have been working on improving IDCT performance for ARM for some time
> > > already and now think that it is more or less ready (performance wise) to
> > > get accepted upstream:
> > > https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv
> > >4l/simple_idct_armv5te.S?root=mplayer&view=markup
> >
> > [...]
> >
> > > So now the question is: how it would be best to integrate this idct code
> > > into ffmpeg? Should it replace the current simple_idct_armv5te.S file?
> >
> > yes if its always faster
> 
> OK, I just thought about adding support for ARMv6 SIMD instructions later and
> other cpu specific optimizations to the same source file, generating optimized
> implementation for each of the instruction sets from a common template via
> macro expansion. In this sense, this file would not be '_armv5te' anymore :)
> But I guess, renaming or moving source files is not a problem and can be done
> any time in the future.
> 
> A positive effect is the reuse of some common blocks of code, eliminating code
> duplication. A negative effect is a somewhat impaired readability. Having a
> proper regression tests suite is a requirement to keep such code maintainable.
> 
> For example, there are three ways to perform pixel data cropping:
> 1. Similar to old armv5te code (3 instructions per pixel, 192 cycles overhead)
> 2. Use table lookups like in this patch (1 cycle per pixel, one register
> occupied to store address of cropping table, harder code scheduling because of
> high data load latency, potential data cache misses when performing lookups).
> Overall, it seems to work better than 1. in practice.
> 3. Use ARMv6 instructions (no disadvantages)
> 
> Addition of ARMv6 instructions use for pixel data cropping to 0-255 range
> should be not difficult.
> 
> > > Could you please review the copyright part to check if it is ok?
> >
> > is there anything special on it? it looks like normal LGPL
> >
> > > I can provide a patch with omitted experimental prefetch code, all the
> > > globals getting 'ff_' prefix, use of 'ff_cropTbl' instead of keeping its
> > > own copy of cropping table
> >
> > patch welcome
> >
> > > (by the way, it would be nice to extend MAX_NEX_CROP to 2048
> > > as idct can produce results in +-2K range when feeded with completely
> > > random data on input).
> >
> > patch welcome
> 
> Patch attached. In order not to break badly on possible MAX_NEG_CROP 
> changes in the future, header file 'dsputil.h' was split into two 
> parts: 'dsputil.h' and 'dsputil_asm.h'. The latter is supposed to contain

just add a commet to the #define saying that the ARM code needs to be updated
if the value is changed

[...]
> Index: libavcodec/armv4l/simple_idct_armv5te.S
> ===================================================================
> --- libavcodec/armv4l/simple_idct_armv5te.S	(revision 10483)
> +++ libavcodec/armv4l/simple_idct_armv5te.S	(working copy)

if the files have nothing in common besides the license headers then please
dont diff them against each other

[...]
> +/*
> + * a local pool with 64-bit constants for 'idct_rows_armv5te' function,
> + * we align it at 16 byte boundary in order to ensure that it does not cross
> + * cache line boundary (occupies only a single cache line)
> + */
> +        .balign 16

ASMALIGN()

> +w2266idct_rows_armv5te:
> +        .long W22
> +        .long W66
> +w1357idct_rows_armv5te:
> +        .long W13
> +        .long W57
> +
> +/*
> + * A rows processing function. Benchmarks on a few video files show that
> + * about 80-90% of uses of this function have all rows empty except for
> + * the row[0].
> + *
> + * On entry:
> + * a1              - row address
> + * lr              - return address
> + *
> + * On exit:
> + * a1              - row address
> + *
> + * Registers usage within this function:
> + *  a1             - row address
> + *  a2             - temporary register
> + *  v5, v6, v7, v8 - row data
> + *  v1, v2, v3, v4 - A0, A1, A2 and A3 variables
> + *  a3, a4         - used for loading constants
> + *  ip             - temporary register
> + *  lr             - temporary register, also holds initial row address value
> + *                   to check end of loop condition
> + */
> +        .balign 32

why 32, didnt you just say cache lines are 16 ?

[...]

> +        smlabb v4, a3, v6, v1          /* v4 = v1 - W2*row[2] */
> +        smlabb v3, a4, v6, v1          /* v3 = v1 - W6*row[2] */
> +        smlatb v2, a4, v6, v1          /* v2 = v1 + W6*row[2] */
> +        smlatb v1, a3, v6, v1          /* v1 = v1 + W2*row[2] */
>  
[---]
> +        smlabb v4, a4, v8, v4          /* v4 -= W6*row[6] */
> +        smlatb v3, a3, v8, v3          /* v3 += W2*row[6] */
> +        smlabb v2, a3, v8, v2          /* v2 -= W2*row[6] */
> +        smlatb v1, a4, v8, v1          /* v1 += W6*row[6] */
> +        ldrd   a3, w1357idct_rows_armv5te /* a3 = W1 | (W3 << 16) */
> +                                       /* a4 = W5 | (W7 << 16) */
>  
[---]
> +        smlatb v4, a2, v7, v4          /* v4 += W4*row[4] */
> +        smlabb v3, a2, v7, v3          /* v3 -= W4*row[4] */
> +        smlabb v2, a2, v7, v2          /* v2 -= W4*row[4] */
> +        smlatb v1, a2, v7, v1          /* v1 += W4*row[4] */
>  

i think this can be implemented in fewer instructions, someting based on:

v2 = v1 - W4*row[4]
v1 = v1 + W4*row[4]

v3 = v2 - W6*row[2]
v4 = v1 - W2*row[2]

v3 += W2*row[6]
v4 -= W6*row[6]

v2 = 2*v2 - v3
v1 = 2*v1 - v4

or you could check that the rows are non zero before multiplying and adding 
them
same applies to the column code

btw do you have a refereence which lists the number of cpu cycles per
instruction?

[...]
> +        smulbt a2, a3, v5              /* b0 = W1*row[1] */
> +        smultt ip, a3, v5              /* tmp = W3*row[1] */
> +        smultt lr, a4, v6              /* -b1 = W7*row[3] */
> +        smlatt a2, a3, v6, a2          /* b0 += W3*row[3] */
> +        smlabt lr, a3, v7, lr          /* -b1 += W1*row[5] */
> +        smlabt a2, a4, v7, a2          /* b0 += W5*row[5] */
> +        smlabt lr, a4, v8, lr          /* -b1 += W5*row[7] */
> +        smlatt a2, a4, v8, a2          /* b0 += W7*row[7] */
> +        sub    lr, ip, lr              /* b1 = -b1 - tmp */
>  
> -        ldr    pc, [sp], #4
> +        /* B0 is now calculated (a2), B1 is now calculated (lr) */
>  
[---]
> +        add    ip, v1, a2              /* ip = (A0 + B0) */
> +        sub    a2, v1, a2              /* a2 = (A0 - B0) */
> +        mov    ip, ip, asr #ROW_SHIFT
> +        mov    a2, a2, asr #ROW_SHIFT
> +        strh   ip, [a1, #0]            /* row[0] = (A0 + B0) >> ROW_SHIFT */
> +        strh   a2, [a1, #14]           /* row[7] = (A0 - B0) >> ROW_SHIFT */
>  
> -        ldr    pc, [sp], #4
> +        ldr    v1, m51                 /* v1 = ((-W5 & 0xFFFF) | ((-W1 & 0xFFFF) << 16)) */
> +
> +        add    ip, v2, lr              /* ip = (A1 + B1) */
> +        sub    a2, v2, lr              /* ip = (A1 - B1) */
> +        mov    ip, ip, asr #ROW_SHIFT
> +        mov    a2, a2, asr #ROW_SHIFT
> +        strh   ip, [a1, #2]            /* row[1] = (A1 + B1) >> ROW_SHIFT */
> +        strh   a2, [a1, #12]           /* row[6] = (A1 - B1) >> ROW_SHIFT */
> +
> +        smulbt a2, a4, v5              /* b2 = W5*row[1] */
> +        smultt v2, a4, v5              /* b3 = W7*row[1] */
> +        smlatt a2, v1, v6, a2          /* b2 -= W1*row[3] */
> +        smlatt v2, a3, v7, v2          /* b3 += W3*row[5] */
> +        smlatt a2, a4, v7, a2          /* b2 += W7*row[5] */
> +        smlatt v2, v1, v8, v2          /* b3 -= W1*row[7] */
> +        smlatt a2, a3, v8, a2          /* b2 += W3*row[7] */
> +        smlabt v2, v1, v6, v2          /* b3 -= W5*row[3] */

somehow i suspect that some speed could be gained by checking  row 3/5/7
for being zero?

[...]
> +/*
> + * Enforce 8 byte stack alignment if it is not provided by ABI. Used at the beginning
> + * of global functions. If stack is not properly aligned, real return address is
> + * pushed to stack (thus fixing stack alignment) and lr register is set to a thunk
> + * function 'unaligned_return_thunk_armv5te' which is responsible for providing
> + * correct return from the function in this case.
> + */
> +        .macro idct_stackalign_armv5te
> +#ifndef DWORD_ALIGNED_STACK
> +        tst    sp, #4
> +        strne  lr, [sp, #-4]!
> +        adrne  lr, unaligned_return_thunk_armv5te
>  #endif
> +        .endm

the compiler has to maintain an properly aligned stack and if needed has
to align it on entry to libavcodec (avcodec_decode_video() and such)
its not acceptable to realign the stack in the inner loop calling the
idct 

[...]
> Index: libavcodec/armv4l/dsputil_arm.c
> ===================================================================
> --- libavcodec/armv4l/dsputil_arm.c	(revision 10483)
> +++ libavcodec/armv4l/dsputil_arm.c	(working copy)
> @@ -29,11 +29,11 @@
>  extern void j_rev_dct_ARM(DCTELEM *data);
>  extern void simple_idct_ARM(DCTELEM *data);
>  
> -extern void simple_idct_armv5te(DCTELEM *data);
> -extern void simple_idct_put_armv5te(uint8_t *dest, int line_size,
> -                                    DCTELEM *data);
> -extern void simple_idct_add_armv5te(uint8_t *dest, int line_size,
> -                                    DCTELEM *data);
> +extern void ff_simple_idct_armv5te(DCTELEM *data);
> +extern void ff_simple_idct_put_armv5te(uint8_t *dest, int line_size,
> +                                       DCTELEM *data);
> +extern void ff_simple_idct_add_armv5te(uint8_t *dest, int line_size,
> +                                       DCTELEM *data);

this renaming belongs to a seperate patch

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

When you are offended at any man's fault, turn to yourself and study your
own failings. Then you will forget your anger. -- Epictetus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070914/f4e88d33/attachment.pgp>