[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)
Siarhei Siamashka
siarhei.siamashka
Sat Sep 15 00:06:17 CEST 2007
On 14 September 2007, Michael Niedermayer wrote:
[...]
> > + smlabb v4, a3, v6, v1 /* v4 = v1 - W2*row[2] */
> > + smlabb v3, a4, v6, v1 /* v3 = v1 - W6*row[2] */
> > + smlatb v2, a4, v6, v1 /* v2 = v1 + W6*row[2] */
> > + smlatb v1, a3, v6, v1 /* v1 = v1 + W2*row[2] */
>
> [---]
>
> > + smlabb v4, a4, v8, v4 /* v4 -= W6*row[6] */
> > + smlatb v3, a3, v8, v3 /* v3 += W2*row[6] */
> > + smlabb v2, a3, v8, v2 /* v2 -= W2*row[6] */
> > + smlatb v1, a4, v8, v1 /* v1 += W6*row[6] */
> > + ldrd a3, w1357idct_rows_armv5te /* a3 = W1 | (W3 << 16) */
> > + /* a4 = W5 | (W7 << 16) */
>
> [---]
>
> > + smlatb v4, a2, v7, v4 /* v4 += W4*row[4] */
> > + smlabb v3, a2, v7, v3 /* v3 -= W4*row[4] */
> > + smlabb v2, a2, v7, v2 /* v2 -= W4*row[4] */
> > + smlatb v1, a2, v7, v1 /* v1 += W4*row[4] */
>
> I think this can be implemented in fewer instructions, something based on:
>
> v2 = v1 - W4*row[4]
> v1 = v1 + W4*row[4]
>
> v3 = v2 - W6*row[2]
> v4 = v1 - W2*row[2]
>
> v3 += W2*row[6]
> v4 -= W6*row[6]
>
> v2 = 2*v2 - v3
> v1 = 2*v1 - v4
I took a close look at it. That really should do the job (with each statement
mapping to one instruction), so we can save a whole 4 cycles thanks to it.
Though I'm a bit worried about possible overflows because of the *2
multiplication in the last two statements, so this code would not be completely
identical to the C implementation of simple_idct in some extreme cases of input
data. Should we assume some sane restrictions on the input data for regression
testing?
Anyway, I will try to provide an updated revision of the patch tomorrow with
this optimization included.
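
To make the comparison concrete, here is a minimal C sketch of both forms of
the even part: the direct one (12 multiply-accumulates, following the comments
in the quoted assembly) and the suggested one (8 statements). It is only a
model, not the patch itself. The constant values, the function names and the
"base" parameter (standing for v1 before row[2], row[4] and row[6] are added)
are assumptions made for illustration. All arithmetic is done on uint32_t so
that overflow wraps modulo 2^32 instead of being undefined behaviour in C;
whether that matches the real code depends on the instructions actually used
(nothing saturating, nothing inspecting the intermediate values).

```c
#include <stdint.h>

/* Illustrative coefficient values (assumed, not taken from the patch). */
#define W2 21407u
#define W4 16383u
#define W6 8867u

/* Sign-extend a 16-bit coefficient, then reinterpret it modulo 2^32. */
static uint32_t ext(int16_t x)
{
    return (uint32_t)(int32_t)x;
}

/* Direct form: every output accumulates its own three products. */
static void even_direct(uint32_t base, const int16_t *row, uint32_t v[4])
{
    uint32_t r2 = ext(row[2]), r4 = ext(row[4]), r6 = ext(row[6]);

    v[0] = base + W2 * r2 + W6 * r6 + W4 * r4;   /* v1 */
    v[1] = base + W6 * r2 - W2 * r6 - W4 * r4;   /* v2 */
    v[2] = base - W6 * r2 + W2 * r6 - W4 * r4;   /* v3 */
    v[3] = base - W2 * r2 - W6 * r6 + W4 * r4;   /* v4 */
}

/* Suggested form: v1 and v2 are recovered from v4 and v3 at the end. */
static void even_butterfly(uint32_t base, const int16_t *row, uint32_t v[4])
{
    uint32_t r2 = ext(row[2]), r4 = ext(row[4]), r6 = ext(row[6]);
    uint32_t v1, v2, v3, v4;

    v2 = base - W4 * r4;
    v1 = base + W4 * r4;

    v3 = v2 - W6 * r2;
    v4 = v1 - W2 * r2;

    v3 += W2 * r6;
    v4 -= W6 * r6;

    v2 = 2 * v2 - v3;   /* the doubling may wrap around here */
    v1 = 2 * v1 - v4;   /* and here */

    v[0] = v1;
    v[1] = v2;
    v[2] = v3;
    v[3] = v4;
}

/* Returns 1 when both forms produce bit-identical results. */
static int forms_agree(uint32_t base, const int16_t *row)
{
    uint32_t a[4], b[4];
    int i;

    even_direct(base, row, a);
    even_butterfly(base, row, b);
    for (i = 0; i < 4; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling forms_agree() with a chosen base and an 8-element row compares the two
forms. In this model, addition, subtraction and multiplication modulo 2^32
give the same final bits no matter where an intermediate value wraps, so any
difference from the C code could only come from an operation that does not
wrap (for example a saturating instruction), which this sketch does not model.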