[Ffmpeg-devel] [PATCH] SSE counterpart of ff_imdct_calc_3dn2

Michael Niedermayer michaelni
Sun Aug 20 16:04:27 CEST 2006


Hi

On Sun, Aug 20, 2006 at 06:15:06PM +0800, Zuxy Meng wrote:
> Hi,
> 
> The patch is simply a re-write of Loren's recent work. fft-test shows
> a speed-up around 18%~20% in my Pentium M 2G, not very exciting but
> faster indeed. Please kindly take a review.

no objections to the patch, but see the comments below


[...]

> +void ff_imdct_calc_sse(MDCTContext *s, FFTSample *output,
> +                       const FFTSample *input, FFTSample *tmp)
> +{
> +    long k, n8, n4, n2, n;
> +    const uint16_t *revtab = s->fft.revtab;
> +    const FFTSample *tcos = s->tcos;
> +    const FFTSample *tsin = s->tsin;
> +    const FFTSample *in1, *in2;
> +    FFTComplex *z = (FFTComplex *)tmp;
> +
> +    n = 1 << s->nbits;
> +    n2 = n >> 1;
> +    n4 = n >> 2;
> +    n8 = n >> 3;
> +
> +    asm volatile ("movaps %0, %%xmm7\n\t"::"m"(*p1m1p1m1));
> +    
> +    /* pre rotation */
> +    in1 = input;
> +    in2 = input + n2 - 4;
> +    
> +    /* Complex multiplication 
> +       Two complex products per iteration, we could have 4 with 8 xmm
> +       registers, 8 with 16 xmm registers.
> +       Maybe we should unroll more.
> +    */
> +    for (k = 0; k < n4; k += 2) {
> +        asm volatile (
> +            "movaps          %0, %%xmm0 \n\t"   // xmm0 = r0 X  r1 X : in2
> +            "movaps          %1, %%xmm3 \n\t"   // xmm3 = X  i1 X  i0: in1
> +            "movlps          %2, %%xmm1 \n\t"   // xmm1 = X  X  R1 R0: tcos
> +            "movlps          %3, %%xmm2 \n\t"   // xmm2 = X  X  I1 I0: tsin
> +            "shufps $95, %%xmm0, %%xmm0 \n\t"   // xmm0 = r1 r1 r0 r0
> +            "shufps $160,%%xmm3, %%xmm3 \n\t"   // xmm3 = i1 i1 i0 i0

> +            "unpcklps    %%xmm2, %%xmm1 \n\t"   // xmm1 = I1 R1 I0 R0

the unpcklps above and one memory read can be avoided by changing the layout
of the tsin/tcos tables (storing cos and sin interleaved in one table). that
would also reduce the number of pointers and maybe avoid the register shortage
gcc ends up with below
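
A minimal sketch of the table change suggested above, assuming the two tables are merged into a single array in the order R0 I0 R1 I1 ..., which is the layout the unpcklps produces in xmm1. The helper name interleave_twiddles is hypothetical and not part of the patch; it only illustrates the layout. With such a table a single 16-byte load could, in principle, replace the two movlps loads and the unpcklps.

```c
#include <assert.h>
#include <stdlib.h>

typedef float FFTSample;

/* Hypothetical helper (not in the patch): build one table laid out as
 * R0 I0 R1 I1 ... from separate cos and sin tables of n4 entries each.
 * Plain malloc is used to keep the sketch short; a table that is read
 * with movaps must be 16-byte aligned, so real code would need an
 * aligned allocator instead. */
static FFTSample *interleave_twiddles(const FFTSample *tcos,
                                      const FFTSample *tsin, int n4)
{
    FFTSample *tw;
    int k;

    assert(n4 > 0);
    tw = malloc(2 * (size_t)n4 * sizeof(*tw));
    if (!tw)
        return NULL;

    for (k = 0; k < n4; k++) {
        tw[2 * k]     = tcos[k];   /* real part (R) */
        tw[2 * k + 1] = tsin[k];   /* imaginary part (I) */
    }
    return tw;
}
```

The loop would then need only one table pointer instead of two, which is where the reduction in pointers comes from.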


> +            "movaps      %%xmm1, %%xmm2 \n\t"   // xmm2 = I1 R1 I0 R0
> +            "xorps       %%xmm7, %%xmm2 \n\t"   // xmm2 = -I1 R1 -I0 R0
> +            "mulps       %%xmm1, %%xmm0 \n\t"   // xmm0 = rI rR rI rR
> +            "shufps $177,%%xmm2, %%xmm2 \n\t"   // xmm2 = R1 -I1 R0 -I0
> +            "mulps       %%xmm2, %%xmm3 \n\t"   // xmm3 = Ri -Ii Ri -Ii
> +            "addps       %%xmm3, %%xmm0 \n\t"   // xmm0 = result
> +            ::"m"(in2[-2*k]), "m"(in1[2*k]),
> +              "m"(tcos[k]), "m"(tsin[k])
> +        );
> +        /* Should be in the same block, hack for gcc2.95 & gcc3 */
> +        asm (
> +            "movlps      %%xmm0, %0     \n\t"
> +            "movhps      %%xmm0, %1     \n\t"
> +            :"=m"(z[revtab[k]]), "=m"(z[revtab[k + 1]])
> +        );
> +    }

what about writing the whole loop in asm? i bet you can do better than gcc :)
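
For checking an all-asm rewrite of the loop, a scalar version of the same pre-rotation can serve as a reference. This is a sketch only, derived from the register comments in the quoted asm (r taken from the end of the first half of the input walking backwards, i from the start walking forwards, each pair multiplied by tcos[k] + j*tsin[k]). The function name prerotate_ref is hypothetical, and the typedefs are repeated here only so the block stands alone.

```c
#include <assert.h>
#include <stdint.h>

typedef float FFTSample;
typedef struct { FFTSample re, im; } FFTComplex;

/* Hypothetical scalar reference for the pre-rotation loop.
 * n is the transform size (1 << nbits); z receives n/4 complex values
 * stored at the bit-reversed positions given by revtab. */
static void prerotate_ref(FFTComplex *z, const FFTSample *input,
                          const FFTSample *tcos, const FFTSample *tsin,
                          const uint16_t *revtab, int n)
{
    int n2 = n >> 1, n4 = n >> 2, k;
    const FFTSample *in1 = input;           /* walks forward, step 2  */
    const FFTSample *in2 = input + n2 - 1;  /* walks backward, step 2 */

    assert(n >= 4);
    for (k = 0; k < n4; k++) {
        FFTSample r = *in2, i = *in1;
        int j = revtab[k];

        /* (r + j*i) * (tcos[k] + j*tsin[k]) */
        z[j].re = r * tcos[k] - i * tsin[k];
        z[j].im = r * tsin[k] + i * tcos[k];

        in1 += 2;
        in2 -= 2;
    }
}
```

Comparing the z array produced by the asm loop against this one on a few inputs should catch shuffle-constant or sign-mask mistakes early.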


[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In the past you could go to a library and read, borrow or copy any book
Today you'd get arrested for mere telling someone where the library is



