[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis

Michael Niedermayer michaelni
Wed Jul 16 12:40:28 CEST 2008


On Tue, Jul 15, 2008 at 03:01:12PM -0600, Loren Merritt wrote:
> On Tue, 15 Jul 2008, Michael Niedermayer wrote:
> > On Tue, Jul 15, 2008 at 08:58:23AM -0600, Loren Merritt wrote:
> >> On Tue, 15 Jul 2008, Michael Niedermayer wrote:
> > [...]
> >>> It might also be worth looking at mplayer/liba52/resample_mmx.c, maybe
> >>> some of that code could be reused. Especially as we do not have an MMX
> >>> float_to_int16; besides, the trick used could be tried with SSE2.
> >>
> >> I'm not very interested in optimizing for pentium2 / k6-1. I'm not sure I
> >> could, anyway; that's so far removed from anything I can benchmark on.
> >
> > Well, maybe you are interested in a Merom-2M
> > Your SSE2                           : 16009
> > My ancient MMX trick ported to SSE2 : 14764
> 
> Don't forget to include the cost of add_bias, since you're returning to 
> [384.0,386.0] scale.
> 
> Merom-2M (T5470), 1024 samples, 2 channels
> svn sse2 : 14751
> your sse2: 13630 + bias during windowing or something
> below    : 17237
> 
> @@ -2223,9 +2225,15 @@
> )
> 
> FLOAT_TO_INT16_INTERLEAVE(sse2,
> +    "movdqa ff_pd_0x43c08000, %%xmm7 \n"
> +    "movdqa ff_ps_385, %%xmm6   \n"
>       "1:                         \n"
> -    "cvtps2dq  (%2,%0), %%xmm0  \n"
> -    "cvtps2dq  (%3,%0), %%xmm1  \n"
> +    "movdqa    (%2,%0), %%xmm0  \n"
> +    "movdqa    (%3,%0), %%xmm1  \n"
> +    "addps      %%xmm6, %%xmm0  \n"
> +    "addps      %%xmm6, %%xmm1  \n"
> +    "psubd      %%xmm7, %%xmm0  \n"
> +    "psubd      %%xmm7, %%xmm1  \n"
>       "packssdw   %%xmm1, %%xmm0  \n"
>       "movhlps    %%xmm0, %%xmm1  \n"
>       "punpcklwd  %%xmm1, %%xmm0  \n"

Mixing float and integer instructions has some extra overhead IIRC.
Here is another comparison that places the float part in its own loop; I
assume such a loop exists either way.

@@ -2222,8 +2224,22 @@
     "emms                       \n"
 )
 
+DECLARE_ALIGNED_16(const xmm_t, ff_pd_0x43c08000  ) = {0x43c0800043c08000ULL, 0x43c0800043c08000ULL};
+DECLARE_ALIGNED_16(const float, ff_ps_385[4]) = { 385, 385, 385, 385 };
+
 FLOAT_TO_INT16_INTERLEAVE(sse2,
+    "movdqa ff_pd_0x43c08000, %%xmm7 \n"
+    "movdqa ff_ps_385, %%xmm6   \n"
+    "push %0                    \n"
     "1:                         \n"
+    "movaps  (%2,%0), %%xmm0  \n"
+    "movaps  (%3,%0), %%xmm1  \n"
+    "movaps  %%xmm0, (%2,%0)  \n"
+    "movaps  %%xmm1, (%3,%0)  \n"
+    "add $16, %0                \n"
+    "js 1b                      \n"
+    "pop %0                     \n"
+    "1:                         \n"
     "cvtps2dq  (%2,%0), %%xmm0  \n"
     "cvtps2dq  (%3,%0), %%xmm1  \n"
     "packssdw   %%xmm1, %%xmm0  \n"

VS.

@@ -2222,10 +2224,28 @@
     "emms                       \n"
 )
 
+DECLARE_ALIGNED_16(const xmm_t, ff_pd_0x43c08000  ) = {0x43c0800043c08000ULL, 0x43c0800043c08000ULL};
+DECLARE_ALIGNED_16(const float, ff_ps_385[4]) = { 385, 385, 385, 385 };
+
 FLOAT_TO_INT16_INTERLEAVE(sse2,
+    "movdqa ff_pd_0x43c08000, %%xmm7 \n"
+    "movdqa ff_ps_385, %%xmm6   \n"
+    "push %0                    \n"
     "1:                         \n"
-    "cvtps2dq  (%2,%0), %%xmm0  \n"
-    "cvtps2dq  (%3,%0), %%xmm1  \n"
+    "movaps  (%2,%0), %%xmm0  \n"
+    "movaps  (%3,%0), %%xmm1  \n"
+    "addps      %%xmm6, %%xmm0  \n"
+    "addps      %%xmm6, %%xmm1  \n"
+    "movaps  %%xmm0, (%2,%0)  \n"
+    "movaps  %%xmm1, (%3,%0)  \n"
+    "add $16, %0                \n"
+    "js 1b                      \n"
+    "pop %0                     \n"
+    "1:                         \n"
+    "movdqa  (%2,%0), %%xmm0  \n"
+    "movdqa  (%3,%0), %%xmm1  \n"
+    "psubd %%xmm7, %%xmm0\n"
+    "psubd %%xmm7, %%xmm1\n"
     "packssdw   %%xmm1, %%xmm0  \n"
     "movhlps    %%xmm0, %%xmm1  \n"
     "punpcklwd  %%xmm1, %%xmm0  \n"

addps-psubd 23607
cvtps2dq    24861
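
For readers who do not want to trace the inline asm, the structure of the
addps-psubd variant can be sketched in plain C. This is only an illustration:
the function name, the sat16 helper and the argument layout are made up here,
it is not the code that was benchmarked, and it assumes IEEE 754 single
precision floats with samples roughly in [-1.0, 1.0].

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Saturate a 32-bit value to the int16 range, as packssdw does per lane. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical two-pass stereo conversion; left/right are modified in place. */
static void float_to_int16_interleave_2pass(int16_t *dst, float *left,
                                            float *right, size_t n)
{
    size_t i;

    /* Pass 1: float-only work (the addps loop), moves samples to [384,386]. */
    for (i = 0; i < n; i++) {
        left[i]  += 385.0f;
        right[i] += 385.0f;
    }

    /* Pass 2: integer-only work (movdqa, psubd, packssdw, interleave). */
    for (i = 0; i < n; i++) {
        uint32_t l, r;
        memcpy(&l, &left[i],  sizeof l);   /* reinterpret bits, no conversion */
        memcpy(&r, &right[i], sizeof r);
        /* Unsigned subtract wraps like psubd; the cast back to signed is
         * implementation-defined in C but wraps on mainstream compilers. */
        dst[2 * i]     = sat16((int32_t)(l - 0x43c08000u));
        dst[2 * i + 1] = sat16((int32_t)(r - 0x43c08000u));
    }
}
```

The split keeps the float add and the integer subtract out of the same loop
body, which is the point of the comparison; whether a compiler vectorizes
either pass is a separate question.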

I do not know exactly why mine is still faster. It is possible the difference
is partly due to the simplifications in the comparison, but for me it is still
faster even with the addps, when the addps is in a separate loop.
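
As a reminder of why an integer subtract can replace cvtps2dq here: 385.0f has
the bit pattern 0x43c08000, and floats in [256,512) have a spacing of 2^-15,
so after adding 385 the low mantissa bits hold the sample scaled by 32768. A
scalar sketch follows, assuming IEEE 754 single precision and the default
round-to-nearest mode; the helper name is made up for illustration and is not
part of any patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a float in about [-1.0, 1.0] to a 16-bit
 * sample value without a float-to-int conversion instruction. */
static int32_t bias_float_to_int(float x)
{
    float biased = x + 385.0f;            /* the addps step */
    uint32_t bits;

    memcpy(&bits, &biased, sizeof bits);  /* reinterpret the bits as integer */
    /* The psubd step: remove the bit pattern of 385.0f. The unsigned
     * subtract wraps; the cast back to signed is implementation-defined
     * in C but wraps on mainstream compilers. */
    return (int32_t)(bits - 0x43c08000u);
}
```

An input of 1.0 yields 32768, one past INT16_MAX, which is why the saturating
packssdw that follows in the asm still matters.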

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I count him braver who overcomes his desires than him who conquers his
enemies for the hardest victory is over self. -- Aristotle