[FFmpeg-devel] [RFC] snow SSE2 optimizations (was: Re: [FFmpeg-cvslog] r10223 - in trunk/libavcodec/i386: dsputil_mmx.c snowdsp_mmx.c)
Tue Aug 28 05:32:04 CEST 2007
On Tue, Aug 28, 2007 at 12:07:02AM +0200, Reimar Döffinger wrote:
> On Mon, Aug 27, 2007 at 11:34:44PM +0200, Michael Niedermayer wrote:
> > > > also there's some shift by 4 missing here
> > >
> > > I don't think so, there is a "psraw $4, %%xmm0 \n\t"
> > > further down. And I know the code is an unreadable mess. I'll try to
> > > reimplement it sometime if no one else will do it...
> > the data after OBMC is 16-bit unsigned, the data after the IDWT is 13-bit
> > signed; the white point differs by a factor of 16, so a shift by 4 is needed
> > to get them on the same level before adding ...
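A scalar sketch of the scaling argument above, not the actual snow code. It assumes the OBMC result holds the pixel value times 256 and the IDWT result holds it times 16, which is one way to get the factor of 16 mentioned. The function name, the rounding step and the final clamp are illustrative assumptions.

```c
#include <stdint.h>

/* Hypothetical scalar model: bring both inputs to the same scale, then add. */
static int combine_sample(uint16_t obmc_sum, int16_t idwt)
{
    int v = obmc_sum >> 4;   /* assumed pixel*256 -> pixel*16 (the shift by 4) */
    v += idwt;               /* both terms are now at pixel*16 */
    if (v < 0)
        v = 0;               /* clamp low before shifting again */
    v = (v + 8) >> 4;        /* assumed rounding back to pixel scale */
    if (v > 255)
        v = 255;             /* clamp high to 8 bits */
    return v;
}
```

Without the first shift the OBMC term would be 16 times too large relative to the IDWT term.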
> Right, right, I just missed a few lines of code while reading the C
> version, thus the confusion.
> Since the diff is unreadable, do you think the following is better than
> the current code (I mean visually, it does decode correctly after all ;-),
> though it is not measurably faster than the mmx code on my PC):
SSE2 is rarely faster than MMX; that's because most CPUs need 2x as long to
execute SSE2 instructions as MMX ones ...
and yes, the code is MUCH more readable than before
> load_block_twolines(2, "%%xmm2", "%%xmm6")
> load_obmc_twolines (8, 16, "%%xmm0", "%%xmm4")
> "pmullw %%xmm0, %%xmm2 \n\t"
> "pmullw %%xmm4, %%xmm6 \n\t"
> "paddusw %%xmm2, %%xmm1 \n\t"
> "paddusw %%xmm6, %%xmm5 \n\t"
paddw will do, no usw is needed, though it doesn't hurt; it's also the same
speed everywhere IIRC ...
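A scalar model of one 16-bit lane, to show when the two adds agree. paddw wraps around on overflow and paddusw saturates at 0xFFFF (the SSE2 intrinsics are _mm_add_epi16 and _mm_adds_epu16). The helper names are made up, and the assumption that the weighted sum never exceeds 65535 (e.g. weights summing to 256 applied to 8-bit pixels) is one plausible reading of why plain paddw suffices, not something stated in the mail.

```c
#include <stdint.h>

/* Models one lane of paddw: 16-bit add with wraparound. */
static uint16_t add_wrap(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Models one lane of paddusw: unsigned 16-bit add, saturating at 0xFFFF. */
static uint16_t add_sat(uint16_t a, uint16_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 0xFFFFu ? 0xFFFFu : (uint16_t)s;
}

/* Sums four products of an 8-bit pixel and an assumed weight of 64 each
 * (total weight 256), once with each kind of add. */
static int sums_agree(uint8_t pixel)
{
    uint16_t term = (uint16_t)(pixel * 64);
    uint16_t w = 0, s = 0;
    for (int i = 0; i < 4; i++) {
        w = add_wrap(w, term);
        s = add_sat(s, term);
    }
    return w == s;
}
```

As long as the true sum fits in 16 bits the two give identical results; they differ only once the sum would overflow.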
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato