[Ffmpeg-devel] [patch] Snow - add_yblock x86 asm

Sun Apr 10 06:58:46 CEST 2005

On Sat, 2005-04-09 at 16:40 +0200, Michael Niedermayer wrote: 
> Hi
> 
> On Saturday 09 April 2005 16:05, Yartrebo wrote:
> >
> > okay, if you like it that way. Could you tell me how to do the run-time
> > code selection and how to place it in dsputil?
> 
> just look at libavcodec/{,i386/}dsputil*.{c,h}, its very simple
> theres a DSPContext with function pointers, which get set to the specific 
> implementation like plain c or mmx
> the actual implementations should also be in their own files as dsputil*.c is 
> already quite big (snow_dsp.c and snow_dsp_mmx.c might be possible names)
> 
I would really like to not put it in dsputil, and here are my reasons:
  - They are snow specific and would require a large and awkward
    interface with zero chance of reuse.
  - Performance is substantially compromised. On my slow P4, the 8x8 SSE2
    function takes nary 420 clocks to finish. A large indirect function call
     will do a serious number on the execution time.

  - I have them cleanly tucked away. In all, doing
    it my way adds 7 lines to snow.c (versus 4 or so for dsputil).
  - Run-time checking and compile checking are both supported.
  - This implementation is much more efficient. Overhead is a handful of
    clocks instead of possibly over 100 for a large (9 or so parameters)
    indirect function call. My implementation is 100% pure preprocessor code
    if you ignore comments (7 preprocessor lines in snow.c, about 640 in
    snow_mmx_sse2.h, and about a dozen in mmx.h). The code is directly inserted
    into add_yblock_buffered() and results in a much smaller executable size AND
    a faster executable.
  - It's clean. dsputil really doesn't need more garbage.

On a similar note, I propose keeping the name snow_mmx_sse2.h for the
implementation, as I don't want it to be confused with dsputil.
> 
> >
> >
> > LOG2_OBMC_MAX is kept to 6 for the C because you wanted the minimum bits
> > possible for the general implementation.
> 
> i just wanted to keep the constants fit into 6bits not necessary keep the 
> actual table unchanged, i have no problem with multiplying them with 4 or 
> 1024 or whatever as long as it doesnt change the bitstream / we can reverse 
> it if its slower on some cpu, without breaking compatibility
> 
fixed. constants are now all multiplied by four, and MAX_OBMC_LOG is now 8.
> 
> [...]
> > > > +#define add_yblock_bw_8_obmc_16_mmx \
> > > > +        pxor_r2r(mm7, mm7);\
> > > > +        pcmpeqd_r2r(mm4, mm4);\
> > > > +        pslld_i2r(31, mm4);\
> > >
> > > please dont mix mmx.h with asm() style
> > >
> > > > +             "movq (%%esi), %%xmm0;       \n\t"\
> > > > +             "movl 12(%%eax), %%edx;      \n\t"\
> > >
> > > i dont think this will work on x86-64
> >
> > Unfortunately, none of my pointer code will work on x86-64. To be
> > honest, I thought that mplayer assumed that int == * == int32_t == 4 in
> > a lot of places and that mplayer would be forever 32 bit.
> >
> > Would you have any suggestions as to how to make code run in both 32 bit
> > and 64 bit modes without much or any loss of performance?
> 
> please see the x86-64 patch, which should explain the problem better
> http://www1.mplayerhq.hu/cgi-bin/cvsweb.cgi/ffmpeg/libavcodec/i386/dsputil_mmx.c.diff?r1=1.87&r2=1.88&cvsroot=FFMpeg
> 
fixed, hopefully. I need someone with an AMD-64 machine to test it.

Also, I've taken the liberty to improve my slice_buffer by using a single
malloc. I didn't realize that it would have a measurable impact.

I've also filled out the set of #define REG_x [e,r]ax in mmx.h because
my code needs them for AMD-64 interoperability.

Sincerely,
Robert Edele

-------------- next part --------------
A non-text attachment was scrubbed...
Name: snow_add_yblock_mmx_sse2.patch
Type: text/x-patch
Size: 39632 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20050410/d46754a9/attachment.bin>