[FFmpeg-devel] [PATCH] Optimized unscaled yuvp9/yuvp10 -> yuvp16 conversion.

Sun Aug 12 14:47:09 CEST 2012

On Sat, Aug 11, 2012 at 04:52:19PM +0200, Michael Niedermayer wrote:
> On Sat, Aug 11, 2012 at 02:18:36PM +0200, Reimar Döffinger wrote:
> > About 30% faster on 32 bit Atom, 120% faster on 64 bit Phenom2.
> > This is interesting because supporting P16 is easier in e.g.
> > OpenGL (can misuse support for any 2-component 8 bit format),
> > whereas supporting p9/p10 without conversion needs a texture
> > format with at least 14 bits actual precision.
> > 
> > Signed-off-by: Reimar Döffinger <Reimar.Doeffinger at gmx.de>
> > ---
> >  libswscale/swscale_unscaled.c |   26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c
> > index c391a07..6618966 100644
> > --- a/libswscale/swscale_unscaled.c
> > +++ b/libswscale/swscale_unscaled.c
> > @@ -830,7 +830,33 @@ static int planarCopyWrapper(SwsContext *c, const uint8_t *src[],
> >                          srcPtr  += srcStride[plane];
> >                      }
> >                  } else if (src_depth <= dst_depth) {
> > +                    int orig_length = length;
> >                      for (i = 0; i < height; i++) {
> > +                        if(isBE(c->srcFormat) == HAVE_BIGENDIAN &&
> > +                           isBE(c->dstFormat) == HAVE_BIGENDIAN) {
> > +                             unsigned shift = dst_depth - src_depth;
> > +                             length = orig_length;
> > +#if HAVE_FAST_64BIT
> > +#define FAST_COPY_UP(shift) \
> > +    for (j = 0; j < length - 3; j += 4) { \
> > +        uint64_t v = AV_RN64A(srcPtr2 + j); \
> > +        AV_WN64A(dstPtr2 + j, v << shift); \
> > +    } \
> > +    length &= 3;
> > +#else
> > +#define FAST_COPY_UP(shift) \
> > +    for (j = 0; j < length - 1; j += 2) { \
> > +        uint32_t v = AV_RN32A(srcPtr2 + j); \
> > +        AV_WN32A(dstPtr2 + j, v << shift); \
> > +    } \
> > +    length &= 1;
> > +#endif
> 
> these look wrong for the shiftonly==0 case

Oops, sorry, I went back and forth a few times on how to handle that case,
and at some point the condition was lost.
The code is not meant to handle shiftonly==0 because
a) The case I was looking at (MPlayer) never uses it
b) It needs an extra "and" compared to the non-SIMDified version,
which means that on 32 bit it tends not to be significantly faster, at
least for some compiler/compiler-option combinations (for example,
when compiling with 4.6 for Atom the loop won't be unrolled, so there is
lots of loop overhead, whereas when compiling for k8 it will be
unrolled and prefetch added...).
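
To make the point about the extra "and" concrete, here is a minimal standalone sketch, not the patch itself. It assumes that the shiftonly==0 path fills the low bits by replicating the top bits of the sample, i.e. (v << up) | (v >> down), that samples are stored in native endianness, and that no sample exceeds src_depth bits. The names copy_up_scalar and copy_up_packed are made up for this illustration, and memcpy stands in for the AV_RN64A/AV_WN64A aligned accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference (hypothetical helper): one 16-bit sample at a time. */
static void copy_up_scalar(uint16_t *dst, const uint16_t *src, size_t n,
                           unsigned src_depth, unsigned dst_depth,
                           int shiftonly)
{
    unsigned up   = dst_depth - src_depth;      /* left shift amount        */
    unsigned down = 2 * src_depth - dst_depth;  /* shift for bit replication */

    assert(src_depth < dst_depth && dst_depth <= 16 &&
           dst_depth <= 2 * src_depth);
    for (size_t j = 0; j < n; j++) {
        unsigned v = src[j];
        dst[j] = shiftonly ? (v << up) : ((v << up) | (v >> down));
    }
}

/* Packed variant (hypothetical helper): four samples per 64-bit word. */
static void copy_up_packed(uint16_t *dst, const uint16_t *src, size_t n,
                           unsigned src_depth, unsigned dst_depth,
                           int shiftonly)
{
    unsigned up   = dst_depth - src_depth;
    unsigned down = 2 * src_depth - dst_depth;
    /* Low `up` bits set in each of the four 16-bit lanes. */
    uint64_t lane_mask = ((1u << up) - 1) * 0x0001000100010001ULL;
    size_t j = 0;

    assert(src_depth < dst_depth && dst_depth <= 16 &&
           dst_depth <= 2 * src_depth);
    for (; j + 4 <= n; j += 4) {
        uint64_t v, out;
        memcpy(&v, src + j, sizeof(v));
        /* The left shift never crosses a lane boundary, because each
         * sample uses at most src_depth bits. */
        out = v << up;
        /* The right shift drags bits of the neighbouring lane into the
         * top of each lane, so they must be masked off: the extra "and". */
        if (!shiftonly)
            out |= (v >> down) & lane_mask;
        memcpy(dst + j, &out, sizeof(out));
    }
    /* Leftover samples that do not fill a whole 64-bit word. */
    copy_up_scalar(dst + j, src + j, n - j, src_depth, dst_depth, shiftonly);
}
```

With src_depth 9 and dst_depth 16, a full-scale sample 511 becomes 0xFF80 in the shift-only case and 0xFFFF with replication. Both variants should produce identical output for any input that respects the depth limit.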