[FFmpeg-devel] [PATCH 1/5] avutil: add pixelutils API

Tue Aug 5 21:13:24 CEST 2014

On Sun, Aug 03, 2014 at 12:36:19AM +0200, Michael Niedermayer wrote:
> On Sat, Aug 02, 2014 at 11:34:07PM +0200, Clément Bœsch wrote:
[...]
> > +#ifdef TEST
> > +#define W1 320
> > +#define H1 240
> > +#define W2 640
> > +#define H2 480
> > +int main(void)
> > +{
> > +    int i, a, ret = 0;
> > +    DECLARE_ALIGNED(32, uint32_t, buf1)[W1*H1];
> > +    DECLARE_ALIGNED(32, uint32_t, buf2)[W2*H2];
> > +    uint32_t state = 0;
> > +
> > +    for (i = 0; i < W1*H1; i++) {
> > +        buf1[i] = state;
> > +        state = state * 1664525 + 1013904223;
> > +    }
> > +
> > +    for (i = 0; i < W2*H2; i++) {
> > +        buf2[i] = state;
> > +        state = state * 1664525 + 1013904223;
> > +    }
> 
> the code should in addition be tested with maximal and minimal
> difference cases
> 

Tests added.
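
For readers following along, the two extreme cases can be sketched roughly as below. This is only an illustration and not the code from the patch: sad_ref() and check_extremes() are hypothetical names, and the scalar loop stands in for whichever SAD function is under test. The idea is that an all-0x00 block against an all-0xff block must give 255*w*h, and a block against itself must give 0.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scalar reference SAD; not the FFmpeg implementation. */
static int sad_ref(const uint8_t *src1, ptrdiff_t stride1,
                   const uint8_t *src2, ptrdiff_t stride2,
                   int w, int h)
{
    int x, y, sum = 0;

    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++)
            sum += abs(src1[x] - src2[x]);
        src1 += stride1;
        src2 += stride2;
    }
    return sum;
}

/* Hypothetical helper: returns 0 if both extreme cases give the expected
 * SAD, 1 otherwise (including unsupported block sizes). */
static int check_extremes(int w, int h)
{
    uint8_t lo[16 * 16], hi[16 * 16]; /* enough for blocks up to 16x16 */
    int ret = 0;

    if (w < 1 || h < 1 || w > 16 || h > 16)
        return 1;

    memset(lo, 0x00, sizeof(lo));
    memset(hi, 0xff, sizeof(hi));

    /* maximal difference: every pixel pair differs by 255 */
    if (sad_ref(lo, w, hi, w, w, h) != 255 * w * h)
        ret = 1;

    /* minimal difference: a block compared against itself */
    if (sad_ref(lo, w, lo, w, w, h) != 0)
        ret = 1;

    return ret;
}
```

The maximal case is the one most likely to expose accumulator overflow in a SIMD version, since 255*16*16 = 65280 is close to the 16-bit limit.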

> 
> [...]
> > +;-------------------------------------------------------------------------------
> > +; int ff_pixelutils_sad_[au]_16x16_sse2(const uint8_t *src1, ptrdiff_t stride1,
> > +;                                       const uint8_t *src2, ptrdiff_t stride2);
> > +;-------------------------------------------------------------------------------
> > +%macro SAD_XMM_16x16 1
> > +INIT_XMM sse2
> > +cglobal pixelutils_sad_%1_16x16, 4,4,3, src1, stride1, src2, stride2
> > +    pxor        m2, m2
> > +%rep 8
> > +    mov%1       m0, [src2q]
> > +    mov%1       m1, [src2q + stride2q]
> > +    psadbw      m0, [src1q]
> > +    psadbw      m1, [src1q + stride1q]
> > +    paddw       m2, m0
> > +    paddw       m2, m1
> > +    lea         src1q, [src1q + 2*stride1q]
> > +    lea         src2q, [src2q + 2*stride2q]
> > +%endrep
> > +    movhlps     m0, m2
> > +    paddw       m2, m0
> > +    movd        eax, m2
> > +    RET
> > +%endmacro
> 
> there are various improvements possible, though these should be in
> a separate patch and not in gcc->yasm but
> the pxor can be avoided by lifting the first iteration out and
> using m2 as destination
> 
> it might be faster to use 2 accumulator registers as that way both
> could execute with no dependencies on the other
> 
> as you unroll the loop, addressing can be done with fewer instructions
> 

I left the ASM as is, since it is fairly simple and mirrors the API
itself; we can iterate from here with benchmarks.
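
As a starting point for that iteration, the first two suggestions (lifting the first iteration out so no zeroing is needed, and using two independent accumulators) might look roughly like this. It is a scalar C sketch and not the asm: row_sad16() is a hypothetical helper standing in for one psadbw on a 16-byte row, and sad16x16_two_acc() is an illustrative name. Whether this is actually faster would have to be benchmarked.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: SAD of one 16-pixel row. */
static int row_sad16(const uint8_t *a, const uint8_t *b)
{
    int x, sum = 0;

    for (x = 0; x < 16; x++)
        sum += abs(a[x] - b[x]);
    return sum;
}

/* Illustrative 16x16 SAD with two accumulators. */
static int sad16x16_two_acc(const uint8_t *src1, ptrdiff_t stride1,
                            const uint8_t *src2, ptrdiff_t stride2)
{
    /* First pair of rows lifted out of the loop: the accumulators are
     * initialised from the first results, so no zeroing is required. */
    int acc0 = row_sad16(src1,           src2);
    int acc1 = row_sad16(src1 + stride1, src2 + stride2);
    int i;

    /* Remaining 7 row pairs; acc0 and acc1 do not depend on each other. */
    for (i = 1; i < 8; i++) {
        src1 += 2 * stride1;
        src2 += 2 * stride2;
        acc0 += row_sad16(src1,           src2);
        acc1 += row_sad16(src1 + stride1, src2 + stride2);
    }

    /* Single merge at the end. */
    return acc0 + acc1;
}
```

Even rows go to acc0 and odd rows to acc1, so the two add chains can proceed in parallel and are only combined once at the end.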

> LGTM otherwise
> 

Patchset applied, thanks

[...]

-- 
Clément B.