[FFmpeg-devel] libavcodec/exr : add SSE SIMD for reorder_pixels v2 (WIP)

Martin Vignali martin.vignali at gmail.com
Mon Sep 4 00:26:43 EEST 2017


Thanks Ivan for your comments and explanations,

> > [...]
> > +;******************************************************************************
> > +
> > +%include "libavutil/x86/x86util.asm"
> Still missing explicit x86inc.asm

If I include x86inc.asm instead of x86util.asm, I get a linker error (it seems
that the function prefix becomes x264 instead of ff).

> > +
> > +    shr                sizeq, 1;       sizeq = half_size
> > +    mov                   r3, sizeq
> > +    shr                   r3, 4;       r3 = half_size/16 -> loop_simd count
> > +
> > +loop_simd:
> > +;initial condition loop
> > +    jle      after_loop_simd;          jump to scalar part if loop_simd count(r3) is 0
> > +
> > +    movdqa                m0, [srcq];           load first part
> > +    movdqu                m1, [srcq + sizeq];   load second part
> Would you test if moving the movdqu first makes any difference in speed?
> I had a similar case and I think that makes it faster,
> since movdqu has bigger latency.
> Might not matter on newer cpu.
> (If you can't tell the difference, leave it as it is.)

I didn't notice any speed difference.

For the rest of your comments:

You're right, I can remove the scalar part.
The src and dst buffers seem to be padded to 32 bytes by av_fast_padded_malloc,
so for the SSE version that should be enough to avoid overreading or overwriting.
An AVX2 version will need more care, though.
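
As a rough illustration of the padding argument (a sketch, not code from the
patch), the overshoot of a loop that works in whole blocks can be modelled in
plain C. The helper names simd_overread and simd_overwrite are hypothetical,
and the model assumes the loop starts at the beginning of each half and of dst
and always runs a whole number of blocks:

```c
#include <stddef.h>

/* Round n up to the next multiple of block (block > 0). */
static size_t round_up(size_t n, size_t block)
{
    return (n + block - 1) / block * block;
}

/* Hypothetical helper: bytes read past the end of a size-byte source buffer
 * when each half is loaded in whole blocks of `block` bytes. */
static size_t simd_overread(size_t size, size_t block)
{
    size_t half = size / 2;
    size_t end  = half + round_up(half, block); /* one past the last byte loaded */
    return end > size ? end - size : 0;
}

/* Hypothetical helper: bytes stored past the end of a size-byte destination.
 * Every block loaded from each half produces 2 * block output bytes. */
static size_t simd_overwrite(size_t size, size_t block)
{
    size_t written = 2 * round_up(size / 2, block);
    return written > size ? written - size : 0;
}
```

Under these assumptions, simd_overread(2, 16) is 15 and simd_overwrite(2, 16)
is 30, both inside a 32-byte padding, while simd_overwrite(2, 32) is 62, which
is why a 32-byte-wide version would need extra care.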

I also modified the loop, following your comments.
I offset src and src2 by half_size, and dst by 2*half_size, so I can
remove some add/sub instructions,

and I use half_size * -1 as the running offset for src, src2 and dst.
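
To show the idea of the negative offset in isolation, here is a scalar C model
of that pointer layout. It is only a sketch: the function name
reorder_pixels_neg_index is hypothetical, and it handles one byte per step
where the asm handles one register per step.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model: all three pointers are moved to the END of their region and
 * a single negative index counts up towards zero, so the loop needs only one
 * increment and one comparison against 0. */
static void reorder_pixels_neg_index(const uint8_t *src, uint8_t *dst,
                                     ptrdiff_t size)
{
    ptrdiff_t half    = size / 2;
    const uint8_t *s1 = src + half;     /* end of the first half  */
    const uint8_t *s2 = src + 2 * half; /* end of the second half */
    uint8_t *d        = dst + 2 * half; /* end of the output      */

    for (ptrdiff_t i = -half; i < 0; i++) {
        d[2 * i]     = s1[i]; /* byte from the first half  */
        d[2 * i + 1] = s2[i]; /* byte from the second half */
    }
}
```

For a 6-byte input {1,2,3,4,5,6} this writes {1,4,2,5,3,6} to dst.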

The current asm version is below (still WIP, but it passes the FATE tests for me).
I still need to check the maximum overread and overwrite for several size values.

%include "libavutil/x86/x86util.asm"


; void ff_reorder_pixels(uint8_t *src, uint8_t *dst, int size)

cglobal reorder_pixels, 3,5,3, src, dst, size

    add                     dstq, sizeq;    offset dstq by 2* half_size

    shr                    sizeq, 1;       sizeq = half_size
    mov                       r3, sizeq;   r3 = half_size

    add                     srcq, r3;      offset src by half_size
    mov                       r4, srcq;    r4 is the start of the second part of the buffer
    add                       r4, r3;      offset r4 by half_size

    neg                       r3;          r3 = half_size * -1 (offset of dst, src, src2 (r4))

loop_simd:
;initial condition loop
    jge                       end;         leave the loop once the negative offset (r3) reaches 0

    movdqa                     m0, [srcq+r3];        load first part
    movdqu                     m1, [r4 +r3] ;        load second part

    punpcklbw                  m2,  m0, m1;          interleaved part 1
    movdqa            [dstq+r3*2], m2;               copy to dst array

    punpckhbw                  m0, m1;               interleaved part 2
    movdqa     [dstq+r3*2+mmsize], m0;               copy to dst array

    add                        r3, mmsize;           flags from this add are tested by jge above
    jmp                 loop_simd

end:
    RET


Performance currently looks like this:

Scalar :
3082024 decicycles in reorder_pixels_zip,  130413 runs,    659 skips
bench: utime=115.926s
bench: maxrss=607670272kB

SSE asm
296370 decicycles in reorder_pixels_zip,  130946 runs,    126 skips
bench: utime=101.481s
bench: maxrss=607698944kB

SSE Intrinsics
289448 decicycles in reorder_pixels_zip,  130944 runs,    128 skips
bench: utime=101.417s
bench: maxrss=607694848kB

After looking at the asm code generated by clang from the intrinsics
version (at -O2),

it seems that clang modifies the loop_simd part to process twice as many
bytes per iteration
(and adds a condition to handle an odd half_size).

I will run some tests to see whether the same method gives a speed
improvement in the asm version.
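
For reference, the loop shape described above can be sketched in C roughly as
follows. This is a guess at the structure, not the actual clang output: the
names zip16 and reorder_pixels_unrolled are hypothetical, "odd half_size" is
read here as an odd number of 16-byte blocks, and half_size is assumed to be a
multiple of 16. A scalar fallback is included so the sketch also builds
without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Interleave 16 bytes from a with 16 bytes from b into 32 bytes at dst. */
static void zip16(const uint8_t *a, const uint8_t *b, uint8_t *dst)
{
#if defined(__SSE2__)
    __m128i lo = _mm_loadu_si128((const __m128i *)a);
    __m128i hi = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst,        _mm_unpacklo_epi8(lo, hi));
    _mm_storeu_si128((__m128i *)(dst + 16), _mm_unpackhi_epi8(lo, hi));
#else
    for (int k = 0; k < 16; k++) {
        dst[2 * k]     = a[k];
        dst[2 * k + 1] = b[k];
    }
#endif
}

/* Assumption: half_size is a multiple of 16. */
static void reorder_pixels_unrolled(const uint8_t *src, uint8_t *dst,
                                    ptrdiff_t half_size)
{
    const uint8_t *a = src;             /* first half  */
    const uint8_t *b = src + half_size; /* second half */
    ptrdiff_t i = 0;

    /* Main loop: two 16-byte blocks of each half per iteration. */
    for (; i + 32 <= half_size; i += 32) {
        zip16(a + i,      b + i,      dst + 2 * i);
        zip16(a + i + 16, b + i + 16, dst + 2 * i + 32);
    }

    /* Tail: one leftover block when the block count is odd. */
    if (i < half_size)
        zip16(a + i, b + i, dst + 2 * i);
}
```

The unaligned load and store intrinsics are used only to keep the sketch free
of alignment requirements; whether unrolling helps the hand-written asm is
exactly what still has to be measured.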



