[FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)
Ivan Kalvachev
ikalvachev at gmail.com
Sun Aug 6 02:23:02 EEST 2017
On 8/5/17, Martin Vignali <martin.vignali at gmail.com> wrote:
> Hello,
>
> Based on pull request in the official openexr library :
> https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf
>
> i try to port the reorder part (used in zip and rle decompression), to asm
> Tested on OS X
> pass fate test for me
>
> I test with this command :
> ./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null -;./ffmpeg -i
> ./fate-suite/exr/rgba_slice_zip16.exr -f null - -cpuflags 0
>
> the result of the timer are :
> scalar : 960355 decicycles in reorder_pixels_zip, 64 runs, 0
> skips
> sse : 84763 decicycles in reorder_pixels_zip, 64 runs, 0
> skips
>
> Comments Welcome
>
> Martin
> Jokyo Images
64 runs seems way too few runs for reliable result.
Please try to run ffmpeg encoding for about a minute.
About the patch:
> +%include "libavutil/x86/x86util.asm"
> +
> +SECTION .text
Please include "x86inc.asm" explicitly.
> +INIT_XMM sse2
> +cglobal reorder_pixels, 3,5,3, src, dst, size
> +
> +shr r2, 1;half_size
It's better to use the labels you define in the cglobal,
If I count correctly, "r2" would be "sizeq".
("srcq" and "dstq" would be the correct labels for the pointers).
> +;calc loop count for simd reorder
> +mov r3, r2;
> +shr r3, 4;calc loop count simd
Align the instruction at 4 spaces.
Align the first operands so that the ending commas are vertically aligned.
For the follow up operands, just add one space after the comma.
Put some spaces before the comment, to separate it from the operands.
BTW, this is not C, you don't need to end every line with ";"
e.g.
> + mov r3, r2
> + shr r3, 4 ;calc loop count simd
> +;jump to afterloop2 if loop count simd is 0
> +cmp r3, 0;
> +jle afterloop2;
If you only check for size==0, then
the "shr" has already set the correct Z flag.
However, if r3/size==1, you'd still read
16 bytes at once in the first loop.
Aka, overread.
> +;simd loop
> +loop1:
> +movdqa xmm0, [r0];load first 16 bytes
Use "m0" instead.
> +lea r4, [r0 + r2];
> +
> +movdqu xmm1, [r4]; unaligned load
r4 doesn't seem to be used for anything else.
Just move the address calculation directly into
the "movdqu", it can take it.
> +movdqa xmm2, xmm0;copy xmm0
> +
> +punpcklbw xmm2, xmm1;
> +movdqa [r1], xmm2;
> +add r1, 16;
use "mmsize" instead of 16.
> +movdqa xmm2, xmm0;
> +punpckhbw xmm2, xmm1;
> +movdqa [r1], xmm2;
> +add r1, 16;
You can actually avoid one of the "add"
if you use [r1+mmsize] and add 32 at the second
"add" (or 2*mmsize).
> +dec r3;
> +add r0, 16;
> +; test repeat
> +cmp r3, 0;
> +jge loop1;
First, "dec" instructions are avoided,
because they do a partial update
on the flag register and
this causes interdependence.
Second, you can use the flags from
the "sub" to check if it has reached
zero or has gone negative.
> +afterloop2:
> +;scalar part
> +
> +mov r3, r2;
> +and r3, 15 ;half_size % 16
> +lea r4, [r0 + r2];
> +
> +;initial condition loop
> +cmp r3, 0;
> +jle end;
> +
> +loop2:
> +mov r1, r0;
> +inc r1;
> +mov r1, r4;
> +inc r1;
> +inc r0;
> +inc r4;
> +dec r3;
> +; test repeat
> +cmp r3, 0;
> +jg loop2;
O_o
This loop does not read or write to memory.
You can replace the second "cmp" in this
loop with unconditional jump to the
"initial condition loop" compare.
(aka move loop2: two instruction above).
This is how "for(;;)" is usually implemented in C compilers.
Be sure to test your function with different sizes,
to salt your output buffers and check for underwrites, overwrites
etc...
Best Regards.
More information about the ffmpeg-devel
mailing list