[FFmpeg-devel] [PATCH] lavfi/removegrain: add x86 and x86_64 SSE2 functions

James Darnley james.darnley at gmail.com
Wed Jul 15 01:14:27 CEST 2015


On 2015-07-14 23:23, James Almer wrote:
> On 14/07/15 3:54 PM, James Darnley wrote:
>> On 2015-07-11 18:34, James Almer wrote:
>>> On 11/07/15 10:40 AM, James Darnley wrote:
>>>> new file mode 100644
>>>> index 0000000..5e1feea
>>>> --- /dev/null
>>>> +++ b/libavfilter/x86/vf_removegrain.asm
>>>> @@ -0,0 +1,1215 @@
>>>> +;*****************************************************************************
>>>> +;* x86-optimized functions for the removegrain filter
>>>> +;*
>>>> +;* Copyright (C) 2015 James Darnley
>>>> +;*
>>>> +;* This file is part of FFmpeg.
>>>> +;*
>>>> +;* TODO: gpl text goes here.
>>>> +;*****************************************************************************
>>>> +
>>>> +; column: -1  0 +1
>>>> +; row -1: a1 a2 a3
>>>> +; row  0: a4  c a5
>>>> +; row +1: a6 a7 a8
>>>> +
>>>> +%include "libavutil/x86/x86util.asm"
>>>> +
>>>> +SECTION_RODATA
>>>
>>> Declare it SECTION_RODATA 32 now so you don't forget when you add AVX2 versions.
>>
>> Okay, I'm confused by this.  It doesn't align the constants you define
>> on 32-byte boundaries if you happen to define the data incorrectly, so
>> why does this matter?  Or what does it do?
>>
> 
> SECTION_RODATA 32 translates into "SECTION .rodata align=32", which according to the 
> documentation makes sure the constants are aligned on a 32-byte boundary.
> I noticed it doesn't do much if the constants are "out of order", if you will, but if
> you're defining them right and you're going to use 32-byte movas with them, then I
> don't see a reason not to declare the section properly.

I will change it.  Perhaps I misunderstood its use.  My impression now
is that it instructs the linker to maintain the alignment of the section,
not that the assembler should align the labels in any particular fashion.

If the user (me) defines some data to be 10 bytes long, it is the user's
fault that the next constant begins at offset 0xA.
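For reference, the effect under discussion can be modelled in C: a section-level align=32 is what lets an aligned 32-byte load (vmovdqa) on a constant avoid faulting, provided each constant is itself padded out to the boundary.  The constant name below is illustrative, not from the patch.

```c
#include <stdalign.h>
#include <stdint.h>

/* C analogue of a constant placed in a SECTION_RODATA 32 section: the
 * storage starts on a 32-byte boundary.  As noted above, if a preceding
 * constant were only 10 bytes long, the next label would land at an
 * unaligned offset -- section alignment fixes where the section starts,
 * not how each label inside it is padded. */
static alignas(32) const uint16_t pw_example[16] = {
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
};

/* An aligned 32-byte load such as vmovdqa requires this to return 1. */
static int is_32_aligned(const void *p)
{
    return ((uintptr_t)p & 31) == 0;
}
```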

>>>> +
>>>> +; The loop doesn't need to do all the iterations.  It could stop when the right
>>>> +; pixels are in the right registers.
>>>> +%macro SORT_SQUARE 0
>>>> +    %assign k 7
>>>> +    %rep 7
>>>> +        %assign i 1
>>>> +        %assign j 2
>>>> +        %rep k
>>>> +            SORT_PAIR m %+ i , m %+ j , m9
>>>> +            %assign i i+1
>>>> +            %assign j j+1
>>>> +        %endrep
>>>> +        %assign k k-1
>>>> +    %endrep
>>>> +%endmacro
>>>> +
>>>> +; %1 dest simd register
>>>> +; %2 source (simd register/memory)
>>>> +; %3 temp simd register
>>>> +%macro ABS_DIFF 3
>>>> +    mova %3, %2
>>>> +    psubusb %3, %1
>>>> +    psubusb %1, %2
>>>> +    por %1, %3
>>>> +%endmacro
>>>> +
>>>> +; %1 dest simd register
>>>> +; %2 source (simd register/memory)
>>>> +; %3 temp simd register
>>>> +%macro ABS_DIFF_W 3
>>>> +    mova %3, %2
>>>> +    psubusw %3, %1
>>>> +    psubusw %1, %2
>>>> +    por %1, %3
>>>> +%endmacro
>>>
>>> No way to achieve this using the pabs* ssse3 instructions?
>>
>> Maybe.  I looked them up and see that I would need a subtraction anyway.
>>  I wonder if I could measure a speed change.  I need to check the
>> behaviour of them together with the subtract instructions.
> 
> For that matter, since in many cases you're calling two or more ABS_DIFF_W one after
> the other, you could combine the instructions to minimize dependencies regardless of
> using pabs or psubs + por like you're doing right now.
> It may end up being faster that way.

You're probably right.  I will try them and send a separate patch later.
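As a scalar model of what ABS_DIFF computes per lane (a sketch, not code from the patch): the unsigned absolute difference falls out of two saturating subtractions OR'd together, since one of the two is always zero.  ABS_DIFF_W is the same idea on 16-bit lanes.

```c
#include <stdint.h>

/* Per-lane model of ABS_DIFF: psubusb saturates at 0, so one of the two
 * saturated differences is always 0 and the other is |a - b|; por then
 * merges them.  ABS_DIFF_W does the same on words with psubusw. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;   /* psubusb, one lane */
}

static uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return sat_sub_u8(a, b) | sat_sub_u8(b, a);   /* por */
}
```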

>>>> +; %1 simd register that holds the mask and will hold the result
>>>> +; %2 simd register that holds the "true" values
>>>> +; %3 location of the "false" values (simd register/memory)
>>>> +%macro BLEND 3 ; mask, true, false
>>>> +    pand %2, %1
>>>> +    pandn %1, %3
>>>> +    por %1, %2
>>>> +%endmacro
>>>
>>> This doesn't follow the sse4 blend instructions' semantic, so it will make adding AVX2
>>> versions harder.
>>>
>>> Try instead
>>>
>>> %macro BLEND 3 ; false, true, mask
>>>     pand      %2, %3
>>>     pandn     %3, %1
>>>     por       %3, %2
>>>     SWAP      %1, %3
>>> %endmacro
>>>
>>> Which will let you use "vpblendvb %1, %1, %2, %3" for the avx2 version.
>>> SSE4's pblendvb is kinda crappy, requiring the mask to be in xmm0, so
>>> adding an SSE4 version may not be worth it.
>>
>> I applied your patches that you sent on IRC (this and the sort macro
>> one).  Do you want me to put your name in the copyright header?
>>
> 
> Not really.  Small changes like those are IMO not enough to get credit in the header.
> If you want, you can add a "With contributions by" line in the commit message.

Okay.
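For reference, both BLEND variants compute the same per-lane bit-select; the operand-order change only matters for mapping cleanly onto vpblendvb later.  A scalar sketch of that select (function name is illustrative):

```c
#include <stdint.h>

/* Per-lane model of the pand/pandn/por sequence:
 *     result = (true & mask) | (false & ~mask)
 * With the suggested operand order (false, true, mask), this is exactly
 * the selection vpblendvb performs when mask bytes are 0x00 or 0xff. */
static uint8_t blend_u8(uint8_t f, uint8_t t, uint8_t mask)
{
    return (uint8_t)((t & mask) | (f & (uint8_t)~mask));
}
```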

>>>> +av_cold void ff_removegrain_init_x86(RemoveGrainContext *rg)
>>>> +{
>>>> +    int cpu_flags = av_get_cpu_flags();
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < rg->nb_planes; i++) {
>>>> +        switch (rg->mode[i]) {
>>>> +            case 1:
>>>> +                if (EXTERNAL_SSE2(cpu_flags))
>>>
>>> I think it will be much cleaner and simpler if you put the switch inside a single
>>> EXTERNAL_SSE2 branch, rather than checking it for every mode.
>>
>> You're probably right.  I was going to see what it looks like after AVX2
>> and clean it up then but I can squash some changes into this patch if
>> people prefer it.

Done anyway.
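The restructured init then checks the CPU flag once rather than once per mode.  A self-contained mock of that shape (the flag macro, function pointers, and struct layout below are stand-ins, not FFmpeg's actual definitions):

```c
#include <stddef.h>

/* Mock of the suggested dispatch: one EXTERNAL_SSE2 branch wrapping the
 * per-plane switch, instead of repeating the check inside every case. */
typedef void (*rg_fn)(void);

static void mode1_sse2(void) {}
static void mode2_sse2(void) {}

#define MOCK_CPU_FLAG_SSE2 1
#define EXTERNAL_SSE2(flags) ((flags) & MOCK_CPU_FLAG_SSE2)

typedef struct {
    int   nb_planes;
    int   mode[4];
    rg_fn fl[4];
} MockRemoveGrainContext;

static void removegrain_init_x86(MockRemoveGrainContext *rg, int cpu_flags)
{
    int i;

    if (!EXTERNAL_SSE2(cpu_flags))
        return;                         /* single check for all modes */

    for (i = 0; i < rg->nb_planes; i++) {
        switch (rg->mode[i]) {
        case 1: rg->fl[i] = mode1_sse2; break;
        case 2: rg->fl[i] = mode2_sse2; break;
        /* ... one case per implemented mode ... */
        }
    }
}
```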

