[FFmpeg-devel] swscale/rgb2rgb : add X86_64 SIMD (SSSE3 and AVX2) for shuffle_bytes func

James Almer jamrial at gmail.com
Sun Mar 18 18:27:21 EET 2018


On 3/18/2018 1:23 PM, Martin Vignali wrote:
> 2018-03-18 16:49 GMT+01:00 James Almer <jamrial at gmail.com>:
> 
>> On 3/18/2018 12:08 PM, Martin Vignali wrote:
>>> 2018-03-03 18:20 GMT+01:00 Martin Vignali <martin.vignali at gmail.com>:
>>>
>>>> Hello,
>>>>
>>>> The attached patch adds SIMD for the 5 shuffle_bytes functions in rgb2rgb.
>>>> The new SIMD is written using external ASM.
>>>>
>>>> It also adds a checkasm test for these functions.
>>>> Restricted to x86_64, because the scalar part doesn't compile on x86_32.
>>>>
>>>> For the scalar part, I assume that the src_size value is a multiple of 4
>>>> (because the shuffle works on groups of 4 bytes).
>>>>
>>>> Passes the FATE tests on X86_64 and X86_32 (os 10.12)
>>>>
>>>>
>>>>
>>>>
>>> New patches attached:
>>> - Now compiles on x86_32 and x86_64
>>> - Adds a cosmetic patch to put all shuffle_bytes declarations in the same place
>>>
>>> Tested on X86_64 and X86_32 (os 10.12)
>>>
>>> Checkasm result :  ./tests/checkasm/checkasm --test=sw_rgb --bench
>>>
>>> checkasm: using random seed 292997963
>>> MMX:
>>>  - sw_rgb.shuffle_bytes_2103 [OK]
>>> MMXEXT:
>>>  - sw_rgb.shuffle_bytes_2103 [OK]
>>> SSSE3:
>>>  - sw_rgb.shuffle_bytes_2103 [OK]
>>>  - sw_rgb.shuffle_bytes_0321 [OK]
>>>  - sw_rgb.shuffle_bytes_1230 [OK]
>>>  - sw_rgb.shuffle_bytes_3012 [OK]
>>>  - sw_rgb.shuffle_bytes_3210 [OK]
>>> AVX2:
>>>  - sw_rgb.shuffle_bytes_2103 [OK]
>>>  - sw_rgb.shuffle_bytes_0321 [OK]
>>>  - sw_rgb.shuffle_bytes_1230 [OK]
>>>  - sw_rgb.shuffle_bytes_3012 [OK]
>>>  - sw_rgb.shuffle_bytes_3210 [OK]
>>> checkasm: all 12 tests passed
>>> shuffle_bytes_0321_c: 51.4
>>> shuffle_bytes_0321_ssse3: 18.7
>>> shuffle_bytes_0321_avx2: 12.7
>>> shuffle_bytes_1230_c: 126.9
>>> shuffle_bytes_1230_ssse3: 16.7
>>> shuffle_bytes_1230_avx2: 12.9
>>> shuffle_bytes_2103_c: 52.4
>>> shuffle_bytes_2103_mmx: 76.7
>>> shuffle_bytes_2103_mmxext: 197.2
>>> shuffle_bytes_2103_ssse3: 17.4
>>> shuffle_bytes_2103_avx2: 12.4
>>> shuffle_bytes_3012_c: 127.4
>>> shuffle_bytes_3012_ssse3: 14.7
>>> shuffle_bytes_3012_avx2: 12.4
>>> shuffle_bytes_3210_c: 127.4
>>> shuffle_bytes_3210_ssse3: 18.2
>>> shuffle_bytes_3210_avx2: 12.9
>>
>> These AVX2 numbers are not worth it. Some CPU archs throttle down the
>> frequency when using ymm instructions, so unless the function is
>> considerably faster than the SSE* version then it's usually not worth
>> adding.
>>
>>
> I ran the test again with a larger width (512 instead of 128).
> These are my results:
> shuffle_bytes_0321_c: 128.6
> shuffle_bytes_0321_ssse3: 41.6
> shuffle_bytes_0321_avx2: 23.4
> shuffle_bytes_1230_c: 626.4
> shuffle_bytes_1230_ssse3: 41.6
> shuffle_bytes_1230_avx2: 23.9
> shuffle_bytes_2103_c: 128.4
> shuffle_bytes_2103_mmx: 307.1
> shuffle_bytes_2103_mmxext: 224.6
> shuffle_bytes_2103_ssse3: 72.9
> shuffle_bytes_2103_avx2: 32.9
> shuffle_bytes_3012_c: 620.9
> shuffle_bytes_3012_ssse3: 40.6
> shuffle_bytes_3012_avx2: 36.1
> shuffle_bytes_3210_c: 602.6
> shuffle_bytes_3210_ssse3: 75.4
> shuffle_bytes_3210_avx2: 33.6
> 
> 
> So except for the 3012 version (I don't know why), AVX2 is around 2x faster than SSSE3.
> Do you still think the AVX2 version needs to be removed?
> 
> 
> Martin

No, those look good now.
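
As a rough check on the "around x2" estimate, the SSSE3-to-AVX2 ratios can be computed from the width-512 timings quoted above. This is only an illustrative sketch: `bench_row`, `rows_512` and `speedup` are hypothetical names that are not part of the patch or of checkasm, and the numbers are copied from the benchmark output in this thread.

```c
#include <assert.h>

/* Timings copied from the width-512 checkasm run quoted in this thread
 * (lower is better). The struct and array names are hypothetical. */
struct bench_row {
    const char *name;
    double c, ssse3, avx2;
};

static const struct bench_row rows_512[] = {
    { "0321", 128.6, 41.6, 23.4 },
    { "1230", 626.4, 41.6, 23.9 },
    { "2103", 128.4, 72.9, 32.9 },
    { "3012", 620.9, 40.6, 36.1 },
    { "3210", 602.6, 75.4, 33.6 },
};

/* Ratio of two timings: how many times faster `candidate` is than
 * `baseline`. A result above 1.0 means the candidate is faster. */
static double speedup(double baseline, double candidate)
{
    assert(candidate > 0.0);
    return baseline / candidate;
}
```

With these figures, `speedup(row.ssse3, row.avx2)` comes out at roughly 1.7 to 2.2 for 0321, 1230, 2103 and 3210, and only about 1.1 for 3012, which matches the exception noted in the quoted message.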


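For readers unfamiliar with the functions under discussion, here is a minimal scalar sketch of a 4-byte shuffle. It is not the code from the patch or from rgb2rgb: `shuffle_bytes_sketch` and its `order` parameter are hypothetical, and it assumes that a suffix such as 2103 lists, for each output position, the index of the source byte within its 4-byte group. It also relies on the precondition stated in the thread, that src_size is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not FFmpeg's implementation: reorder every 4-byte
 * group of src into dst. order[k] is the index (0..3) of the source byte
 * that is written to output position k, so {2, 1, 0, 3} corresponds to
 * the assumed meaning of the "2103" suffix. */
static void shuffle_bytes_sketch(const uint8_t *src, uint8_t *dst,
                                 size_t src_size, const uint8_t order[4])
{
    /* Precondition taken from the thread: whole 4-byte groups only. */
    assert(src_size % 4 == 0);

    for (size_t i = 0; i < src_size; i += 4) {
        dst[i + 0] = src[i + order[0]];
        dst[i + 1] = src[i + order[1]];
        dst[i + 2] = src[i + order[2]];
        dst[i + 3] = src[i + order[3]];
    }
}
```

A SIMD version performs the same permutation on 16 bytes (SSSE3) or 32 bytes (AVX2) per instruction with a byte-shuffle mask, which is why the gain over the scalar loop grows with the buffer width.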