[FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

chen chenm003 at 163.com
Wed Dec 4 13:41:23 EET 2019



At 2019-12-04 16:51:52, "Paul B Mahol" <onemda at gmail.com> wrote:
>On 12/4/19, Song, Ruiling <ruiling.song at intel.com> wrote:
>>> -----Original Message-----
>>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>>> chen

>>> >> At 2019-12-03 15:52:07, xujunzz at sjtu.edu.cn wrote:
>>> >> >From: Xu Jun <xujunzz at sjtu.edu.cn>
>>> >[...]
>>> >> >+
>>> >> >+        cvtdq2ps m4, m4
>>> >> >+        mulps m4, m0     ; sum *= rdiv
>>> >> >+        addps m4, m1     ; sum += bias
>>> >>
>>> >> >+        addps m4, m5     ; sum += 0.5
>>> >> I don't know how about precision mismatch if we pre-compute (bias+0.5)
>>>
>>> >I think it is hard to prove it is safe to do pre-compute.
>>> Agree, I also worried precision issue since float operator is execute
>>> order
>>> dependent.
>>> How about ROUNDPS?

>> Seems no exactly match.
Odd. I suspect a different issue, such as a wrong value in the instruction's imm field.
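
To make the two concerns concrete, here is a scalar C sketch. It is not the patch's code: the function names and the test values are made up for illustration, and the values are chosen to expose the effect rather than taken from real pixel data. It shows that folding 0.5 into bias can change the result because float addition is order dependent, and that round-to-nearest-even (what ROUNDPS does when its imm rounding field selects "nearest") differs from add-0.5-then-truncate on exact ties.

```c
/* Hypothetical scalar model of the rounding steps, non-negative inputs only. */

/* sum*rdiv + bias, then + 0.5, then truncate (order used in the patch). */
static int round_separate(float sum, float rdiv, float bias)
{
    volatile float v = sum * rdiv;   /* volatile forces float rounding per step */
    v = v + bias;
    v = v + 0.5f;
    return (int)v;                   /* truncation, like cvttps2dq */
}

/* Same, but with (bias + 0.5) computed once up front. */
static int round_precomputed(float sum, float rdiv, float bias)
{
    volatile float half_bias = bias + 0.5f;
    volatile float v = sum * rdiv;
    v = v + half_bias;
    return (int)v;
}

/* Add 0.5 and truncate: exact ties round up. */
static int add_half_trunc(float x)
{
    return (int)(x + 0.5f);
}

/* Scalar model of round-to-nearest-even: exact ties go to the even integer. */
static int round_half_even(float x)
{
    int   i    = (int)x;
    float frac = x - (float)i;
    if (frac > 0.5f || (frac == 0.5f && (i & 1)))
        i++;
    return i;
}
```

With sum = 8388608 (2^23), rdiv = 1 and bias = 0.5, round_separate() gives 8388608 but round_precomputed() gives 8388609. For x = 2.5, add_half_trunc() gives 3 while round_half_even() gives 2, so a nearest-mode ROUNDPS would not match the existing code on ties.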


>>> >> >+        cvttps2dq m4, m4
>>> >> >+        packssdw m4, m4
>>> >> >+        packuswb m4, m4
>>> >> >+        movss [dstq + dst_offq], m4
>>> >> >+        add c_offq, mmsize/4
>>> >> >+        add dst_offq, mmsize/4
>>> >> >+
>>> >> >+        add off16q, mmsize/4
>>> >> >+        cmp off16q, widthq
>>> >> >+        jl .loop16
>>> >> >+
>>> >> >+    add widthq, rq
>>> >> >+    cmp off16q, widthq
>>> >> >+    jge .paraend
>>> >> >+
>>> >>
>>> >> >+    .loopr:
>>> >> no idea about this loop, if we can read beyond, we can reuse above
>>> >> SIMD
>>> >> code
>>> >Reuse above SIMD code may write to the memory that does not belong to
>>> this slice-thread.
>>>
>>> >IMO, the code to handle remainder columns is still necessary.
>>>
>>>
>>> Depends on algorithm & size,
>>> For example width=23
>>> Process #0 [0:15]
>>> Process #1 [7:22]
>>> Both of them is multiple of 16
>> Sounds interesting. But FFmpeg does not do like this now.
>> One question is will this get a penalty for writing to same address of
>> memory (both are writing to 7-15) from different threads?
>
>Yes, and even bad results may happen.

>
My fault, I did not explain this clearly: each "Process #x" is one iteration of the loop, not a separate thread.
I assume the function call itself is atomic, so we would not have several threads working on the same address range.
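
A scalar C sketch of the width=23 idea, under some assumptions: all names are hypothetical, a trivial invert kernel stands in for the SIMD body, dst and src do not alias, and one call owns its whole row range. The last block is simply moved back so it ends at the row end, which means the overlapped pixels are written twice with the same value.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 16   /* pixels handled per SIMD-sized step (assumed) */

/* Stand-in for the SIMD body: always processes exactly BLOCK pixels. */
static void process_block(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] = (uint8_t)(255 - src[i]);
}

/* Returns 0 on success, -1 if the row is narrower than one block
 * (the caller would then need a scalar fallback). */
static int process_row(uint8_t *dst, const uint8_t *src, int width)
{
    int x = 0;

    if (width < BLOCK)
        return -1;

    /* Full blocks: for width=23 this covers [0:15]. */
    for (; x + BLOCK <= width; x += BLOCK)
        process_block(dst + x, src + x);

    /* Remainder: one more block ending at the row end, [7:22] for width=23. */
    if (x < width)
        process_block(dst + width - BLOCK, src + width - BLOCK);

    return 0;
}

/* Self-check: returns 1 if every pixel is correct and the byte just past
 * the row was not written. Supports BLOCK <= width <= 64. */
static int row_matches_reference(int width)
{
    uint8_t src[64], dst[65];

    if (width < BLOCK || width > 64)
        return 0;
    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t)(i * 3);
    memset(dst, 0xAA, sizeof(dst));

    if (process_row(dst, src, width) < 0)
        return 0;
    for (int i = 0; i < width; i++)
        if (dst[i] != (uint8_t)(255 - src[i]))
            return 0;
    return dst[width] == 0xAA;
}
```

row_matches_reference(23) exercises exactly the [0:15] / [7:22] case. For an in-place filter (dst == src) the overlap would read already-modified pixels, so this trick would not apply there.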


