[FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

Sat Nov 25 10:25:03 EET 2017

On Sat, Nov 18, 2017 at 06:35:48PM +0100, Rafal Dabrowa wrote:
> 
> This is a proposal of performance optimizations for 8-bit
> HEVC video decoding on the aarch64 platform with the NEON (SIMD) extension.
> 
> I'm testing my optimizations on a NanoPi M3 device, mainly using the
> "Big Buck Bunny" video file at 1280x720.
> The video file was pulled from the libde265.org page, see
> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
> The movie duration is 00:10:34.53.
> 
> Overall performance gain is about 2x. Without the optimizations, movie
> playback effectively stalls after a few seconds. With the
> optimizations, the file plays smoothly 99% of the time.
> 
> For performance testing the following command was used:
> 
>     time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
> 
> The video file was pre-read before the test to minimize disk reads during testing.
> Program execution time without the optimizations was as follows:
> 
> real	11m48.576s
> user	43m8.111s
> sys	0m12.469s
> 
> Execution time with optimizations:
> 
> real	6m17.046s
> user	21m19.792s
> sys	0m14.724s
> 

Can you post the results of checkasm --bench for hevc?

Did you run it to check for any calling convention violation?

> 
> The patch contains optimizations for the most heavily used qpel, epel, sao and
> idct functions.  Among the functions available for optimization there are two
> that are used intensively but not optimized in this patch:
> hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8. I have no idea how
> they could be optimized, hence I left them without optimizations.
> 

You may want to check x86/hevc_deblock.asm then (no idea if these are
implemented).

[...]
> +function ff_hevc_put_hevc_pel_pixels4_8_neon, export=1
> +    mov     x7, 128
> +1:  ld1     { v0.s }[0], [x1], x2
> +    ushll   v4.8h, v0.8b, 6

> +    st1     { v4.d }[0], [x0], x7

using #128 not possible?

> +    subs    x3, x3, 1
> +    b.ne    1b
> +    ret

here and below: no use of the x6 register?
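
For reference while checking the asm, here is a minimal scalar C sketch of what I understand the 4-pixel function above to compute: each 8-bit sample is widened to 16 bits and shifted left by 6, and the destination advances by 128 bytes per row. The name put_pel_pixels4_ref is hypothetical, and the 64-element row length (MAX_PB_SIZE) is an assumption inferred from the 128-byte stride in x7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed destination row length in int16_t elements (64 * 2 = 128 bytes). */
#define MAX_PB_SIZE 64

/* Scalar model of the 4-pixel-wide copy: widen each 8-bit sample to
 * 16 bits and shift left by 6, which is what "ushll ..., 6" does per lane. */
static void put_pel_pixels4_ref(int16_t *dst, const uint8_t *src,
                                ptrdiff_t srcstride, int height)
{
    assert(height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 4; x++)
            dst[x] = (int16_t)(src[x] << 6);
        src += srcstride;     /* next source row (x2 in the asm)        */
        dst += MAX_PB_SIZE;   /* 64 elements = 128 bytes (x7 in the asm) */
    }
}
```

A reference like this can be compared against the NEON output for random input, which is roughly what checkasm does.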

A few comments on the style:

- please use a consistent spacing (current function mismatches with later
  code), preferably using a relatively large number of spaces as common
  ground (check the other sources)
- we use capitalized size suffixes (B, H, ...); IIRC the lower-case
  forms are problematic with some assemblers, but don't quote me on that.
- we don't use spaces between {}

> +endfunc
> +
> +function ff_hevc_put_hevc_pel_pixels6_8_neon, export=1
> +    mov     x7, 120
> +1:  ld1     { v0.8b }, [x1], x2
> +    ushll   v4.8h, v0.8b, 6

> +    st1     { v4.d }[0], [x0], 8

I think you need to use # as prefix for the immediates

> +    st1     { v4.s }[2], [x0], x7

I assume you can't use #120?

Have you checked if using #128 here and decrementing x0 afterward isn't
faster?

[...]
> +function ff_hevc_put_hevc_pel_bi_pixels32_8_neon, export=1
> +    mov         x10, 128
> +1:  ld1         { v0.16b, v1.16b }, [x2], x3        // src
> +    ushll       v16.8h, v0.8b, 6
> +    ushll2      v17.8h, v0.16b, 6
> +    ushll       v18.8h, v1.8b, 6
> +    ushll2      v19.8h, v1.16b, 6
> +    ld1         { v20.8h, v21.8h, v22.8h, v23.8h }, [x4], x10   // src2
> +    sqadd       v16.8h, v16.8h, v20.8h
> +    sqadd       v17.8h, v17.8h, v21.8h
> +    sqadd       v18.8h, v18.8h, v22.8h
> +    sqadd       v19.8h, v19.8h, v23.8h

> +    sqrshrun    v0.8b,  v16.8h, 7
> +    sqrshrun2   v0.16b, v17.8h, 7
> +    sqrshrun    v1.8b,  v18.8h, 7
> +    sqrshrun2   v1.16b, v19.8h, 7

Does pairing help here?

    sqrshrun    v0.8b,  v16.8h, 7
    sqrshrun    v1.8b,  v18.8h, 7
    sqrshrun2   v0.16b, v17.8h, 7
    sqrshrun2   v1.16b, v19.8h, 7
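
Reordering must not change the result, so a scalar model is handy for verifying any shuffle. Below is a minimal C sketch of what I read the quoted sequence as computing: the source sample shifted left by 6, a saturating 16-bit add with src2, then a rounding right shift by 7 saturated to unsigned 8 bits. The function names are hypothetical, and the 64-element src2 row length is an assumption taken from the 128-byte stride in x10.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed src2 row length in int16_t elements */

/* Model of sqadd on one 16-bit lane: add with signed saturation. */
static int16_t sat_add_s16(int a, int b)
{
    int v = a + b;
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of "sqrshrun ..., 7" on one lane: add the rounding constant 64,
 * shift right by 7, saturate to 0..255. Negative sums saturate to 0, so
 * they are handled before the shift to avoid shifting a negative value. */
static uint8_t sqrshrun7(int16_t v)
{
    int r = v + 64;
    if (r < 0) return 0;
    r >>= 7;
    return (uint8_t)(r > 255 ? 255 : r);
}

static void put_pel_bi_pixels_ref(uint8_t *dst, ptrdiff_t dststride,
                                  const uint8_t *src, ptrdiff_t srcstride,
                                  const int16_t *src2, int height, int width)
{
    assert(width > 0 && width <= MAX_PB_SIZE);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = sqrshrun7(sat_add_s16(src[x] << 6, src2[x]));
        dst  += dststride;
        src  += srcstride;
        src2 += MAX_PB_SIZE;
    }
}
```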

[...]
> +    sqrshrun    v0.8b,  v16.8h, 7
> +    sqrshrun2   v0.16b, v17.8h, 7
> +    sqrshrun    v1.8b,  v18.8h, 7
> +    sqrshrun2   v1.16b, v19.8h, 7
> +    sqrshrun    v2.8b,  v20.8h, 7
> +    sqrshrun2   v2.16b, v21.8h, 7
> +    sqrshrun    v3.8b,  v22.8h, 7
> +    sqrshrun2   v3.16b, v23.8h, 7

Again, this might be a good candidate for attempting to shuffle the
instructions and see if it helps (there are many other places, I picked
one randomly).

> +.Lepel_filters:

const/endconst + align might be better for all these labels

[...]
> +function ff_hevc_put_hevc_epel_hv12_8_neon, export=1
> +    add         x10, x3, 3
> +    lsl         x10, x10, 7
> +    sub         sp, sp, x10     // tmp_array
> +    stp         x0, x3, [sp, -16]!
> +    stp         x5, x30, [sp, -16]!
> +    add         x0, sp, 32
> +    sub         x1, x1, x2
> +    add         x3, x3, 3
> +    bl          ff_hevc_put_hevc_epel_h12_8_neon
> +    ldp         x5, x30, [sp], 16
> +    ldp         x0, x3, [sp], 16
> +    load_epel_filterh x5, x4
> +    mov         x5, 112
> +    mov         x10, 128
> +    ld1         { v16.8h, v17.8h }, [sp], x10
> +    ld1         { v18.8h, v19.8h }, [sp], x10
> +    ld1         { v20.8h, v21.8h }, [sp], x10
> +1:  ld1         { v22.8h, v23.8h }, [sp], x10
> +    calc_epelh  v4, v16, v18, v20, v22
> +    calc_epelh2 v4, v5, v16, v18, v20, v22
> +    calc_epelh  v5, v17, v19, v21, v23
> +    st1         { v4.8h }, [x0], 16
> +    st1         { v5.4h }, [x0], x5
> +    subs        x3, x3, 1
> +    b.eq        2f
> +

> +    ld1         { v16.8h, v17.8h }, [sp], x10
> +    calc_epelh  v4, v18, v20, v22, v16
> +    calc_epelh2 v4, v5, v18, v20, v22, v16
> +    calc_epelh  v5, v19, v21, v23, v17
> +    st1         { v4.8h }, [x0], 16
> +    st1         { v5.4h }, [x0], x5
> +    subs        x3, x3, 1
> +    b.eq        2f
> +
> +    ld1         { v18.8h, v19.8h }, [sp], x10
> +    calc_epelh  v4, v20, v22, v16, v18
> +    calc_epelh2 v4, v5, v20, v22, v16, v18
> +    calc_epelh  v5, v21, v23, v17, v19
> +    st1         { v4.8h }, [x0], 16
> +    st1         { v5.4h }, [x0], x5
> +    subs        x3, x3, 1
> +    b.eq        2f
> +
> +    ld1         { v20.8h, v21.8h }, [sp], x10
> +    calc_epelh  v4, v22, v16, v18, v20
> +    calc_epelh2 v4, v5, v22, v16, v18, v20
> +    calc_epelh  v5, v23, v17, v19, v21
> +    st1         { v4.8h }, [x0], 16
> +    st1         { v5.4h }, [x0], x5
> +    subs        x3, x3, 1
> +    b.ne        1b

Introducing macros probably makes sense in these functions: the four
unrolled blocks differ only in which registers are rotated.
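
To illustrate why one parameterized body should be enough, here is a hedged C sketch of the vertical pass as I understand it: a 4-tap filter over a sliding window of four rows, where each step replaces the oldest row and rotates the roles of the other three. It assumes calc_epelh applies the four taps and shifts right by 6 (the macro is not shown in the quote), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed tmp/dst row length in int16_t elements */

/* One output sample of a 4-tap filter. Assumes an arithmetic right shift
 * for negative sums, as common compilers implement it. */
static int16_t epel_v_tap(const int8_t f[4], int a, int b, int c, int d)
{
    return (int16_t)((f[0] * a + f[1] * b + f[2] * c + f[3] * d) >> 6);
}

/* Straightforward version: index the four rows directly. */
static void epel_v_direct(int16_t *dst, const int16_t *tmp,
                          const int8_t f[4], int height, int width)
{
    assert(width > 0 && width <= MAX_PB_SIZE);
    for (int y = 0; y < height; y++, dst += MAX_PB_SIZE)
        for (int x = 0; x < width; x++)
            dst[x] = epel_v_tap(f, tmp[(y + 0) * MAX_PB_SIZE + x],
                                   tmp[(y + 1) * MAX_PB_SIZE + x],
                                   tmp[(y + 2) * MAX_PB_SIZE + x],
                                   tmp[(y + 3) * MAX_PB_SIZE + x]);
}

/* Rotating version: four row slots, one new row loaded per step, like the
 * v16/v18/v20/v22 rotation in the unrolled asm. */
static void epel_v_ring(int16_t *dst, const int16_t *tmp,
                        const int8_t f[4], int height, int width)
{
    const int16_t *row[4] = { tmp, tmp + MAX_PB_SIZE,
                              tmp + 2 * MAX_PB_SIZE, NULL };
    const int16_t *next = tmp + 3 * MAX_PB_SIZE;

    assert(width > 0 && width <= MAX_PB_SIZE);
    for (int y = 0; y < height; y++) {
        row[(y + 3) & 3] = next;            /* overwrite the oldest slot */
        for (int x = 0; x < width; x++)
            dst[x] = epel_v_tap(f, row[y & 3][x], row[(y + 1) & 3][x],
                                   row[(y + 2) & 3][x], row[(y + 3) & 3][x]);
        next += MAX_PB_SIZE;
        dst  += MAX_PB_SIZE;
    }
}
```

In the asm the rotation is done by renaming registers rather than indexing, which is why a macro taking the register names as arguments would cover all four blocks.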

[...]
> +8:  b           9f                              // 0
> +    nop
> +    nop
> +    nop
> +    st1         { v29.b }[0], [x7]              // 1
> +    b           9f
> +    nop
> +    nop
> +    st1         { v29.h }[0], [x7]              // 2
> +    b           9f
> +    nop
> +    nop
> +    st1         { v29.h }[0], [x7], 2           // 3
> +    st1         { v29.b }[2], [x7]
> +    b           9f
> +    nop
> +    st1         { v29.s }[0], [x7]              // 4
> +    b           9f
> +    nop
> +    nop
> +    st1         { v29.s }[0], [x7], 4           // 5
> +    st1         { v29.b }[4], [x7]
> +    b           9f
> +    nop
> +    st1         { v29.s }[0], [x7], 4           // 6
> +    st1         { v29.h }[2], [x7]
> +    b           9f
> +    nop
> +    st1         { v29.s }[0], [x7], 4           // 7
> +    st1         { v29.h }[2], [x7], 2
> +    st1         { v29.b }[6], [x7]

What are these nops for? align?
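
My guess, to be confirmed: every case in the quoted table is padded to four instructions (16 bytes), so the entry for a width n could be reached with a computed branch to label 8 plus 16*n. A minimal C sketch of that reading follows; store_partial and slot_offset are hypothetical names, and the slot size is inferred only from counting the instructions in the quote.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the low n bytes (0..7) of an 8-byte vector, mirroring the
 * per-case st1 sequences (s = 4 bytes, h = 2 bytes, b = 1 byte). */
static void store_partial(uint8_t *dst, const uint8_t v[8], int n)
{
    assert(n >= 0 && n <= 7);
    switch (n) {
    case 0: break;
    case 1: dst[0] = v[0]; break;
    case 2: memcpy(dst, v, 2); break;
    case 3: memcpy(dst, v, 2); dst[2] = v[2]; break;
    case 4: memcpy(dst, v, 4); break;
    case 5: memcpy(dst, v, 4); dst[4] = v[4]; break;
    case 6: memcpy(dst, v, 4); memcpy(dst + 4, v + 4, 2); break;
    case 7: memcpy(dst, v, 4); memcpy(dst + 4, v + 4, 2); dst[6] = v[6]; break;
    }
}

/* Byte offset of case n from the table start, assuming each case is
 * padded with nops to four 4-byte instructions. */
static size_t slot_offset(int n)
{
    assert(n >= 0 && n <= 7);
    return (size_t)n * 4 * 4;
}
```

If that is the intent, a comment saying so (or a .align per case) would make it clearer than bare nops.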

[...]

Anyway, can you split your patch? It's really a lot of code and there is
no way anyone can review it properly quickly.

I also think macros would be welcome in many places to reduce the size of
the patch(es).

Regards,

-- 
Clément B.