[FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

Sat Nov 18 20:31:00 EET 2017

> On 18 November 2017 at 17:35, Rafal Dabrowa <fatwildcat at gmail.com> wrote:
>
> This is a proposal for performance optimizations of 8-bit
> hevc video decoding on the aarch64 platform with the neon (simd) extension.
>
> I'm testing my optimizations on a NanoPi M3 device, mainly with the
> "Big Buck Bunny" video file at 1280x720.
> The video file was pulled from the libde265.org page, see
> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
> The movie duration is 00:10:34.53.
>
> Overall performance gain is about 2x. Without optimizations the movie
> playback stops in practice after a few seconds. With
> optimizations the file is played smoothly 99% of the time.
>
> For performance testing the following command was used:
>
>     time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
>
> The video file was pre-read before test to minimize disk reads during
> testing.
> Program execution time without optimization was as follows:
>
> real    11m48.576s
> user    43m8.111s
> sys     0m12.469s
>
> Execution time with optimizations:
>
> real    6m17.046s
> user    21m19.792s
> sys     0m14.724s
>
>
> The patch contains optimizations for the most heavily used qpel, epel, sao
> and idct functions. Two of the functions available for optimization are
> used intensively but are not optimized in this patch:
> hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8.
> I have no idea how they could be optimized, so I left them unoptimized.
>
>
>
> Signed-off-by: Rafal Dabrowa <fatwildcat at gmail.com>
> ---
>  libavcodec/aarch64/Makefile               |    5 +
>  libavcodec/aarch64/hevcdsp_epel_8.S       | 3949 ++++++++++++++++++++
>  libavcodec/aarch64/hevcdsp_idct_8.S       | 1980 ++++++++++
>  libavcodec/aarch64/hevcdsp_init_aarch64.c |  170 +
>  libavcodec/aarch64/hevcdsp_qpel_8.S       | 5666 +++++++++++++++++++++++++++++
>  libavcodec/aarch64/hevcdsp_sao_8.S        |  166 +
>  libavcodec/hevcdsp.c                      |    2 +
>  libavcodec/hevcdsp.h                      |    1 +
>  8 files changed, 11939 insertions(+)
>  create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
>  create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
>  create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
>  create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
>  create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S

Very nice.
The way we test SIMD is to put START_TIMER; and
STOP_TIMER("function_name"); (both are defined in libavutil/timer.h) around
the place where the function gets called in the C code. Then we do one run
with the C code (no SIMD) and a separate run with whatever SIMD
optimizations we're implementing. We take the last printed value from each
run and use those two values to measure the speedup.
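
To illustrate the measurement pattern, here is a minimal standalone sketch.
It does not use libavutil/timer.h: BENCH_START, BENCH_STOP, BenchStat,
add_bias_c and time_calls are hypothetical names made up for this example,
and it reads wall-clock nanoseconds via C11 timespec_get rather than
whatever counter FFmpeg reads on a given platform. It only shows the idea of
bracketing a call site, accumulating per-call cost, and comparing two runs.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical accumulator: total time and call count for one function. */
typedef struct BenchStat {
    const char *name;
    uint64_t    total_ns;
    uint64_t    runs;
} BenchStat;

/* Wall-clock time in nanoseconds (C11 timespec_get). */
static uint64_t bench_now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static void bench_record(BenchStat *stat, uint64_t elapsed_ns)
{
    stat->total_ns += elapsed_ns;
    stat->runs++;
}

/* Stand-ins for the timer macros: place one before and one after the call. */
#define BENCH_START      uint64_t bench_t0 = bench_now_ns()
#define BENCH_STOP(stat) bench_record(&(stat), bench_now_ns() - bench_t0)

/* Average cost per call; this is the value compared between the two runs. */
static double bench_average(const BenchStat *stat)
{
    assert(stat->runs > 0);
    return (double)stat->total_ns / (double)stat->runs;
}

/* Speedup of the SIMD run relative to the C reference run. */
static double bench_speedup(const BenchStat *c_ref, const BenchStat *simd)
{
    return bench_average(c_ref) / bench_average(simd);
}

/* Hypothetical scalar routine standing in for a DSP function under test. */
static void add_bias_c(uint8_t *dst, const uint8_t *src, int n, int bias)
{
    for (int i = 0; i < n; i++) {
        int v = src[i] + bias;
        dst[i] = (uint8_t)(v > 255 ? 255 : (v < 0 ? 0 : v));
    }
}

/* Call site with the timer wrapped around the function call. */
static unsigned time_calls(BenchStat *stat, int iterations)
{
    uint8_t  src[64] = { 0 }, dst[64];
    unsigned checksum = 0;

    for (int i = 0; i < iterations; i++) {
        BENCH_START;
        add_bias_c(dst, src, 64, 3);
        BENCH_STOP(*stat);
        checksum += dst[0]; /* keep the result live */
    }
    return checksum;
}
```

One would fill one BenchStat from a build that uses the C routine and
another from a build that uses the SIMD routine, then pass both to
bench_speedup(). With averages of 200 ns and 100 ns per call, for instance,
the result is 2.0.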

I don't think there's a need yet to split the patch into multiple patches,
one for each individual version, though. That's usually only done if some
function's C implementation is faster than the SIMD code.