[FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon
atomnuker at gmail.com
Sat Nov 18 20:31:00 EET 2017
> On 18 November 2017 at 17:35, Rafal Dabrowa <fatwildcat at gmail.com> wrote:
> This is a proposal of performance optimizations for 8-bit
> hevc video decoding on aarch64 platform with neon (simd) extension.
> I'm testing my optimizations on NanoPi M3 device. I'm using
> mainly "Big Buck Bunny" video file in format 1280x720 for testing.
> The video file was pulled from libde265.org page, see
> The movie duration is 00:10:34.53.
> Overall performance gain is about 2x. Without optimizations the movie
> playback stops in practice after a few seconds. With
> optimizations the file is played smoothly 99% of the time.
> For performance testing the following command was used:
> time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe
> - >/dev/null
> The video file was pre-read before test to minimize disk reads during
> Program execution time without optimization was as follows:
> real 11m48.576s
> user 43m8.111s
> sys 0m12.469s
> Execution time with optimizations:
> real 6m17.046s
> user 21m19.792s
> sys 0m14.724s
> The patch contains optimizations for most heavily used qpel, epel, sao and
> functions. Among the functions provided for optimization there are two
> intensively used, but not optimized in this patch:
> and hevc_h_loop_filter_luma_8. I have no idea how they could be optimized
> hence I left them without optimizations.
> Signed-off-by: Rafal Dabrowa <fatwildcat at gmail.com>
> libavcodec/aarch64/Makefile | 5 +
> libavcodec/aarch64/hevcdsp_epel_8.S | 3949 ++++++++++++++++++++
> libavcodec/aarch64/hevcdsp_idct_8.S | 1980 ++++++++++
> libavcodec/aarch64/hevcdsp_init_aarch64.c | 170 +
> libavcodec/aarch64/hevcdsp_qpel_8.S | 5666
> libavcodec/aarch64/hevcdsp_sao_8.S | 166 +
> libavcodec/hevcdsp.c | 2 +
> libavcodec/hevcdsp.h | 1 +
> 8 files changed, 11939 insertions(+)
> create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
> create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
> create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
> create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
> create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S
The way we test SIMD is to put START_TIMER; and
STOP_TIMER("function_name"); (they're located in libavutil/timer.h) around
where the function gets called in the C code, then we do a run with the C code (no
SIMD) and a separate run with whatever SIMD optimizations we're
implementing. We take the last printed value of both runs and that's what's
used to measure speedup.
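
To illustrate the procedure, here is a minimal standalone sketch. It does not use libavutil/timer.h: now_ns(), average_ns(), add_bytes_c() and add_bytes_unrolled() are hypothetical names made up for this example, and clock_gettime() stands in for the real timer macros. The "optimized" routine is just an unrolled C loop, not NEON code. The idea is the same, though: time the call site once per implementation and compare the two figures.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

typedef void (*add_fn)(uint8_t *dst, const uint8_t *src, int n);

/* Reference implementation (plays the role of the plain C code). */
static void add_bytes_c(uint8_t *dst, const uint8_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Stand-in for the optimized version: 4 bytes per iteration plus a tail.
 * A real patch would provide a NEON routine here instead. */
static void add_bytes_unrolled(uint8_t *dst, const uint8_t *src, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = (uint8_t)(dst[i]     + src[i]);
        dst[i + 1] = (uint8_t)(dst[i + 1] + src[i + 1]);
        dst[i + 2] = (uint8_t)(dst[i + 2] + src[i + 2]);
        dst[i + 3] = (uint8_t)(dst[i + 3] + src[i + 3]);
    }
    for (; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Time only the call itself, averaged over `runs` calls. */
static double average_ns(add_fn fn, int runs)
{
    static uint8_t dst[4096], src[4096];
    static volatile uint8_t sink;   /* keeps the result observable */
    uint64_t total = 0;

    assert(runs > 0);
    for (int r = 0; r < runs; r++) {
        uint64_t t0 = now_ns();     /* where the start marker would go */
        fn(dst, src, 4096);
        total += now_ns() - t0;     /* where the stop marker would go */
        sink = dst[r % 4096];
    }
    return (double)total / runs;
}

/* Print both averages; their ratio is the speedup figure. */
static void report_speedup(int runs)
{
    double c_ns   = average_ns(add_bytes_c, runs);
    double opt_ns = average_ns(add_bytes_unrolled, runs);
    printf("C: %.1f ns  optimized: %.1f ns\n", c_ns, opt_ns);
}
```

Calling report_speedup(1000) from main() prints the two averages. The numbers are only meaningful when both runs use the same input and the same build flags.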
I don't think there's a need yet to split the patch into multiple patches,
one for each individual version, though; that's usually only done if some
function's C implementation is faster than the SIMD code.