[FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

Sat Nov 18 20:41:13 EET 2017

On 11/18/2017 3:31 PM, Rostislav Pehlivanov wrote:
>>
>>
>>
>> On 18 November 2017 at 17:35, Rafal Dabrowa <fatwildcat at gmail.com> wrote:
>>
>> This patch proposes performance optimizations for 8-bit HEVC video
>> decoding on the aarch64 platform with the NEON (SIMD) extension.
>>
>> I'm testing my optimizations on a NanoPi M3 device, mainly with the
>> "Big Buck Bunny" video file at 1280x720.
>> The video file was downloaded from the libde265.org page, see
>> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
>> The movie duration is 00:10:34.53.
>>
>> The overall performance gain is about 2x (roughly 1.9x in wall-clock time
>> and 2.0x in CPU time, per the timings below). Without the optimizations,
>> playback effectively stalls after a few seconds. With the
>> optimizations, the file plays smoothly 99% of the time.
>>
>> For performance testing the following command was used:
>>
>>     time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
>>
>> The video file was pre-read before the test to minimize disk reads during
>> testing.
>> Program execution time without the optimizations was as follows:
>>
>> real    11m48.576s
>> user    43m8.111s
>> sys     0m12.469s
>>
>> Execution time with the optimizations:
>>
>> real    6m17.046s
>> user    21m19.792s
>> sys     0m14.724s
>>
>>
>> The patch contains optimizations for the most heavily used qpel, epel, sao
>> and idct functions. Among the functions available for optimization, two are
>> intensively used but not optimized in this patch:
>> hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8.
>> I have no idea how they could be optimized,
>> hence I left them without optimizations.
>>
>>
>>
>> Signed-off-by: Rafal Dabrowa <fatwildcat at gmail.com>
>> ---
>>  libavcodec/aarch64/Makefile               |    5 +
>>  libavcodec/aarch64/hevcdsp_epel_8.S       | 3949 ++++++++++++++++++++
>>  libavcodec/aarch64/hevcdsp_idct_8.S       | 1980 ++++++++++
>>  libavcodec/aarch64/hevcdsp_init_aarch64.c |  170 +
>>  libavcodec/aarch64/hevcdsp_qpel_8.S       | 5666 +++++++++++++++++++++++++++++
>>  libavcodec/aarch64/hevcdsp_sao_8.S        |  166 +
>>  libavcodec/hevcdsp.c                      |    2 +
>>  libavcodec/hevcdsp.h                      |    1 +
>>  8 files changed, 11939 insertions(+)
>>  create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
>>  create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
>>  create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
>>  create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
>>  create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S
> 
> 
> 
> Very nice.
> The way we test SIMD is to put START_TIMER and
> STOP_TIMER("function_name"); (they're defined in libavutil/timer.h) around
> where the function gets called in the C code, then we do a run with the C
> code (no SIMD) and a separate run with whatever SIMD optimizations we're
> implementing. We take the last printed value of both runs and that's what's
> used to measure speedup.
> 
> I don't think there's a need yet to split the patch into multiple patches
> for each individual version, though; that's usually only done if some
> function's C implementation is faster than the SIMD code.

It would be nice however to at least split it into two patches, one for
MC and one for SAO.

Also, is there no way to use macros in aarch64 asm files? ~11k lines of code
is a lot to add, and I'm sure a sizable portion is duplicated with only some
small differences between functions.
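
For reference, the "about 2x" figure can be checked directly against the
`time` output quoted above. The following is only a rough sketch, not part of
the patch: the helper name to_seconds is made up for illustration, and the
numbers are copied verbatim from the two runs in the original mail.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '11m48.576s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Figures copied from the two runs quoted in the original mail.
baseline = {"real": "11m48.576s", "user": "43m8.111s"}
optimized = {"real": "6m17.046s", "user": "21m19.792s"}

# Speedup = time without optimizations / time with optimizations.
speedup = {
    key: to_seconds(baseline[key]) / to_seconds(optimized[key])
    for key in baseline
}

for key, value in speedup.items():
    print(f"{key}: {value:.2f}x")
```

This prints 1.88x for "real" and 2.02x for "user". These are whole-program
figures that include demuxing and writing the output, so they are not a
substitute for per-function timings.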