[FFmpeg-devel] [patch][OpenHEVC]added ASM functions for epel + qpel

Pierre Edouard Lepere Pierre-Edouard.Lepere at insa-rennes.fr
Mon Mar 3 10:29:27 CET 2014

Thanks for the feedback.
Here's a patch adding the align 16 for the local arrays.
This patch also replaces most of the repeated code with %rep, which makes things more readable and the file considerably smaller.

I'm not looking into adding 32-bit support right now.

For timings, on BasketBallDrive_1920x1080_50_qp27, single-threaded decoding goes from 26 seconds without the optimizations to 21 seconds with them.

- Pierre-Edouard

----- Original message -----
From: "Christophe Gisquet" <christophe.gisquet at gmail.com>
To: "FFmpeg development discussions and patches" <ffmpeg-devel at ffmpeg.org>
Sent: Saturday, 1 March 2014 08:11:48
Subject: Re: [FFmpeg-devel] [patch][OpenHEVC]added ASM functions for epel + qpel


2014-02-28 15:24 GMT+01:00 Pierre Edouard Lepere
<Pierre-Edouard.Lepere at insa-rennes.fr>:
> Here are 2 patches for the HEVC decoder:
> 1) Changes in the C for epel and qpel. It is now possible to have fixed-width functions for each epel/qpel function.
> 2) Adding ASM files. Each function has a fixed width and has its loop unrolled.

A very cursory look from me.

You now have arrays that avoid unpacking the coefficients. Good.
Please put an "align 16" (32 for avx2?) on the line before
hevc_epel_filters_asm_8 to guarantee the coeffs addresses are aligned.
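
The alignment matters because movdqa with a memory operand faults unless the address is a multiple of 16. As a rough C analogue of the assembler's "align 16" directive, the sketch below pins a coefficient table to a 16-byte boundary and checks it. The table name, its contents, and both helper functions are hypothetical and only illustrate the idea; they are not taken from the patch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical coefficient table, for illustration only: the values and
 * layout are NOT those of hevc_epel_filters_asm_8 in the patch. */
alignas(16) static const int8_t example_filters[2][16] = {
    { -2, 58, -2, 58, -2, 58, -2, 58, -2, 58, -2, 58, -2, 58, -2, 58 },
    { 10, -2, 10, -2, 10, -2, 10, -2, 10, -2, 10, -2, 10, -2, 10, -2 },
};

/* Returns 1 when p sits on a 16-byte boundary, which is what an aligned
 * 128-bit load such as movdqa requires. */
static int is_aligned16(const void *p)
{
    assert(p != NULL);
    return ((uintptr_t)p & 15) == 0;
}

/* Returns 1 when every 16-byte row of the table is itself aligned. Since
 * each row is exactly 16 bytes, aligning the table aligns every row. */
static int example_rows_aligned(void)
{
    for (size_t i = 0; i < 2; i++)
        if (!is_aligned16(example_filters[i]))
            return 0;
    return 1;
}
```

Without the alignas specifier the compiler is free to place an int8_t array at any address, which is the C-side equivalent of omitting "align 16" before the table in the assembly file.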

A next step would be to do (e.g. in QPEL_FILTER) something like:
%if ARCH_X86_64
movdqa           m12, [rfilterq + %2q + 16]
%define COEFFS23 m12
%else
%define COEFFS23 [rfilterq + %2q + 16]
%endif
But someone caring for 32-bit systems may do that in your stead.

I also see you doing a lot of movdqu m?, [%2q+N] with -4<N<5. I think
this qualifies for SSSE3's palignr but this might need some
benchmarking to validate.
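
To make the palignr suggestion concrete: the instruction concatenates two 16-byte registers, shifts the 32-byte value right by a constant number of bytes, and keeps the low 16 bytes. Two loads of adjacent blocks can therefore replace a series of overlapping movdqu loads. The sketch below is a portable scalar model of that behaviour, not SIMD code, and the function names are hypothetical. Negative offsets work the same way once the first block starts before the pixel of interest.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of palignr for 0 <= n <= 16: treat hi:lo as one 32-byte
 * value (lo holds the lower-addressed bytes), shift it right by n bytes
 * and keep the low 16 bytes. */
static void palignr_model(uint8_t dst[16], const uint8_t hi[16],
                          const uint8_t lo[16], unsigned n)
{
    uint8_t cat[32];
    assert(n <= 16);
    memcpy(cat, lo, 16);
    memcpy(cat + 16, hi, 16);
    memcpy(dst, cat + n, 16);
}

/* Returns 1 if, for every offset n in 0..16, the model applied to two
 * adjacent 16-byte blocks equals a direct 16-byte read at src + n,
 * i.e. what a separate unaligned load would have fetched. */
static int palignr_matches_unaligned_loads(void)
{
    uint8_t src[32], via_palignr[16];

    for (unsigned i = 0; i < 32; i++)
        src[i] = (uint8_t)(i * 7 + 3);      /* arbitrary test pattern */

    for (unsigned n = 0; n <= 16; n++) {
        palignr_model(via_palignr, src + 16, src, n);
        if (memcmp(via_palignr, src + n, 16) != 0)
            return 0;
    }
    return 1;
}
```

Whether trading loads for shuffles is actually faster depends on the CPU, which is why the benchmarking caveat applies.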

And that's the final comment. I don't know how you validate your
changes beyond correctness, but it is nice to provide timings to compare
before/after. If you decide to do that, include "libavutil/timer.h"
and add {START_TIMER and STOP_TIMER("some name")} around the
benchmarked function, run the program and check the decicycles
reported. It may require some logging flag on the command-line for
Make sure your CPU does not {under,over}clock across measurements by
setting an appropriate power profile.
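
For readers outside the FFmpeg tree, the same before/after measurement can be approximated with standard C. The sketch below is a stand-in and not FFmpeg's START_TIMER/STOP_TIMER: it reports average CPU time per call in nanoseconds rather than decicycles, and dummy_kernel and avg_ns_per_call are hypothetical names. Inside FFmpeg itself, wrapping the call with the macros described above is the simpler route.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the function being benchmarked. */
static unsigned dummy_kernel(unsigned seed)
{
    unsigned acc = seed;
    for (unsigned i = 0; i < 256; i++)
        acc = acc * 31u + i;
    return acc;
}

/* Calls fn `runs` times and returns the average CPU time per call in
 * nanoseconds, measured with the standard clock() function. The
 * accumulated result is written to *sink so the compiler cannot drop
 * the calls as dead code. */
static double avg_ns_per_call(unsigned (*fn)(unsigned), unsigned runs,
                              unsigned *sink)
{
    assert(runs > 0);

    unsigned acc = 0;
    clock_t t0 = clock();
    for (unsigned i = 0; i < runs; i++)
        acc += fn(i);
    clock_t t1 = clock();

    *sink = acc;
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / runs;
}
```

Averaging over many calls smooths out the coarse resolution of clock(). The result is still only comparable between runs made at the same CPU frequency, hence the advice about the power profile.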

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-added-align-16-for-local-variables.patch
Type: text/x-patch
Size: 79394 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20140303/98b2fe08/attachment.bin>
