[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

Mon May 18 21:41:37 CEST 2015

On 18.05.2015, at 12:37, Stefano Sabatini <stefasab at gmail.com> wrote:

> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini <stefasab at gmail.com>
> wrote:
> 
>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>> [...]
>>>> One limitation, as the manual said, is that the decoded data needs to
>>>> be copied from the GPU to system memory. ffmpeg_dxva2.c does not
>>>> implement an optimized copy function for this; it uses plain old memcpy.
>>>> Intel introduced a new instruction for this in SSE4.1, MOVNTDQA, which
>>>> is optimized for copying from USWC memory (Uncacheable Speculative
>>>> Write Combining) to system memory. Using this may help speed up the
>>>> process significantly, and VLC probably uses it.
>>> 
>>> Now the question is, how would it be possible to optimize the GPU to CPU
>>> copy to get an overall performance gain? At least VLC seems able to get
>>> better performance when using HW decoding, but I'm not sure it is
>>> copying decoded data back to the CPU (indeed it may perform direct
>>> rendering).
>> 
>> Self-reply:
>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>> Author: Laurent Aimar <fenrir at videolan.org>
>> Date:   Tue Nov 17 01:09:43 2009 +0100
>> 
>>    Improved performance when copying video surface in dxva2.
>> 
>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
>> instructions are available.
>> 
> 
> I have a first hackish patch. I performed some tests and got significant
> performance gains: on my Core i5 with Intel HD Graphics 4000, DXVA2 now
> matches the performance of the software decoder when decoding an H.264
> 1920x1080 video, while using only a single thread. The patch
> as is is a hack, since I had to modify the compilation flags to enable
> assembly compilation in the ffmpeg_dxva2.c file. I should probably create
> an optimized copy function in libavutil, comments are welcome.

What exactly is SSE4 needed for?
Both non-temporal movs and prefetches existed before it, so if that is critical for performance the fallback implementation is bad.
However possibly more important: why is a memcpy needed at all?
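
For reference, a minimal sketch of the kind of MOVNTDQA-based copy being discussed. This is not the ffmpeg_dxva2.c patch and not VLC's code: copy_from_uswc() and copy_movntdqa() are hypothetical names, and the sketch assumes GCC or Clang (it relies on the target attribute and __builtin_cpu_supports()). It uses streaming loads only when the CPU reports SSE4.1 and the source is 16-byte aligned, and falls back to plain memcpy otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#define HAVE_X86_STREAM_LOAD 1
#else
#define HAVE_X86_STREAM_LOAD 0
#endif

#if HAVE_X86_STREAM_LOAD
/* Hypothetical helper: copy using streaming loads, 64 bytes per iteration.
 * src must be 16-byte aligned; dst may be unaligned. */
__attribute__((target("sse4.1")))
static void copy_movntdqa(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t i = 0;

    assert(((uintptr_t)src & 15) == 0);

    for (; i + 64 <= size; i += 64) {
        /* Streaming (non-temporal) loads from the source buffer. */
        __m128i a = _mm_stream_load_si128((__m128i *)(src + i));
        __m128i b = _mm_stream_load_si128((__m128i *)(src + i + 16));
        __m128i c = _mm_stream_load_si128((__m128i *)(src + i + 32));
        __m128i d = _mm_stream_load_si128((__m128i *)(src + i + 48));

        /* Ordinary stores to normal (cacheable) system memory. */
        _mm_storeu_si128((__m128i *)(dst + i),      a);
        _mm_storeu_si128((__m128i *)(dst + i + 16), b);
        _mm_storeu_si128((__m128i *)(dst + i + 32), c);
        _mm_storeu_si128((__m128i *)(dst + i + 48), d);
    }

    /* Remaining tail of fewer than 64 bytes. */
    memcpy(dst + i, src + i, size - i);
}
#endif

/* Hypothetical entry point: picks the streaming-load path at runtime,
 * otherwise behaves exactly like memcpy. */
void copy_from_uswc(uint8_t *dst, const uint8_t *src, size_t size)
{
#if HAVE_X86_STREAM_LOAD
    if (((uintptr_t)src & 15) == 0 && __builtin_cpu_supports("sse4.1")) {
        copy_movntdqa(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

A caller would invoke copy_from_uswc(dst, src, size) once per row or plane of the locked surface. Whether this is actually faster than memcpy depends on the source really being USWC memory, so it would need measuring on the target hardware.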