[FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Mon May 7 18:13:27 EEST 2018

On 07.05.2018 at 17:05, Oscar Amoros Huguet wrote:
> To clarify what I said in my last email: when I said CUDA non-blocking streams, I meant non-default streams. All non-blocking streams are non-default streams, but a non-default stream can be blocking or non-blocking with respect to the default stream. https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
> 
> So using cuMemcpyAsync would allow the memory copies to overlap with any other copy or kernel execution enqueued in any other non-default stream. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
> 
> If cuStreamSynchronize has to be called right after the last cuMemcpyAsync call, I see different ways of implementing this, but you will probably prefer the following:
> 
> Add cuMemcpyAsync to the list of CUDA functions.
> Add a field of type CUstream to AVCUDADeviceContext, and set it to 0 (zero) by default. Let's name it "CUstream cuda_stream"?
> Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as the last parameter. cuMemcpyAsync(..., ..., ..., cuda_stream);
> After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream. cuStreamSynchronize(cuda_stream);
> 
> If the user does not change the context and the stream, the behavior will be exactly the same as it is now, with no synchronization hazards, because passing "0" as the CUDA stream makes the calls blocking, as if they weren't asynchronous calls.
> 
> But if the user wants the copies to overlap with the rest of their application, they can set their own CUDA context and their own non-default stream.
> 
> In either case, ffmpeg does not have to handle CUDA stream creation and destruction, which makes it simpler.
> 
> Hope you like it!

A different idea I'm looking at right now is to get rid of the memcpy 
entirely, turning the mapped cuvid frame into an AVFrame itself, with a 
buffer_ref that unmaps the cuvid frame when freeing it, instead of 
allocating a whole new buffer and copying it over.
I'm not sure how that will play out with available free surfaces, but I 
will test.
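
To make the idea concrete, here is a minimal plain-C sketch of "unmap when the last reference goes away". It does not use FFmpeg or NVDEC at all: MappedSurface, RefBuf and every function in it are hypothetical stand-ins, and the real thing would presumably hang the unmap off the free callback that av_buffer_create() accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mapped decoder surface (not an NVDEC type). */
typedef struct MappedSurface {
    int mapped;       /* 1 while the surface is still mapped */
    int unmap_calls;  /* how many times the unmap callback ran */
} MappedSurface;

/* Hypothetical reference-counted buffer with a user-supplied free callback. */
typedef struct RefBuf {
    void *data;
    int   refcount;
    void (*free_cb)(void *opaque, void *data);
    void *opaque;
} RefBuf;

/* Wrap existing memory; no pixel data is allocated or copied. */
static RefBuf *refbuf_create(void *data,
                             void (*free_cb)(void *opaque, void *data),
                             void *opaque)
{
    RefBuf *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->data     = data;
    b->refcount = 1;
    b->free_cb  = free_cb;
    b->opaque   = opaque;
    return b;
}

/* Take an additional reference to the same buffer. */
static RefBuf *refbuf_ref(RefBuf *b)
{
    assert(b && b->refcount > 0);
    b->refcount++;
    return b;
}

/* Drop one reference; the free callback runs only for the last one. */
static void refbuf_unref(RefBuf **pb)
{
    RefBuf *b = *pb;
    *pb = NULL;
    if (!b)
        return;
    if (--b->refcount == 0) {
        b->free_cb(b->opaque, b->data);
        free(b);
    }
}

/* Free callback: stands in for unmapping the decoder surface. */
static void unmap_surface(void *opaque, void *data)
{
    MappedSurface *s = opaque;
    (void)data;
    s->mapped = 0;
    s->unmap_calls++;
}

/* Returns the number of unmap calls if the surface stayed mapped until
 * the last reference was dropped, or -1 otherwise. */
static int demo_unmap_on_last_unref(void)
{
    MappedSurface surface = { 1, 0 };
    unsigned char pixels[16] = { 0 };
    int mapped_after_first_unref;

    RefBuf *a = refbuf_create(pixels, unmap_surface, &surface);
    if (!a)
        return -1;
    RefBuf *b = refbuf_ref(a);

    refbuf_unref(&a);                      /* one reference still alive */
    mapped_after_first_unref = surface.mapped;
    refbuf_unref(&b);                      /* last reference: unmap runs */

    if (mapped_after_first_unref != 1 || surface.mapped != 0)
        return -1;
    return surface.unmap_calls;
}
```

The catch is the same one mentioned above: a surface stays occupied for as long as anyone holds a reference to the frame, so the number of free surfaces depends on how quickly the consumer releases frames.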

I'll also add the stream basically like you described, as it seems 
useful to have around anyway.

If the previously mentioned approach does not work, I'll implement this as 
described, probably for all cuMemcpy* calls in ffmpeg, as it at least runs 
the 2/3 plane copies asynchronously. Not sure if it can be changed to 
actually do them in parallel.
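
As a rough illustration of the call pattern only (queue every plane copy on one stream, then synchronize once after the last one), here is a plain-C model. It does not call CUDA: ModelStream, model_copy_async and model_stream_sync are hypothetical stand-ins for CUstream, cuMemcpyAsync and cuStreamSynchronize, and the "asynchronous" copies are simply deferred until the sync call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 8

/* Hypothetical model of a queued copy; not a CUDA type. */
typedef struct PendingCopy {
    void       *dst;
    const void *src;
    size_t      size;
} PendingCopy;

/* Hypothetical model of a stream: an in-order queue of pending copies. */
typedef struct ModelStream {
    PendingCopy pending[MAX_PENDING];
    int         count;  /* copies queued but not yet executed */
    int         syncs;  /* number of synchronize calls so far */
} ModelStream;

/* Models cuMemcpyAsync: record the copy and return immediately. */
static int model_copy_async(ModelStream *s, void *dst, const void *src,
                            size_t size)
{
    if (s->count >= MAX_PENDING)
        return -1;
    s->pending[s->count].dst  = dst;
    s->pending[s->count].src  = src;
    s->pending[s->count].size = size;
    s->count++;
    return 0;
}

/* Models cuStreamSynchronize: return only once all queued work is done. */
static void model_stream_sync(ModelStream *s)
{
    for (int i = 0; i < s->count; i++)
        memcpy(s->pending[i].dst, s->pending[i].src, s->pending[i].size);
    s->count = 0;
    s->syncs++;
}

/* Queue one copy per plane, then synchronize once after the last copy. */
static int copy_planes(ModelStream *s, unsigned char *dst[],
                       const unsigned char *src[], const size_t sizes[],
                       int nb_planes)
{
    for (int i = 0; i < nb_planes; i++)
        if (model_copy_async(s, dst[i], src[i], sizes[i]) < 0)
            return -1;
    model_stream_sync(s);
    return 0;
}

/* Copies a luma and a chroma plane; returns the number of sync calls
 * if both planes arrived intact, or -1 otherwise. */
static int demo_two_plane_copy(void)
{
    unsigned char luma_src[4]   = { 1, 2, 3, 4 };
    unsigned char chroma_src[2] = { 5, 6 };
    unsigned char luma_dst[4]   = { 0 };
    unsigned char chroma_dst[2] = { 0 };
    unsigned char       *dst[2] = { luma_dst, chroma_dst };
    const unsigned char *src[2] = { luma_src, chroma_src };
    size_t sizes[2] = { sizeof(luma_src), sizeof(chroma_src) };
    ModelStream stream;

    memset(&stream, 0, sizeof(stream));
    if (copy_planes(&stream, dst, src, sizes, 2) < 0)
        return -1;
    if (memcmp(luma_dst, luma_src, sizeof(luma_src)) ||
        memcmp(chroma_dst, chroma_src, sizeof(chroma_src)))
        return -1;
    return stream.syncs;
}
```

The model only shows the ordering: the caller pays for one synchronization per frame instead of one blocking call per plane. Whether the plane copies really overlap each other on the GPU is exactly the open question above, and this sketch says nothing about it.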
