[FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

Mon May 7 14:24:46 EEST 2018

On 26.04.2018 18:03, Oscar Amoros Huguet wrote:
> Thanks Mark,
> 
> You are right, we can implement a sort of "av_hwdevice_ctx_set" (which does not exist) in our own code by using av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init(). We already use av_hwdevice_ctx_alloc() in our code to drive the feature we implemented. We are not sure about the license implications, though: we link dynamically to comply with the LGPL. I guess both calls are public, since they are not in files labelled "internal".
> 
> We are perfectly OK with using av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init() outside ffmpeg to supply our own CUDA context. When we do so with the current ffmpeg code, the internal flag "AVCUDADeviceContextInternal.is_allocated" is not set to 1, so ffmpeg does not destroy the CUDA context in "cuda_device_uninit", which is the desired behavior.
> 
> In fact, when this flag is not set, it implies that the context was not allocated by ffmpeg. Maybe this is the right flag to use to skip the push/pop pairs when the CUDA context was not created by ffmpeg. What do you think?
> 
> We can adapt all of the push/pop pairs in the code to follow this policy, whichever flag is used.
> 
> About the performance effects of these push/pop calls: we have seen with NVIDIA profiling tools (NSIGHT for Visual Studio plugin) that the CUDA runtime detects that the context you want to set is the same as the one currently set, so the push call does nothing and takes 0.0016 ms on average (CPU time). But for some reason the cuCtxPopCurrent call takes longer and uses 0.02 ms of CPU time per call. This is 0.16 ms total per frame when decoding 8 feeds. This is small, but it's easy to remove.
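
To make the ownership rule from the quoted text concrete, here is a
minimal, self-contained sketch. The Demo* types and demo_* functions are
hypothetical stand-ins invented for illustration; they are not FFmpeg's
structs or API. Only the idea of an is_allocated flag deciding who
destroys the context comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types: they only mimic the idea, they are NOT FFmpeg's structs. */
typedef struct DemoOwnedContext { int alive; } DemoOwnedContext;

typedef struct DemoDevice {
    DemoOwnedContext *cuda_ctx; /* context used for decoding */
    int is_allocated;           /* 1 only if the library created cuda_ctx */
} DemoDevice;

/* Path A: the library sets up the context itself and therefore owns it. */
void demo_device_create(DemoDevice *dev, DemoOwnedContext *storage)
{
    storage->alive    = 1;
    dev->cuda_ctx     = storage;
    dev->is_allocated = 1;
}

/* Path B: the caller hands in its own context (the alloc + init route). */
void demo_device_wrap(DemoDevice *dev, DemoOwnedContext *external)
{
    dev->cuda_ctx     = external;
    dev->is_allocated = 0;
}

/* Teardown destroys the context only when the library owns it. */
void demo_device_uninit(DemoDevice *dev)
{
    assert(dev != NULL);
    if (dev->is_allocated && dev->cuda_ctx)
        dev->cuda_ctx->alive = 0; /* stands in for destroying the context */
    dev->cuda_ctx     = NULL;
    dev->is_allocated = 0;
}
```

With this model, a context passed in through demo_device_wrap() is still
alive after demo_device_uninit(), while one set up by
demo_device_create() is not.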

I'm not a fan of touching every single bit of CUDA-related code for
this. Push/Pop, especially for the context that's already active, should
be free. If it's not, that's something I'd complain to NVIDIA about.

For your specific use case, you could build FFmpeg with a custom version
of the ffnvcodec headers that replaces the push/pop ctx functions with
custom ones that are practically no-ops.
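
Roughly, such stubs could look like the sketch below. It is
self-contained on purpose: the Demo* types, the function table and all
demo_* names are hypothetical stand-ins, not the actual ffnvcodec
declarations, and a real build would use the types from the CUDA /
ffnvcodec headers instead. It also assumes the application has already
made its context current on every thread that ends up calling into the
decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declarations so the sketch compiles on its own. */
typedef int   DemoCUresult;   /* 0 plays the role of CUDA_SUCCESS */
typedef void *DemoCUcontext;

/* Hypothetical table holding the two entry points, loosely modelled on a
 * dynamically loaded function table. */
typedef struct DemoCudaFunctions {
    DemoCUresult (*ctx_push_current)(DemoCUcontext ctx);
    DemoCUresult (*ctx_pop_current)(DemoCUcontext *out_ctx);
} DemoCudaFunctions;

/* No-op push: assumes the application already made ctx current. */
DemoCUresult demo_noop_push(DemoCUcontext ctx)
{
    (void)ctx;
    return 0;
}

/* No-op pop: leaves the current context alone and reports nothing popped. */
DemoCUresult demo_noop_pop(DemoCUcontext *out_ctx)
{
    if (out_ctx)
        *out_ctx = NULL;
    return 0;
}

/* Install the stubs in place of the real driver entry points. */
void demo_install_noop_ctx_funcs(DemoCudaFunctions *fns)
{
    assert(fns != NULL);
    fns->ctx_push_current = demo_noop_push;
    fns->ctx_pop_current  = demo_noop_pop;
}
```

The trade-off is that nothing guards against the wrong context being
current any more, so this only makes sense when the application controls
the context on all threads involved.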

> Additionally, could you give your opinion on the feature we mentioned in the previous email and may want to add in the future? Basically, we may want to add one more CUDA function, specifically cuMemcpy2DAsync, and the possibility to set a CUStream in AVCUDADeviceContext, so that it is used with cuMemcpy2DAsync instead of cuMemcpy2D in "nvdec_retrieve_data" in libavcodec/nvdec.c. In our use case this would save up to 0.72 ms (GPU time) per frame when decoding 8 Full HD frames, and up to 0.5 ms (GPU time) per frame when decoding two 4K frames. This may sound like too little, but for us it is significant. Our software needs to do many things with CUDA on the GPU in a maximum of 33 ms per frame, and we have little GPU time left.

This is interesting and I'm considering making that the default, as it
would fit well with the current infrastructure, delaying the sync call
to the moment the frame leaves avcodec, which with the internal
re-ordering and delay should give plenty of time for the copy to finish.
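
As a rough illustration of that ordering (start the copy early, sync
only on output), here is a CPU-only sketch. It is not CUDA code:
DemoStream simulates a stream by queueing copies and running them on
sync, and every demo_* name is hypothetical. In a real implementation
the queueing call would presumably be cuMemcpy2DAsync and the sync a
stream synchronize, but that mapping is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_PENDING 8

/* Simulated stream: copies are queued and only executed on sync. */
typedef struct DemoCopy { void *dst; const void *src; size_t size; } DemoCopy;
typedef struct DemoStream {
    DemoCopy pending[DEMO_MAX_PENDING];
    size_t   nb_pending;
} DemoStream;

/* Plays the role of an asynchronous copy: queues the work and returns. */
int demo_copy_async(DemoStream *s, void *dst, const void *src, size_t size)
{
    if (s->nb_pending == DEMO_MAX_PENDING)
        return -1; /* queue full */
    s->pending[s->nb_pending++] = (DemoCopy){ dst, src, size };
    return 0;
}

/* Plays the role of a stream sync: all queued copies complete here. */
void demo_stream_sync(DemoStream *s)
{
    for (size_t i = 0; i < s->nb_pending; i++)
        memcpy(s->pending[i].dst, s->pending[i].src, s->pending[i].size);
    s->nb_pending = 0;
}

/* Decoder side: start the copy, do not wait for it. */
int demo_retrieve_data(DemoStream *s, void *frame_buf,
                       const void *surface, size_t size)
{
    return demo_copy_async(s, frame_buf, surface, size);
}

/* Output side: wait only when the frame is about to leave the decoder. */
void *demo_output_frame(DemoStream *s, void *frame_buf)
{
    assert(s != NULL && frame_buf != NULL);
    demo_stream_sync(s);
    return frame_buf;
}
```

The point of the split is that anything the decoder does between
demo_retrieve_data() and demo_output_frame() overlaps with the copy
instead of waiting for it.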
