[FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
Oscar Amoros Huguet
oamoros at mediapro.tv
Thu Apr 26 19:03:00 EEST 2018
You are right, we can implement in our own code a sort of "av_hwdevice_ctx_set" (which does not exist) by combining av_hwdevice_ctx_alloc() and av_hwdevice_ctx_init(). We already use av_hwdevice_ctx_alloc() in our code for the feature we implemented. We are not sure about the license implications, though; we link dynamically in order to comply with the LGPL. I assume both calls are public API, since they are not declared in files labelled "internal".
We are perfectly fine with calling av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init() from outside FFmpeg in order to use our own CUDA context. With the current FFmpeg code, doing so leaves the internal flag AVCUDADeviceContextInternal.is_allocated unset, so the CUDA context is not destroyed by FFmpeg in cuda_device_uninit(), which is the desired behavior.
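For reference, this is roughly how we wire the externally created context in. This is a sketch: the CUcontext (here called "my_cuctx") is assumed to have been created earlier by the application, and the helper name is ours, not part of FFmpeg:

```c
#include <libavutil/hwcontext.h>
#include <libavutil/hwcontext_cuda.h>

/* Sketch: wrap an application-owned CUcontext in an AVHWDeviceContext
 * without letting FFmpeg take ownership of it. */
static AVBufferRef *wrap_cuda_context(CUcontext my_cuctx)
{
    AVBufferRef *device_ref = av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_CUDA);
    if (!device_ref)
        return NULL;

    AVHWDeviceContext   *device_ctx = (AVHWDeviceContext *)device_ref->data;
    AVCUDADeviceContext *cuda_ctx   = device_ctx->hwctx;

    /* Hand FFmpeg the existing context. Since we never go through
     * cuda_device_create(), is_allocated stays 0, and FFmpeg will not
     * destroy the context in cuda_device_uninit(). */
    cuda_ctx->cuda_ctx = my_cuctx;

    if (av_hwdevice_ctx_init(device_ref) < 0) {
        av_buffer_unref(&device_ref);
        return NULL;
    }
    return device_ref;
}
```

The returned buffer reference can then be attached to a decoder context as hw_device_ctx in the usual way.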
In effect, when this flag is 0, the context was not allocated by FFmpeg. Maybe this is the right flag to check in order to skip the push/pop pairs when the CUDA context was not created by FFmpeg. What do you think?
We can adapt all of the push/pop pairs in the code to follow this policy, whichever flag ends up being used.
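To illustrate the shape of the change, a guarded pair in libavcodec/nvdec.c might look like the following. This is only a sketch: "ctx_is_ffmpeg_owned" stands in for however the is_allocated information ends up being exposed (it currently lives in the internal struct, so the real patch would need a small accessor or an equivalent public flag):

```c
/* Sketch of a guarded push/pop pair as it might look in nvdec.c.
 * When the application owns the context, it is assumed to already be
 * current on the calling thread, so the push/pop can be skipped. */
CUcontext dummy;

if (ctx_is_ffmpeg_owned)
    CHECK_CU(decoder->cudl->cuCtxPushCurrent(cuda_ctx->cuda_ctx));

/* ... NVDEC/CUVID work that requires the context to be current ... */

if (ctx_is_ffmpeg_owned)
    CHECK_CU(decoder->cudl->cuCtxPopCurrent(&dummy));
```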
About the performance cost of these push/pop calls: we have seen with NVIDIA's profiling tools (the NSIGHT plugin for Visual Studio) that the CUDA runtime detects that the context you want to set is the same as the one currently set, so the push call does nothing and takes 0.0016 ms on average (CPU time). For some reason, though, cuCtxPopCurrent takes noticeably longer, about 0.02 ms of CPU time per call. That adds up to 0.16 ms per frame when decoding 8 feeds. This is small, but it is easy to remove.
Additionally, could you give your opinion on a feature we may also want to add in the future, which we mentioned in the previous email? Basically, we would like to expose one more CUDA function, cuMemcpy2DAsync, and add the possibility of setting a CUstream in AVCUDADeviceContext, so that it is used with cuMemcpy2DAsync instead of cuMemcpy2D in nvdec_retrieve_data() in libavcodec/nvdec.c. In our use case this would save up to 0.72 ms of GPU time per frame when decoding 8 full-HD streams, and up to 0.5 ms of GPU time per frame when decoding two 4K streams. This may sound like very little, but for us it is significant: our software has at most 33 ms of CUDA GPU time per frame to do everything it needs, and we have little GPU time left.
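Concretely, the copy in nvdec_retrieve_data() would essentially become something like the following. This is a sketch, not the actual patch: devptr, frame, pitch, offset, and the plane index i come from the surrounding function, and "stream" is the proposed new AVCUDADeviceContext field, which does not exist yet:

```c
/* Sketch: per-plane asynchronous copy on a user-provided stream,
 * replacing the current blocking cuMemcpy2D. */
CUDA_MEMCPY2D cpy = {
    .srcMemoryType = CU_MEMORYTYPE_DEVICE,
    .dstMemoryType = CU_MEMORYTYPE_DEVICE,
    .srcDevice     = devptr,
    .dstDevice     = (CUdeviceptr)frame->data[i],
    .srcPitch      = pitch,
    .dstPitch      = frame->linesize[i],
    .srcY          = offset,
    .WidthInBytes  = FFMIN(pitch, frame->linesize[i]),
    .Height        = frame->height >> (i ? 1 : 0),
};

/* Enqueue the copy without blocking the CPU; the caller either
 * synchronizes the stream or queues later GPU work on the same
 * stream, which is ordered after the copy. */
ret = CHECK_CU(decoder->cudl->cuMemcpy2DAsync(&cpy, stream));
```

Later kernels launched on the same stream would then consume the decoded frame with no CPU-side wait, which is where the per-frame GPU-time saving above comes from.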
You can see all of the above in the following image: https://ibb.co/hASLZH
Thank you for your time.