[FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC
sw at jkqxz.net
Sun Apr 22 22:15:30 EEST 2018
On 19/04/18 17:00, Oscar Amoros Huguet wrote:
> We changed 4 files in ffmpeg, libavcodec/nvdec.c, libavutil/hwcontext.c, libavutil/hwcontext_cuda.h, libavutil/hwcontext_cuda.c.
> The purpose of this modification is very simple. For performance reasons (per-frame execution time), we needed nvdec.c to use the same CUDA context as we use in our software.
> The reason for this is not so simple, and twofold:
> - We wanted to remove the overhead of having the GPU constantly switching contexts, as we use up to 8 nvdec instances at the same time, plus a lot of CUDA computations.
> - For video synchronization and buffering purposes, after decoding we need to download the frame from GPU to CPU, but in a non-blocking and overlapped (with computation and other transfers) manner, so the impact of the transfer is almost zero.
> In order to do the latter, we need to be able to synchronize our manually created CUDA stream with the CUDA stream being used by ffmpeg, which by default is the Legacy default stream.
> To do so, we need to be in the same CUDA context, otherwise we don't have access to the Legacy CUDA stream being used by ffmpeg.
> The consequence is that, without changing ffmpeg code, the transfer of the frame from GPU to CPU could not be asynchronous: if made asynchronous, it overlapped with the device-to-device cuMemcpy made internally by ffmpeg, and the resulting frames were (many times) a mix of two frames.
> So what did we change?
> - Outside of the ffmpeg code, we allocate an AVBufferRef with av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_CUDA), and we access the AVCUDADeviceContext associated, to set the CUDA context (cuda_ctx).
> - We modified the libavutil/hwcontext.c call av_hwdevice_ctx_create() so that it detects that the AVBufferRef being passed was allocated externally. We don't check that AVHWDeviceType is AV_HWDEVICE_TYPE_CUDA. Let us know if you think we should check that, and otherwise go back to default behavior.
> - If the AVBufferRef was allocated, then we skip the allocation call, and pass the data as AVHWDeviceContext type to cuda_device_create.
> - We modified libavutil/hwcontext_cuda.c in several parts:
> - cuda_device_create detects if there is a cuda context already present in the AVCUDADeviceContext, and if so, sets the new parameter AVCUDADeviceContext.is_ctx_externally_allocated to 1.
> - This way, all the successive calls to this file take into account that ffmpeg is not responsible for the creation, thread binding/unbinding, or destruction of the CUDA context.
> - Also, we skip context push and pop if the context was passed externally (especially in non-initialization calls), to reduce the number of calls to the CUDA runtime and improve the execution times of the CPU threads using ffmpeg.
> With this, we managed to have all the CUDA calls in the application in the same CUDA context. Also, we use the CUDA per-thread default stream, so in order to sync with the CUDA stream used by ffmpeg, we only had to put the GPU to CPU copy on the globally accessible cudaStreamPerThread CUDA stream.
> So, of the 33 ms of available time we have per frame, we save more than 6 ms that were being used by the blocking copies from GPU to CPU.
> We considered further optimizing the code by changing ffmpeg so that it can internally access cudaStreamPerThread and use cuMemcpyAsync, so that the device-to-device copies are also asynchronous and overlapped with the rest of the computation, but the time saved is much lower, and we have other optimizations to do in our code that can save more time.
> Nevertheless, if you find this last optimization interesting, let us know.
> Also, please let us know about anything we did wrong or missed.
You've missed that the main feature you are adding is already present. Look at av_hwdevice_ctx_alloc() + av_hwdevice_ctx_init(), which uses an existing device supplied by the user; av_hwdevice_ctx_create() is only for creating new devices which will be managed internally.
I don't know how well the other part, eliding context push/pop operations, will work (someone with more Nvidia knowledge may wish to comment on that), but it shouldn't depend on whether the context was created externally. If you want to add that flag then it should probably be called something like "single global context" to make clear what it actually means. Also note that there are more push/pop pairs in the codebase (e.g. for NVENC and in libavfilter), and they may all need to be updated to respect this flag as well.
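To illustrate what I mean by such a flag, here is a toy model in plain C. Everything in it is hypothetical: DemoDeviceContext, demo_ctx_push(), demo_ctx_pop() and the single_global_context field do not exist in FFmpeg or CUDA, and the counters only stand in for the real context push/pop calls. The point is that the elision depends on what the user promises about the context being current, not on who created it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device context; not an FFmpeg type. */
typedef struct DemoDeviceContext {
    int single_global_context; /* user promises the context is always current */
    int push_calls;            /* counts simulated context pushes */
    int pop_calls;             /* counts simulated context pops */
} DemoDeviceContext;

/* Simulated push: skipped entirely when the flag is set. */
static void demo_ctx_push(DemoDeviceContext *d)
{
    assert(d != NULL);
    if (!d->single_global_context)
        d->push_calls++;
}

/* Simulated pop: skipped entirely when the flag is set. */
static void demo_ctx_pop(DemoDeviceContext *d)
{
    assert(d != NULL);
    if (!d->single_global_context)
        d->pop_calls++;
}

/* Every push/pop pair in the codebase would go through the same helpers. */
static void demo_copy_frame(DemoDeviceContext *d)
{
    demo_ctx_push(d);
    /* ... device work would happen here ... */
    demo_ctx_pop(d);
}
```

Routing every pair through the same two helpers is what would keep NVENC and the libavfilter users consistent with the decoder.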