[FFmpeg-devel] [Patch] CUDA Thumbnail Filter
sw at jkqxz.net
Mon Sep 11 13:03:28 EEST 2017
On 11/09/17 10:18, Timo Rothenpieler wrote:
> Am 11.09.2017 um 07:40 schrieb Yogender Gupta:
>>>> Only 3 to 4 times? This is easily doable with SIMD.
>> The problem is not with the thumbnail filter at all; it is the transfers from vidmem to sysmem and back. You will observe that a transcode pipeline with hwaccel cuvid runs much faster than one without it (using the hardware encoder/decoder in both cases). Adding more transfers through a CPU-based filter only degrades performance further.
>> The CUDA thumbnail filter can work directly on the video memory without requiring an additional vidmem to sysmem transfer.
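The difference described above can be illustrated with two command lines (a sketch, assuming the hwaccel cuvid behaviour of FFmpeg builds from around this time; filenames are hypothetical):

```shell
# Fully GPU-resident transcode: with -hwaccel cuvid the decoder hands
# CUDA frames straight to nvenc, so frames never leave vidmem.
ffmpeg -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:v h264_nvenc output.mp4

# Without -hwaccel cuvid the decoded frames are copied down to sysmem
# and then uploaded again for encoding, even though both codecs run on
# the GPU -- this is the transfer cost being discussed.
ffmpeg -c:v h264_cuvid -i input.mp4 -c:v h264_nvenc output.mp4
```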
> I also really don't see the concern with adding CUDA versions of already existing filters.
> They are not included in any standard build, and require both non-free and the cuda-sdk to be even built in the first place.
> For their specific use case of a fully hardware-accelerated transcode and filter pipeline they clearly offer benefits. Especially when the final encode is to be done with nvenc and/or when operating on huge frames (4K or maybe even bigger), using the GPU has clear benefits, and I doubt any SIMD implementation will be able to compensate for it.
> Another scenario where a 100% GPU pipeline becomes essential is when you are processing _a lot_ of streams on one machine. You can freely put more GPUs in and gain more VMEM and Cores to work with, without interfering with the others.
> If there is a single CPU based filter anywhere in that chain you will very quickly be bottlenecked by it and the copying to and from sysmem.
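To make the bottleneck concrete: inserting a CPU filter into an otherwise GPU-resident chain forces a download and re-upload per frame. A hypothetical command line (filter names `hwdownload`/`hwupload_cuda` exist in FFmpeg; the filename and exact chain are illustrative):

```shell
# The thumbnail filter here runs on the CPU, so every frame must be
# copied vidmem -> sysmem (hwdownload) and back (hwupload_cuda).
ffmpeg -hwaccel cuvid -c:v h264_cuvid -i input.mp4 \
       -vf "hwdownload,format=nv12,thumbnail,hwupload_cuda" \
       -c:v h264_nvenc output.mp4
```

A CUDA-native thumbnail filter would remove both copies, which is the whole point of the patch.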
> Concerning the OpenCL infrastructure that was just posted to the list:
> It would indeed be nice if there were a way to map CUDA frames to OpenCL and the other way around. But I am not aware of any interoperability there, and Nvidia has more than a big enough market share on server and cloud GPUs (see for example AWS) to make adding CUDA-based filters worthwhile.
It would be nice, yes, but I'm not sure there is actually that much need with the current setup. The use cases for the two as currently written don't really overlap: CUDA is useful in the cases you describe, with (possibly multiple) high-power GPUs trying to squeeze as much performance as possible out of a system to run many streams, while my OpenCL work is intended to be useful on random low-power devices, where doing more on the GPU can make the difference between managing real time or not on a small number of streams.
On this filter in particular, I find thumbnail a slightly weird choice for a GPU version, but if it works in essentially the same way as the software filter and someone has a use case for it, then sure. (Caveat: I haven't actually read it; I'm not familiar with CUDA at all.) An "N times speedup" metric or a comparison with some SIMD-optimized CPU implementation is essentially irrelevant, because that isn't the point: even when slower than the CPU implementation, there can still be value in it not running on the CPU. (This probably won't happen with CUDA, because it only runs on high-power devices, but it is certainly possible on mobile devices with OpenCL.)
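For readers unfamiliar with the filter being ported: the selection logic of FFmpeg's software thumbnail filter can be sketched roughly as follows (a simplified illustration based on the filter's documented behaviour, not on the CUDA patch itself; the real filter works on RGB histograms of batches of frames):

```python
# Sketch of thumbnail-style frame selection: build a histogram per
# frame, average the histograms over the batch, and pick the frame
# whose histogram is closest (least squared error) to that average.
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 256) -> List[int]:
    """Count pixel-value occurrences for one (grayscale) frame."""
    hist = [0] * bins
    for px in frame:
        hist[px] += 1
    return hist


def pick_thumbnail(frames: List[Sequence[int]]) -> int:
    """Return the index of the most 'representative' frame in the batch."""
    hists = [histogram(f) for f in frames]
    n = len(frames)
    avg = [sum(h[i] for h in hists) / n for i in range(256)]
    # Smallest sum of squared differences to the average histogram wins.
    errors = [sum((h[i] - avg[i]) ** 2 for i in range(256)) for h in hists]
    return errors.index(min(errors))
```

The per-frame histogramming and the per-bin error sums are exactly the kind of embarrassingly parallel work a CUDA kernel handles well, which is presumably what the patch parallelizes.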