[FFmpeg-user] Illegal memory access when using nvdec

Dennis Mungai dmngaie at gmail.com
Sun Mar 8 07:36:08 EET 2020


On Tue, 3 Mar 2020 at 21:23, Philip Langdale <philipl at overt.org> wrote:
>
> On Sun, 1 Mar 2020 07:16:05 +0300
> Dennis Mungai <dmngaie at gmail.com> wrote:
>
> > Hello there,
> >
> > I've run into some scenarios where a long-running FFmpeg process
> > configured to use NVDEC crashes with an "illegal memory access"
> > error message, something related to CUDA.
> >
> > I'm unable to consistently reproduce this issue with concurrent runs
> > as I'm transcoding live channels (provided as mpegts udp streams).
> > When I'm back at my desk I'll try to copy and paste the exact error
> > message and the FFmpeg command used.
> >
> > Are there private options that can be passed to the NVDEC hwaccel for
> > maximum stability? I've seen the use of -extra_hw_frames 2 being
> > recommended on a related ticket that reported a segfault when
> > handling encodes with B-frames and a deinterlace filter in the same
> > flow, but I've been unable to replicate that workaround here.
> >
> > Warm regards,
> >
> > Dennis.
>
> I'd need to see the error, and ideally a backtrace to even begin
> investigating this. From your description, if it's a SIGABRT from
> inside the cuda library, then it's likely an internal cuda issue -
> perhaps related to a memory leak that only becomes an issue for very
> long decode periods. And then nvidia people would need to look at it.
>
> Thanks,
>
>
> --phil

Hello there,

I think I've stumbled upon the solution.

The fix is to set this environment variable: CUDA_DEVICE_ORDER=PCI_BUS_ID
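
For anyone who wants to try this: the variable only needs to be in the
environment of the ffmpeg process before CUDA initializes, so exporting
it from the launching shell (or the service's unit file / wrapper
script) is enough. A minimal sketch:

    # Make CUDA enumerate GPUs in PCI bus order, matching nvidia-smi:
    export CUDA_DEVICE_ORDER=PCI_BUS_ID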

When you have multiple NVENC-capable GPUs on the same host, the device
index that CUDA returns by default differs from what nvidia-smi reports.
CUDA's default ordering (FASTEST_FIRST) assigns index 0 to whatever it
considers the fastest GPU on the system, or, worse, to whatever was
enumerated first at boot (and this can change across reboots), and this
heuristic runs even when identical GPUs are installed.
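
To see the ordering nvidia-smi uses (it always enumerates by PCI bus
ID, unlike CUDA's default), something like this works:

    # List GPUs in PCI bus order, with their bus IDs:
    nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
    # Under CUDA's default FASTEST_FIRST ordering (see the docs link
    # below), CUDA's device 0 may be a different card than this index 0.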

The real disaster unfolds when the nvdec/cuda hwaccel is in use with
-hwaccel_output_format set (keeping decoded frames in GPU memory
instead of downloading them to system memory) and -hwaccel_device
pinned to a specific device: even within the same run, that device
index *will* change *if* a filter chained to the hwaccel, say scale_npp
or scale_cuda, is re-initialized. On re-initialization, there is no
guarantee that the device index matches the one known to the prior
context, and boom, a segfault (as described above).
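
For reference, a sketch of the kind of command line where this bites,
with the fix applied up front; the udp addresses, resolution and
bitrate below are placeholders rather than my exact production values:

    CUDA_DEVICE_ORDER=PCI_BUS_ID ffmpeg \
        -hwaccel cuda -hwaccel_output_format cuda -hwaccel_device 0 \
        -i 'udp://239.1.1.1:5000' \
        -vf scale_npp=1280:720 \
        -c:v h264_nvenc -b:v 4M -c:a aac \
        -f mpegts 'udp://239.1.1.2:5000'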

Setting that variable above completely eliminates the problem. Back to
happy camping :-)

Carl's observation, backed by Phil, proved to be most telling: just
because it triggers a segfault doesn't necessarily mean it's an FFmpeg
problem.

Documentation on the same:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

And other threads describing similar issues with CUDA's default device
ordering behavior when multiple GPUs are installed:

1. https://devtalk.nvidia.com/default/topic/605113/cuda-programming-and-performance/no-gpu-selected-code-working-properly-hows-this-possible-/?offset=11#3939141

2. https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/

Hope this is of help to someone else who stumbles on the same issue(s).

