[FFmpeg-trac] #9285(avcodec:new): Excessive GPU memory usage with nvdec hwaccel

FFmpeg trac at avcodec.org
Wed Jun 9 03:42:45 EEST 2021


#9285: Excessive GPU memory usage with nvdec hwaccel
-------------------------------------+-------------------------------------
             Reporter:  Ridley Combs |                     Type:  defect
               Status:  new          |                 Priority:  normal
            Component:  avcodec      |                  Version:  unspecified
             Keywords:  nvdec nvidia |               Blocked By:
             Blocking:               |  Reproduced by developer:  1
Analyzed by developer:  1            |
-------------------------------------+-------------------------------------
 When decoding video using the CUDA hwaccel, `ff_nvdec_decode_init()` sets
 both `ulNumDecodeSurfaces` and `ulNumOutputSurfaces` to
 `frames_ctx->initial_pool_size`. That value is set to `dpb_size + 2` by
 `ff_nvdec_frame_params()`, then has `extra_hw_frames` + `thread_count`
 added by `avcodec_get_hw_frames_parameters()`, and finally 3 more added by
 `ff_decode_get_hw_frames_ctx()`.

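 Spelled out, the accumulation looks roughly like this (paraphrased from
 the call chain above, not verbatim source):

     /* ff_nvdec_frame_params(): baseline for the hwaccel */
     frames_ctx->initial_pool_size = dpb_size + 2;

     /* avcodec_get_hw_frames_parameters(): threading/user headroom */
     frames_ctx->initial_pool_size += avctx->extra_hw_frames;
     frames_ctx->initial_pool_size += avctx->thread_count;

     /* ff_decode_get_hw_frames_ctx(): extra output frames for the consumer */
     frames_ctx->initial_pool_size += 3;

     /* ff_nvdec_decode_init(): both decoder pools get the full total */
     params.ulNumDecodeSurfaces = frames_ctx->initial_pool_size;
     params.ulNumOutputSurfaces = frames_ctx->initial_pool_size;
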
 This is excessive. Only `ulNumDecodeSurfaces` needs the additional
 surfaces based on thread count (the output surfaces are only used in
 `nvdec_retrieve_data`, which runs on the consumer's single thread), while
 only `ulNumOutputSurfaces` needs the 3 additional output frames from
 `ff_decode_get_hw_frames_ctx()` plus any from `extra_hw_frames` (the
 decode surfaces are never exposed to the consumer).
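
 For a rough sense of scale (illustrative numbers, not measurements): with
 a 16-frame DPB (e.g. the HEVC worst case), 16 threads on a 16-core
 machine, and no `extra_hw_frames`, both counts come out to
 16 + 2 + 16 + 3 = 37. At 4K in P010 a surface is on the order of 24 MB, so
 each of the two surface pools ends up near 0.9 GB, even though only a
 handful of output surfaces are ever mapped at once.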

 I'm not sure what the best way to handle this is. Maybe nvdec should
 ignore what the generic code sets `initial_pool_size` to altogether and
 instead calculate its buffer counts internally, duplicating the generic
 code's behavior only where appropriate? The `initial_pool_size` value
 seems to be designed for systems where the decoder's internal buffered
 frames are returned directly to the user, but that's not the case here.
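
 A minimal sketch of what that split could look like in
 `ff_nvdec_decode_init()` (hypothetical; the exact accounting would need to
 stay in sync with the generic code rather than hard-coding it):

     /* hypothetical: size the two pools independently */
     CUVIDDECODECREATEINFO params = { 0 };

     /* decode surfaces: DPB plus per-thread headroom; never user-visible */
     params.ulNumDecodeSurfaces = dpb_size + 2 + avctx->thread_count;

     /* output surfaces: only what the consumer can hold mapped at once */
     params.ulNumOutputSurfaces = 3 + FFMAX(avctx->extra_hw_frames, 0);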

 Additionally, decoder multithreading doesn't seem to serve any purpose
 when the CUDA hwaccel is in use; I see no performance gain with multiple
 threads versus one. Is it useful with any hardware decoder? Should we
 default multithreading off when a hwaccel is in use, or force it off
 unless the hwaccel fails and software fallback occurs? On many-core
 machines, where the default thread count is high, this adds up to some
 pretty hefty memory usage for no benefit.
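
 As a stopgap, users can cap the decoder thread count themselves, e.g.
 (illustrative command; `-threads` placed before `-i` applies to the
 decoder):

     ffmpeg -threads 1 -hwaccel cuda -i input.mp4 -f null -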
-- 
Ticket URL: <https://trac.ffmpeg.org/ticket/9285>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker
