[FFmpeg-devel] [PATCH]: Change Stack Frame Limit in Cuda Context

Mon Feb 5 22:01:13 EET 2018

Hi,

Do we have a conclusion on whether this patch can be pushed in?

Thanks,
Ben

On Tue, Jan 30, 2018 at 4:25 PM, Ben Chang <benchang621 at gmail.com> wrote:

>
>
> On Fri, Jan 26, 2018 at 3:10 PM, Mark Thompson <sw at jkqxz.net> wrote:
>
>> On 26/01/18 20:51, Ben Chang wrote:
>> > On Fri, Jan 26, 2018 at 3:32 AM, Mark Thompson <sw at jkqxz.net> wrote:
>> >
>> >> On 26/01/18 09:06, Ben Chang wrote:
>> >>> Thanks for the review Mark.
>> >>>
>>
>> To clarify, since it is less clear now with the trimmed context: my two
>> comments about this change are completely independent.  (Given that, maybe
>> it should be split into two parts - one for hwcontext and one for nvenc?)
>
> Sorry for the delay in reply Mark; been caught up by something else.
>
>>
>
>
>> This part is about the change to NVENC:
>>
>> >>>  There are some cuda kernels in the driver that may be invoked
>> depending
>> >> on
>> >>> the nvenc operations specified in the commandline. My observation from
>> >>> looking at the nvcc statistics is that most stack frame size for these
>> >> cuda
>> >>> kernels are 0 (highest observed was 120 bytes).
>> >>
>> >> Right, that makes sense.  If Nvidia is happy that this will always
>> work in
>> >> drivers compatible with this API version (including any future ones)
>> then
>> >> sure.
>> >>
>> > I am not saying this should be the "permanent" value for stack frame
>> size
>> > per GPU thread. However, at this moment (looking at existing cuda
>> kernels
>> > that devs have control over), I do not see this reduction as an issue.
>>
>> I think you should be confident that the chosen value here will last well
>> into the future for NVENC use.  Consider that this will end up in releases
>> - if a future Nvidia driver update happens to need a larger stack then all
>> previous releases and binaries will stop working for all users.
>>
>>
>> This part is about the change to the hwcontext device creation:
>>
>> >>>> This is technically a user-visible change, since it will apply to all
>> >> user
>> >>>> programs run on the CUDA context created here as well as those inside
>> >>>> ffmpeg.  I'm not sure how many people actually use that, though, so
>> >> maybe
>> >>>> it won't affect anyone.
>> >>>>
>> >>> In ffmpeg, I see vf_thumbnail_cuda and vf_scale_cuda available (not
>> sure
>> >> if
>> >>> there is more, but these two should not be affected by this
>> reduction).
>> >>> User can always raise the stack limit size if their own custom kernel
>> >>> require higher stack frame size.
>> >>
>> >> I don't mean filters inside ffmpeg, I mean a user program which
>> probably
>> >> uses NVDEC and/or NVENC (and possibly other things) from libavcodec but
>> >> then does its own CUDA processing with the same context.  This is
>> silently
>> >> changing the setup underneath it, and 128 feels like a very small
>> number.
>> >>
>> > Yes, this is really a trade off between reducing memory usage (since
>> there
>> > are numerous complaints of high memory usage preventing having more
>> ffmpeg
>> > instances) and user convenience (custom cuda implementation may be
>> > impacted). My thought (which can be wrong) is that users who implement
>> > their own cuda kernel may have better knowledge about cuda (eg. how much
>> > stack frame size their kernel needs or use cuda debugger to find out
>> what
>> > issue they may have). The size of the kernels are really implementation
>> > dependent (eg, allocating arrays in stacks or heap, recursions, how much
>> > register spills, etc) so stack frame sizes may vary widely. The default,
>> > 1024 bytes, may not be enough at times and user will need to adjust the
>> > stack limit accordingly anyway.
>>
>> Note that since you are changing a library the users in this context are
>> all ffmpeg library users.  So, it means any program or library which uses
>> ffmpeg, and transitively anyone who uses them.  The end-user need not be
>> informed about CUDA at all.
>>
>> (What you've said also makes it sound like it can change by compiler
>> version, but I guess such changes should be small.)
>>
>> >>>>
>> >>>> If the stack limit is violated, what happens?  Will that be undefined
>> >>>> behaviour with random effects (crash / incorrect results), or is it
>> >> likely
>> >>>> to be caught at program compile/load-time?
>> >>>>
>> >>> Stack will likely overflow and kernel will terminate (though I have
>> yet
>> >>> encounter this before).
>> >>
>> >> As long as the user gets a clear message that a stack overflow has
>> >> occurred so that they can realise that they need to raise the value
>> then it
>> >> should be fine.
>> >
>> > I believe you will see stack overflow if attached to cuda debugger. But
>> the
>> > default error may just be kernel launch error/failure. This goes back
>> to my
>> > opinion that cuda developer should figure this out relatively easy if
>> they
>> > want to customize the cuda part of their program.
>>
>> Ok, sure.  It's possibly unfortunate if a user who knows nothing about
>> CUDA rebuilds a program from source with newer ffmpeg libraries including
>> this change because something could stop working in an opaque way, but if
>> it errors out then it won't silently give the wrong answer and should be
>> diagnosable by the original developer.
>
> I have asked my colleagues about the effects of cuda kernel's stack frame
> exceeding the default stack limit size. Cuda driver should automatically
> increase the stack limit / allocate more memory. It will fail if there is
> no more memory available.
>
> I have done some experiments to validate this (by creating various sizes
> of arrays on stack in a cuda kernel). This is based on setting
> cuCtxSetLimit to 128 bytes in stack limit in ffmpeg. Having about 40000
> bytes of stack frame size in kernel will not result in failure (extra
> memory allocated by driver observed in GPU-Z). Its only when I have
> something ridiculous like 300,000 stack frame size does the encode fail
> with kernel launch failure.
>
> Therefore, I do not think this patch of reducing default stack limit will
> cause regression for users (both in nvenc & hwcontext device creation).
>
> In short, I am trying to say there is a fallback if the initial stack
> limit set if not enough. This fallback only fails there is no more memory
> available on gpu (in this cause, user will stumble upon this even with the
> default 1024 bytes).
>
> Thanks,
> Ben
>
>