[FFmpeg-devel] Added - HW accelerated H.264 and HEVC encoding for AMD GPUs based on AMF SDK

Mironov, Mikhail Mikhail.Mironov at amd.com
Sun Oct 29 22:48:26 EET 2017


> -----Original Message-----
> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf
> Of Mark Thompson
> Sent: October 29, 2017 3:36 PM
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] Added - HW accelerated H.264 and HEVC
> encoding for AMD GPUs based on AMF SDK
> 
> On 29/10/17 18:39, Mironov, Mikhail wrote:
> >>>      REGISTER_ENCODER(H263_V4L2M2M,      h263_v4l2m2m);
> >>>      REGISTER_ENCDEC (LIBOPENH264,       libopenh264);
> >>> +    REGISTER_ENCODER(H264_AMF,          h264_amf);
> >>> +	REGISTER_ENCODER(H264_AMF,          h264_amf_d3d11va);
> >>
> >> No tabs.  Why is the d3d11 version separate?  The encoder should be
> >> able to accept multiple pixfmts.
> >>
> >
> > It does accept multiple formants but there is a code that searches for
> > accelerator name in the encoder name and unless it is present disables
> > passing accelerator to encoder. See hw_device_setup_for_encode().
> 
> That code is a temporary hack, please don't assume it.  A setup something
> like <https://lists.libav.org/pipermail/libav-devel/2017-
> October/085223.html> is intended to replace it (not yet done for encode),
> which will not require that sort of nastiness.  (Feel free to comment on that.)

OK, this would be way better. I also didn't like the codec duplication,
so for now I will remove it and mark the codec as HW-accelerated when available.

> 
> I'm not sure how much you want an implicit device here anyway - you get a
> device from hw_frames_ctx for D3D11 input, and otherwise it doesn't
> matter?
> 
> >>> +    { AV_PIX_FMT_NV12,       AMF_SURFACE_NV12 },
> >>> +    { AV_PIX_FMT_BGRA,       AMF_SURFACE_BGRA },
> >>> +    { AV_PIX_FMT_ARGB,       AMF_SURFACE_ARGB },
> >>> +    { AV_PIX_FMT_RGBA,       AMF_SURFACE_RGBA },
> >>> +    { AV_PIX_FMT_GRAY8,      AMF_SURFACE_GRAY8 },
> >>> +    { AV_PIX_FMT_YUV420P,    AMF_SURFACE_YUV420P },
> >>> +    { AV_PIX_FMT_BGR0,       AMF_SURFACE_BGRA },
> >>> +    { AV_PIX_FMT_YUV420P,    AMF_SURFACE_YV12 },
> >>> +    { AV_PIX_FMT_YUYV422,    AMF_SURFACE_YUY2 },
> >>
> >> Do all of these formats actually work?
> >
> > This is just a translation table. Actual support is in
> >> AVCodec::pix_fmts
> 
> You answered some of the actual question responding to Carl, I'll reply there.
> 
> >>
> >>> +    { AV_PIX_FMT_D3D11,      AMF_SURFACE_NV12 },
> >>
> >> D3D11 surfaces need not be NV12.  The actual format is in
> >> AVHWFramesContext.sw_format - if you only support 8-bit then
> >> something nasty probably happens if you give it P010 surfaces.
> >>
> >
> > Agreed, but how should I define D3D11 NV12 as input format if I only have
> AV_PIX_FMT_D3D11?
> 
> Check sw_format afterwards.
> 

sw_format is part of AVHWFramesContext in hwcontext.h.
But how should I define the pix_fmts array in AVCodec? In nvenc.c, for example,
it is done via AV_PIX_FMT_CUDA. Is that wrong?

> >>> +};
> >>> +
> >>> +static enum AMF_SURFACE_FORMAT amf_av_to_amf_format(enum
> >> AVPixelFormat fmt)
> >>> +{
> >>> +    for (int i = 0; i < amf_countof(format_map); i++) {
> >>> +        if (format_map[i].av_format == fmt) {
> >>> +            return format_map[i].amf_format;
> >>> +        }
> >>> +    }
> >>> +    return AMF_SURFACE_UNKNOWN;
> >>> +}
> >>> +
> >>> +// virtual functions decalred
> >>
> >> What does this comment mean?
> >>
> >
> > These functions are virtual functions put in real virtual table.
> 
> IMO the comment is useless.  If you want to keep it, at least fix the typo.

I can remove it, no problem. Just wanted to explain.
 
> 
> >>> +
> >>> +    version_fun(&ctx->version);
> >>> +    init_fun(AMF_FULL_VERSION, &ctx->factory);
> >>> +    ctx->factory->pVtbl->GetTrace(ctx->factory, &ctx->trace);
> >>> +    ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug);
> >>
> >> Do all of these functions necessarily succeed?
> >>
> >
> > Yes.
> 
> Even in possible future API versions?  Why aren't they void, then?

Good question; I guess it is for consistency in the API. I will add error checking.

> 
> >>> +
> >>> +    // try to reuse existing DX device
> >>> +
> >>> +    if (avctx->hw_frames_ctx) {
> >>> +        AVHWFramesContext *device_ctx = (AVHWFramesContext*)avctx-
> >>> hw_frames_ctx->data;
> >>> +        if (device_ctx->device_ctx->type ==
> AV_HWDEVICE_TYPE_D3D11VA){
> >>> +            if (device_ctx->device_ctx->hwctx) {
> >>> +                AVD3D11VADeviceContext *device_d3d11 =
> >> (AVD3D11VADeviceContext *)device_ctx->device_ctx->hwctx;
> >>> +                res = ctx->context->pVtbl->InitDX11(ctx->context,
> >>> + device_d3d11-
> >>> device, AMF_DX11_1);
> >>> +                if (res == AMF_OK) {
> >>> +                    ctx->hw_frames_ctx = av_buffer_ref(avctx->hw_frames_ctx);
> >>> +                } else {
> >>> +                    av_log(avctx, AV_LOG_INFO, "amf_shared: avctx-
> >>> hw_frames_ctx has non-AMD device, switching to default");
> >>
> >> I'm not sure this is going to act sensibly - if the user has D3D11
> >> frames input on another device, does it work?
> >>
> >
> > If device is not AMD's the code is trying to create another device  - the
> compatible one .
> > In these cases the submission will be via system memory.
> 
> And that works with D3D11 frames as hw_frames_ctx on another device?

Why not? AMF will be initialized with a different device;
in the submission code this is detected, a surface in system (host) memory is allocated,
the data is copied using av_image_copy(), and the host memory is submitted to AMF.

> 
> >>> +    surface->pVtbl->SetPts(surface, frame->pts);
> >>
> >> Does this accept the same range as frame->pts, including
> AV_NOPTS_VALUE?
> >>
> >
> > Yes, encoder doesn’t use pts, just passes the value through for
> convenience.
> 
> What does it do with them, then?

If the caller submits a frame for encoding, it should be time-stamped on the output.
The encoder passes pts through, with one caveat:
if B-frames are used, it will reorder the pts values into monotonic order.
If the original pts or other data are needed, they can be passed through via a custom property.

> 
> >>> +
> >>> +int ff_amf_encode_frame(AVCodecContext *avctx, AVPacket *pkt,
> >>> +                        const AVFrame *frame, int *got_packet) {
> >>> +    int             ret = 0;
> >>> +    AMF_RESULT      res = AMF_OK;
> >>> +    AmfContext     *ctx = avctx->priv_data;
> >>> +    AMFSurface     *surface = 0;
> >>> +    AMFData        *data = 0;
> >>> +    amf_bool       submitted = false;
> >>> +
> >>> +    while (!submitted) {
> >>> +        if (!frame) { // submit drain
> >>> +            if (!ctx->eof) { // submit drain onre time only
> >>> +                res = ctx->encoder->pVtbl->Drain(ctx->encoder);
> >>> +                if (res == AMF_INPUT_FULL) {
> >>> +                    av_usleep(1000); // input queue is full: wait,
> >>> + poll and submit
> >> Drain again
> >>> +                                     // need to get some output and try again
> >>> +                } else if (res == AMF_OK) {
> >>> +                    ctx->eof = true; // drain started
> >>> +                    submitted = true;
> >>> +                }
> >>> +            }
> >>> +        } else { // submit frame
> >>> +            if (surface == 0) { // prepare surface from frame one time only
> >>> +                if (frame->hw_frames_ctx && ( // HW frame detected
> >>> +                                              // check if the same
> >>> + hw_frames_ctx as used in
> >> initialization
> >>> +                    (ctx->hw_frames_ctx &&
> >>> + frame->hw_frames_ctx->data == ctx-
> >>> hw_frames_ctx->data) ||
> >>> +                    // check if the same hw_device_ctx as used in initialization
> >>> +                    (ctx->hw_device_ctx &&
> >>> + ((AVHWFramesContext*)frame-
> >>> hw_frames_ctx->data)->device_ctx ==
> >>> +                    (AVHWDeviceContext*)ctx->hw_device_ctx->data)
> >>> +                )) {
> >>> +                    ID3D11Texture2D* texture =
> >>> + (ID3D11Texture2D*)frame-
> >>> data[0]; // actual texture
> >>> +                    int index = (int)(size_t)frame->data[1]; //
> >>> + index is a slice in
> >> texture array is - set to tell AMF which slice to use
> >>> +                    texture->lpVtbl->SetPrivateData(texture,
> >> &AMFTextureArrayIndexGUID, sizeof(index), &index);
> >>> +
> >>> +                    res =
> >>> + ctx->context->pVtbl->CreateSurfaceFromDX11Native(ctx-
> >>> context, texture, &surface, NULL); // wrap to AMF surface
> >>> +                    surface->pVtbl->SetCrop(surface, 0, 0,
> >>> + frame->width, frame-
> >>> height); // decode surfaces are vertically aligned by 16 tell AMF
> >>> real size
> >>> +                    surface->pVtbl->SetPts(surface, frame->pts);
> >>> +                } else {
> >>> +                    res =
> >>> + ctx->context->pVtbl->AllocSurface(ctx->context,
> >> AMF_MEMORY_HOST, ctx->format, avctx->width, avctx->height,
> &surface);
> >>> +                    AMF_RETURN_IF_FALSE(ctx, res == AMF_OK,
> >>> + AVERROR_BUG,
> >> "AllocSurface() failed  with error %d", res);
> >>> +                    amf_copy_surface(avctx, frame, surface);
> >>> +                }
> >>> +            }
> >>> +            // encode
> >>> +            res = ctx->encoder->pVtbl->SubmitInput(ctx->encoder,
> >> (AMFData*)surface);
> >>> +            if (res == AMF_INPUT_FULL) { // handle full queue
> >>> +                av_usleep(1000); // input queue is full: wait, poll
> >>> + and submit
> >> surface again
> >>
> >> Is there really no way in the API to wait for this properly?
> >>
> >
> > The AMF runtime is designed without threads and sleeps inside. It is
> > up to  application to poll output and this way make space in the input HW
> queue.
> > But if input queue is really full it does make sense to sleep and
> > continue polling to avoid unnecessary CPU burn.
> 
> Some APIs have a way to wait for an event for this rather than polling.
> Polling like this will wake the CPU pointlessly and waste power.  (And yes, I
> know lots of other places do it, but if you have a proper API for this then
> please use it.)

At this point there is no way to wait on completion. It may be available in the future.

> 
> >>> +            } else {
> >>> +                surface->pVtbl->Release(surface);
> >>> +                surface = NULL;
> >>> +                AMF_RETURN_IF_FALSE(ctx, res == AMF_OK,
> >> AVERROR_UNKNOWN, "SubmitInput() failed with error %d", res);
> >>> +                submitted = 1;
> >>> +            }
> >>> +        }
> >>> +        // poll results
> >>> +        if (!data) {
> >>> +            res = ctx->encoder->pVtbl->QueryOutput(ctx->encoder, &data);
> >>> +            if (data) {
> >>> +                AMFBuffer* buffer;
> >>> +                AMFGuid guid = IID_AMFBuffer();
> >>> +                data->pVtbl->QueryInterface(data, &guid,
> >>> + (void**)&buffer); //
> >> query for buffer interface
> >>> +                ret = amf_copy_buffer(avctx, pkt, buffer);
> >>> +                if (!ret)
> >>> +                    *got_packet = 1;
> >>> +                buffer->pVtbl->Release(buffer);
> >>> +                data->pVtbl->Release(data);
> >>> +                if (ctx->eof) {
> >>> +                    submitted = true; // we are in the drain state - no
> submissions
> >>> +                }
> >>> +            } else if (res == AMF_EOF) {
> >>> +                submitted = true; // drain complete
> >>> +            } else {
> >>> +                if (!submitted) {
> >>> +                    av_usleep(1000); // wait and poll again
> >>> +                }
> >>> +            }
> >>> +        }
> >>> +    }
> >>
> >> I suspect this setup is not actually going to follow the constraints
> >> of the deprecated encode2().  Given the API here, I think you would
> >> be much better off writing with send_frame()/receive_packet().
> >
> > I considered this, but without a thread that would call
> > receive_packet() the implementation will fall into the same pattern as it is
> now but with an additional queue of ready outputs.
> 
> See the documentation - you can return EAGAIN to send_packet() when a
> frame is available, no external queue is required.

I didn't debug it, but from visual code inspection this logic is available for decoders only.
For encoders, the avcodec_send_frame() call inside do_video_out() doesn't check for EAGAIN,
and inside avcodec_send_frame() there is no such check.

> 
> >>> +    AMFSize             framesize = AMFConstructSize(avctx->width, avctx-
> >>> height);
> >>> +    AMFRate             framerate = AMFConstructRate(avctx-
> >time_base.den,
> >> avctx->time_base.num * avctx->ticks_per_frame);
> >>
> >> Is VFR encoding not supported?
> >>
> >
> > The encoder uses frame rate only for rate control. It does not take in
> > account frame duration in case of VFR.
> 
> Sure.  (That should probably be documented somewhere.)
> 
> >>> +    if (avctx->rc_initial_buffer_occupancy != 0) {
> >>> +        int percent = avctx->rc_buffer_size * 64 / avctx-
> >>> rc_initial_buffer_occupancy;
> >>> +        if (percent > 64)
> >>> +            percent = 64;
> >>
> >> ???
> >>
> >
> > This is an attempt to translate to 1-64 range which is exposed by the
> encoder.
> 
> 1-64, yet the variable is called percent?  Sounds very suspicious.  If that is
> indeed the behaviour, please add a comment explaining it.

Sure, the AMF parameter range is 1 to 64; "percent" is not a good name for this variable.

> 
> >>> +    // Bitrate
> >>> +    AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
> >> AMF_VIDEO_ENCODER_TARGET_BITRATE, avctx->bit_rate);
> >>> +    if (ctx->rate_control_mode ==
> >> AMF_VIDEO_ENCODER_RATE_CONTROL_METHOD_CBR) {
> >>> +        AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
> >> AMF_VIDEO_ENCODER_PEAK_BITRATE, avctx->bit_rate);
> >>> +    } else {
> >>> +        int rc_max_rate = avctx->rc_max_rate >= avctx->bit_rate ?
> >>> + avctx-
> >>> rc_max_rate : avctx->bit_rate * 13 / 10;
> >>
> >> Where does 13/10 come from?
> >>
> >
> > For best results rc_max_rate should be bigger then bit_rate  for VBR. For
> CBR it is ignored.
> > What is the better way to correct an unset parameter?
> 
> Max rate should only be constrained by the HRD buffering in this case.  Are
> you sure this isn't handled internally if you don't supply the paramater at all?
> If not, maybe supply some sort of infinity to avoid it constraining anything
> appropriately.

My concern is VBR with peaks. For this mode, max rate defines the height of the peaks.
If it is not defined, the encoder will pick something valid, but the idea was to control
such things explicitly from FFmpeg.

> 
> >>> +
> >>> +    // fill extradata
> >>> +    AMFVariantInit(&var);
> >>
> >> Can this fail?
> >>
> > Not if var cleared with 0.
> 
> I would prefer that return values are checked for all functions which return
> them, in case of future API changes.  If the function can't return an error,
> why isn't it void?

I guess I will just add error checking everywhere to avoid confusion.

> 
> >>> +// encoder connected with D3D11 HW accelerator AVCodec
> >>> +ff_h264_amf_d3d11va_encoder = {
> >>> +    .name = "h264_amf_d3d11va",
> >>> +    .long_name = NULL_IF_CONFIG_SMALL("AMD AMF H.264 Encoder
> with
> >> d3d11va"),
> >>> +    .type = AVMEDIA_TYPE_VIDEO,
> >>> +    .id = AV_CODEC_ID_H264,
> >>> +    .init = amf_encode_init_h264,
> >>> +    .encode2 = ff_amf_encode_frame,
> >>> +    .close = ff_amf_encode_close,
> >>> +    .priv_data_size = sizeof(AmfContext),
> >>> +    .priv_class = &h264_amf_d3d11va_class,
> >>> +    .defaults = defaults,
> >>> +    .capabilities = AV_CODEC_CAP_DELAY,
> >>> +    .caps_internal = FF_CODEC_CAP_INIT_CLEANUP,
> >>> +    .pix_fmts = ff_amf_pix_fmts,
> >>> +};
> >>
> >> As above, why does this separate (identical) instance exist?
> >
> > See explanation about accelerator handling for encoders above.
> >
> >>> +static const AVOption options[] = {
> >>> +    { "usage",          "Set the encoding usage",             OFFSET(usage),
> >> AV_OPT_TYPE_INT,   { .i64 =
> >> AMF_VIDEO_ENCODER_HEVC_USAGE_TRANSCONDING },
> >> AMF_VIDEO_ENCODER_HEVC_USAGE_TRANSCONDING,
> >> AMF_VIDEO_ENCODER_HEVC_USAGE_WEBCAM, VE, "usage" },
> >>> +    { "transcoding",    "", 0, AV_OPT_TYPE_CONST, { .i64 =
> >> AMF_VIDEO_ENCODER_HEVC_USAGE_TRANSCONDING },         0, 0, VE,
> >> "usage" },
> >>> +    { "ultralowlatency","", 0, AV_OPT_TYPE_CONST, { .i64 =
> >> AMF_VIDEO_ENCODER_HEVC_USAGE_ULTRA_LOW_LATENCY },    0, 0, VE,
> >> "usage" },
> >>> +    { "lowlatency",     "", 0, AV_OPT_TYPE_CONST, { .i64 =
> >> AMF_VIDEO_ENCODER_HEVC_USAGE_LOW_LATENCY },          0, 0, VE,
> >> "usage" },
> >>> +    { "webcam",         "", 0, AV_OPT_TYPE_CONST, { .i64 =
> >> AMF_VIDEO_ENCODER_HEVC_USAGE_WEBCAM },               0, 0, VE,
> "usage" },
> >>
> >> Could some of this be in common with the H.264 encoder?  (Maybe in
> >> the
> >> header?)
> >>
> >
> > I tried to keep H264 and HAVC parameters completely separate. I was
> > asked by codec team to do so in AMF API  and did the same here.
> 
> Ok, sure.
> 
> 
> Some more random questions:
> 
> * How can we supply colour information to the codecs?  (VUI
> colour_primaries/transfer_characteristics/matrix_coefficients/chroma_samp
> le_loc_type.)

There is a very limited set of VUI parameters available: timing (shared with rate control via frame rate),
aspect ratio, video_full_range_flag, bitstream restriction, and a few others related to reordering.

> 
> * Does timing SEI (buffering_period/pic_timing) get generated for bitrate-
> targetted modes?  Is there any way to control that?

I will check tomorrow whether they are generated, but there is no control over it.

> 
> * When you say this will be used on Linux, is that going to be in Mesa
> (integrated with VAAPI as the Windows code is with D3D?) or will it be
> something in the proprietary drivers?

VAAPI on Linux is available in the open-source driver. The AMF version will be implemented via
Vulkan and will follow the Vulkan implementation in the driver and the open-source policy.
I cannot say more right now.

> 
> Thanks,
> 
> - Mark
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Thanks, Mikhail

