[FFmpeg-devel] Added - HW accelerated H.264 and HEVC encoding for AMD GPUs based on AMF SDK

Mark Thompson sw at jkqxz.net
Sun Oct 29 21:36:29 EET 2017


On 29/10/17 18:39, Mironov, Mikhail wrote:
>>>      REGISTER_ENCODER(H263_V4L2M2M,      h263_v4l2m2m);
>>>      REGISTER_ENCDEC (LIBOPENH264,       libopenh264);
>>> +    REGISTER_ENCODER(H264_AMF,          h264_amf);
>>> +	REGISTER_ENCODER(H264_AMF,          h264_amf_d3d11va);
>>
>> No tabs.  Why is the d3d11 version separate?  The encoder should be able to
>> accept multiple pixfmts.
>>
> 
> It does accept multiple formats but there is code that searches for the
> accelerator name in the encoder name and, unless it is present, disables
> passing the accelerator to the encoder. See hw_device_setup_for_encode().

That code is a temporary hack, please don't rely on it.  A setup something like <https://lists.libav.org/pipermail/libav-devel/2017-October/085223.html> is intended to replace it (not yet done for encode), which will not require that sort of nastiness.  (Feel free to comment on that.)

I'm not sure how much you want an implicit device here anyway - you get a device from hw_frames_ctx for D3D11 input, and otherwise it doesn't matter?

>>> +    { AV_PIX_FMT_NV12,       AMF_SURFACE_NV12 },
>>> +    { AV_PIX_FMT_BGRA,       AMF_SURFACE_BGRA },
>>> +    { AV_PIX_FMT_ARGB,       AMF_SURFACE_ARGB },
>>> +    { AV_PIX_FMT_RGBA,       AMF_SURFACE_RGBA },
>>> +    { AV_PIX_FMT_GRAY8,      AMF_SURFACE_GRAY8 },
>>> +    { AV_PIX_FMT_YUV420P,    AMF_SURFACE_YUV420P },
>>> +    { AV_PIX_FMT_BGR0,       AMF_SURFACE_BGRA },
>>> +    { AV_PIX_FMT_YUV420P,    AMF_SURFACE_YV12 },
>>> +    { AV_PIX_FMT_YUYV422,    AMF_SURFACE_YUY2 },
>>
>> Do all of these formats actually work?
> 
> This is just a translation table. Actual support is in AVCodec::pix_fmts.

You answered some of the actual question when responding to Carl; I'll reply there.

>>
>>> +    { AV_PIX_FMT_D3D11,      AMF_SURFACE_NV12 },
>>
>> D3D11 surfaces need not be NV12.  The actual format is in
>> AVHWFramesContext.sw_format - if you only support 8-bit then something
>> nasty probably happens if you give it P010 surfaces.
>>
> 
> Agreed, but how should I define D3D11 NV12 as input format if I only have AV_PIX_FMT_D3D11?

Check sw_format afterwards.
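
For example (untested sketch only; ctx->format as in your patch):

    if (avctx->pix_fmt == AV_PIX_FMT_D3D11) {
        AVHWFramesContext *frames_ctx =
            (AVHWFramesContext*)avctx->hw_frames_ctx->data;
        if (frames_ctx->sw_format != AV_PIX_FMT_NV12) {
            av_log(avctx, AV_LOG_ERROR, "Unsupported D3D11 surface "
                   "format %s; only NV12 is supported.\n",
                   av_get_pix_fmt_name(frames_ctx->sw_format));
            return AVERROR(EINVAL);
        }
        ctx->format = AMF_SURFACE_NV12;
    }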

>>> +};
>>> +
>>> +static enum AMF_SURFACE_FORMAT amf_av_to_amf_format(enum AVPixelFormat fmt)
>>> +{
>>> +    for (int i = 0; i < amf_countof(format_map); i++) {
>>> +        if (format_map[i].av_format == fmt) {
>>> +            return format_map[i].amf_format;
>>> +        }
>>> +    }
>>> +    return AMF_SURFACE_UNKNOWN;
>>> +}
>>> +
>>> +// virtual functions decalred
>>
>> What does this comment mean?
>>
> 
> These functions are virtual functions put in a real virtual table.

IMO the comment is useless.  If you want to keep it, at least fix the typo.

>>> +
>>> +    version_fun(&ctx->version);
>>> +    init_fun(AMF_FULL_VERSION, &ctx->factory);
>>> +    ctx->factory->pVtbl->GetTrace(ctx->factory, &ctx->trace);
>>> +    ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug);
>>
>> Do all of these functions necessarily succeed?
>>
> 
> Yes.

Even in possible future API versions?  Why aren't they void, then?
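
If they can fail, I'd expect the calls to be checked, e.g. (sketch, assuming they return AMF_RESULT, reusing the AMF_RETURN_IF_FALSE macro from later in the patch):

    res = version_fun(&ctx->version);
    AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN,
                        "Version query failed with error %d", res);

    res = init_fun(AMF_FULL_VERSION, &ctx->factory);
    AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN,
                        "AMF init failed with error %d", res);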

>>> +
>>> +    // try to reuse existing DX device
>>> +
>>> +    if (avctx->hw_frames_ctx) {
>>> +        AVHWFramesContext *device_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
>>> +        if (device_ctx->device_ctx->type == AV_HWDEVICE_TYPE_D3D11VA){
>>> +            if (device_ctx->device_ctx->hwctx) {
>>> +                AVD3D11VADeviceContext *device_d3d11 = (AVD3D11VADeviceContext *)device_ctx->device_ctx->hwctx;
>>> +                res = ctx->context->pVtbl->InitDX11(ctx->context, device_d3d11->device, AMF_DX11_1);
>>> +                if (res == AMF_OK) {
>>> +                    ctx->hw_frames_ctx = av_buffer_ref(avctx->hw_frames_ctx);
>>> +                } else {
>>> +                    av_log(avctx, AV_LOG_INFO, "amf_shared: avctx->hw_frames_ctx has non-AMD device, switching to default");
>>
>> I'm not sure this is going to act sensibly - if the user has D3D11 frames input
>> on another device, does it work?
>>
> 
> If the device is not AMD's, the code tries to create another, compatible device.
> In that case the submission will be via system memory.

And that works with D3D11 frames as hw_frames_ctx on another device?

>>> +    surface->pVtbl->SetPts(surface, frame->pts);
>>
>> Does this accept the same range as frame->pts, including AV_NOPTS_VALUE?
>>
> 
> Yes, the encoder doesn’t use pts, it just passes the value through for convenience.

What does it do with them, then?

>>> +
>>> +int ff_amf_encode_frame(AVCodecContext *avctx, AVPacket *pkt,
>>> +                        const AVFrame *frame, int *got_packet)
>>> +{
>>> +    int             ret = 0;
>>> +    AMF_RESULT      res = AMF_OK;
>>> +    AmfContext     *ctx = avctx->priv_data;
>>> +    AMFSurface     *surface = 0;
>>> +    AMFData        *data = 0;
>>> +    amf_bool       submitted = false;
>>> +
>>> +    while (!submitted) {
>>> +        if (!frame) { // submit drain
>>> +            if (!ctx->eof) { // submit drain one time only
>>> +                res = ctx->encoder->pVtbl->Drain(ctx->encoder);
>>> +                if (res == AMF_INPUT_FULL) {
>>> +                    av_usleep(1000); // input queue is full: wait, poll and submit Drain again
>>> +                                     // need to get some output and try again
>>> +                } else if (res == AMF_OK) {
>>> +                    ctx->eof = true; // drain started
>>> +                    submitted = true;
>>> +                }
>>> +            }
>>> +        } else { // submit frame
>>> +            if (surface == 0) { // prepare surface from frame one time only
>>> +                if (frame->hw_frames_ctx && ( // HW frame detected
>>> +                    // check if the same hw_frames_ctx as used in initialization
>>> +                    (ctx->hw_frames_ctx && frame->hw_frames_ctx->data == ctx->hw_frames_ctx->data) ||
>>> +                    // check if the same hw_device_ctx as used in initialization
>>> +                    (ctx->hw_device_ctx && ((AVHWFramesContext*)frame->hw_frames_ctx->data)->device_ctx ==
>>> +                    (AVHWDeviceContext*)ctx->hw_device_ctx->data)
>>> +                )) {
>>> +                    ID3D11Texture2D* texture = (ID3D11Texture2D*)frame->data[0]; // actual texture
>>> +                    int index = (int)(size_t)frame->data[1]; // index is a slice in the texture array - set to tell AMF which slice to use
>>> +                    texture->lpVtbl->SetPrivateData(texture, &AMFTextureArrayIndexGUID, sizeof(index), &index);
>>> +
>>> +                    res = ctx->context->pVtbl->CreateSurfaceFromDX11Native(ctx->context, texture, &surface, NULL); // wrap to AMF surface
>>> +                    surface->pVtbl->SetCrop(surface, 0, 0, frame->width, frame->height); // decode surfaces are vertically aligned by 16; tell AMF the real size
>>> +                    surface->pVtbl->SetPts(surface, frame->pts);
>>> +                } else {
>>> +                    res = ctx->context->pVtbl->AllocSurface(ctx->context, AMF_MEMORY_HOST, ctx->format, avctx->width, avctx->height, &surface);
>>> +                    AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_BUG, "AllocSurface() failed with error %d", res);
>>> +                    amf_copy_surface(avctx, frame, surface);
>>> +                }
>>> +            }
>>> +            // encode
>>> +            res = ctx->encoder->pVtbl->SubmitInput(ctx->encoder, (AMFData*)surface);
>>> +            if (res == AMF_INPUT_FULL) { // handle full queue
>>> +                av_usleep(1000); // input queue is full: wait, poll and submit surface again
>>
>> Is there really no way in the API to wait for this properly?
>>
> 
> The AMF runtime is designed without threads and sleeps inside. It is up to
> the application to poll the output and in this way make space in the input HW queue.
> But if the input queue is really full it does make sense to sleep and
> continue polling to avoid unnecessary CPU burn.

Some APIs have a way to wait for an event for this rather than polling.  Polling like this will wake the CPU pointlessly and waste power.  (And yes, I know lots of other places do it, but if you have a proper API for this then please use it.)

>>> +            } else {
>>> +                surface->pVtbl->Release(surface);
>>> +                surface = NULL;
>>> +                AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "SubmitInput() failed with error %d", res);
>>> +                submitted = 1;
>>> +            }
>>> +        }
>>> +        // poll results
>>> +        if (!data) {
>>> +            res = ctx->encoder->pVtbl->QueryOutput(ctx->encoder, &data);
>>> +            if (data) {
>>> +                AMFBuffer* buffer;
>>> +                AMFGuid guid = IID_AMFBuffer();
>>> +                data->pVtbl->QueryInterface(data, &guid, (void**)&buffer); // query for buffer interface
>>> +                ret = amf_copy_buffer(avctx, pkt, buffer);
>>> +                if (!ret)
>>> +                    *got_packet = 1;
>>> +                buffer->pVtbl->Release(buffer);
>>> +                data->pVtbl->Release(data);
>>> +                if (ctx->eof) {
>>> +                    submitted = true; // we are in the drain state - no submissions
>>> +                }
>>> +            } else if (res == AMF_EOF) {
>>> +                submitted = true; // drain complete
>>> +            } else {
>>> +                if (!submitted) {
>>> +                    av_usleep(1000); // wait and poll again
>>> +                }
>>> +            }
>>> +        }
>>> +    }
>>
>> I suspect this setup is not actually going to follow the constraints of the
>> deprecated encode2().  Given the API here, I think you would be much better
>> off writing with send_frame()/receive_packet().
> 
> I considered this, but without a thread that would call receive_packet() the implementation
> would fall into the same pattern as it is now, but with an additional queue of ready outputs.

See the documentation - you can return EAGAIN from send_frame() when a packet is available, so no external queue is required.
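
Roughly (an untested sketch only, reusing the Drain()/SubmitInput()/QueryOutput() calls from your patch; the surface and packet handling is elided):

    static int amf_send_frame(AVCodecContext *avctx, const AVFrame *frame)
    {
        AmfContext *ctx = avctx->priv_data;
        AMFSurface *surface;
        AMF_RESULT  res;

        if (!frame) // flush requested
            return ctx->encoder->pVtbl->Drain(ctx->encoder) == AMF_OK ?
                   0 : AVERROR(EAGAIN);

        // ... wrap or copy the frame into an AMF surface as now ...

        res = ctx->encoder->pVtbl->SubmitInput(ctx->encoder, (AMFData*)surface);
        surface->pVtbl->Release(surface);
        if (res == AMF_INPUT_FULL)
            return AVERROR(EAGAIN); // caller must drain output first
        return res == AMF_OK ? 0 : AVERROR_UNKNOWN;
    }

    static int amf_receive_packet(AVCodecContext *avctx, AVPacket *pkt)
    {
        AmfContext *ctx  = avctx->priv_data;
        AMFData    *data = NULL;

        if (ctx->encoder->pVtbl->QueryOutput(ctx->encoder, &data) == AMF_EOF)
            return AVERROR_EOF;
        if (!data)
            return AVERROR(EAGAIN); // nothing ready yet
        // ... QueryInterface for AMFBuffer and copy into pkt as now ...
        data->pVtbl->Release(data);
        return 0;
    }

No sleep loop, no extra queue, and the user decides when to block.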

>>> +    AMFSize             framesize = AMFConstructSize(avctx->width, avctx->height);
>>> +    AMFRate             framerate = AMFConstructRate(avctx->time_base.den, avctx->time_base.num * avctx->ticks_per_frame);
>>
>> Is VFR encoding not supported?
>>
> 
> The encoder uses the frame rate only for rate control. It does not take into
> account individual frame durations in the VFR case.

Sure.  (That should probably be documented somewhere.)

>>> +    if (avctx->rc_initial_buffer_occupancy != 0) {
>>> +        int percent = avctx->rc_buffer_size * 64 / avctx->rc_initial_buffer_occupancy;
>>> +        if (percent > 64)
>>> +            percent = 64;
>>
>> ???
>>
> 
> This is an attempt to translate to the 1-64 range exposed by the encoder.

1-64, yet the variable is called percent?  Sounds very suspicious.  If that is indeed the behaviour, please add a comment explaining it.
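
If the property really is initial fullness in 64ths of the buffer size, then the ratio above also looks the wrong way up to me - I'd have expected something like (sketch; AMF_VIDEO_ENCODER_INITIAL_VBV_BUFFER_FULLNESS is my guess at the property being set):

    if (avctx->rc_initial_buffer_occupancy != 0) {
        // initial buffer fullness in 64ths of rc_buffer_size, clamped to 1-64
        int fullness = avctx->rc_initial_buffer_occupancy * 64 /
                       avctx->rc_buffer_size;
        if (fullness < 1)
            fullness = 1;
        else if (fullness > 64)
            fullness = 64;
        AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
            AMF_VIDEO_ENCODER_INITIAL_VBV_BUFFER_FULLNESS, fullness);
    }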

>>> +    // Bitrate
>>> +    AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder, AMF_VIDEO_ENCODER_TARGET_BITRATE, avctx->bit_rate);
>>> +    if (ctx->rate_control_mode == AMF_VIDEO_ENCODER_RATE_CONTROL_METHOD_CBR) {
>>> +        AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder, AMF_VIDEO_ENCODER_PEAK_BITRATE, avctx->bit_rate);
>>> +    } else {
>>> +        int rc_max_rate = avctx->rc_max_rate >= avctx->bit_rate ? avctx->rc_max_rate : avctx->bit_rate * 13 / 10;
>>
>> Where does 13/10 come from?
>>
> 
> For best results rc_max_rate should be bigger than bit_rate for VBR. For CBR it is ignored.
> What is a better way to correct an unset parameter?

Max rate should only be constrained by the HRD buffering in this case.  Are you sure this isn't handled internally if you don't supply the parameter at all?  If not, maybe supply some sort of infinity so that it doesn't constrain anything.
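
For example (sketch): only set a peak when the user supplied one, and otherwise leave the property alone:

    if (avctx->rc_max_rate) {
        AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
            AMF_VIDEO_ENCODER_PEAK_BITRATE, avctx->rc_max_rate);
    }
    // else: don't set AMF_VIDEO_ENCODER_PEAK_BITRATE at all and let the
    // encoder derive it internally (if it does) - an arbitrary 13/10
    // scaling invents a constraint the user never asked for.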

>>> +
>>> +    // fill extradata
>>> +    AMFVariantInit(&var);
>>
>> Can this fail?
>>
> Not if var is cleared with 0.

I would prefer that return values are checked for all functions which return them, in case of future API changes.  If the function can't return an error, why isn't it void?
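
I.e. something like (sketch, assuming AMFVariantInit() returns AMF_RESULT like the other AMF entry points):

    res = AMFVariantInit(&var);
    AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN,
                        "AMFVariantInit() failed with error %d", res);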

>>> +// encoder connected with D3D11 HW accelerator
>>> +AVCodec ff_h264_amf_d3d11va_encoder = {
>>> +    .name = "h264_amf_d3d11va",
>>> +    .long_name = NULL_IF_CONFIG_SMALL("AMD AMF H.264 Encoder with d3d11va"),
>>> +    .type = AVMEDIA_TYPE_VIDEO,
>>> +    .id = AV_CODEC_ID_H264,
>>> +    .init = amf_encode_init_h264,
>>> +    .encode2 = ff_amf_encode_frame,
>>> +    .close = ff_amf_encode_close,
>>> +    .priv_data_size = sizeof(AmfContext),
>>> +    .priv_class = &h264_amf_d3d11va_class,
>>> +    .defaults = defaults,
>>> +    .capabilities = AV_CODEC_CAP_DELAY,
>>> +    .caps_internal = FF_CODEC_CAP_INIT_CLEANUP,
>>> +    .pix_fmts = ff_amf_pix_fmts,
>>> +};
>>
>> As above, why does this separate (identical) instance exist?
> 
> See explanation about accelerator handling for encoders above.
> 
>>> +static const AVOption options[] = {
>>> +    { "usage",          "Set the encoding usage",             OFFSET(usage),
>> AV_OPT_TYPE_INT,   { .i64 =
>> AMF_VIDEO_ENCODER_HEVC_USAGE_TRANSCONDING },
>> AMF_VIDEO_ENCODER_HEVC_USAGE_TRANSCONDING,
>> AMF_VIDEO_ENCODER_HEVC_USAGE_WEBCAM, VE, "usage" },
>>> +    { "transcoding",    "", 0, AV_OPT_TYPE_CONST, { .i64 =
>> AMF_VIDEO_ENCODER_HEVC_USAGE_TRANSCONDING },         0, 0, VE,
>> "usage" },
>>> +    { "ultralowlatency","", 0, AV_OPT_TYPE_CONST, { .i64 =
>> AMF_VIDEO_ENCODER_HEVC_USAGE_ULTRA_LOW_LATENCY },    0, 0, VE,
>> "usage" },
>>> +    { "lowlatency",     "", 0, AV_OPT_TYPE_CONST, { .i64 =
>> AMF_VIDEO_ENCODER_HEVC_USAGE_LOW_LATENCY },          0, 0, VE,
>> "usage" },
>>> +    { "webcam",         "", 0, AV_OPT_TYPE_CONST, { .i64 =
>> AMF_VIDEO_ENCODER_HEVC_USAGE_WEBCAM },               0, 0, VE, "usage" },
>>
>> Could some of this be in common with the H.264 encoder?  (Maybe in the
>> header?)
>>
> 
> I tried to keep the H264 and HEVC parameters completely separate. I was asked by the codec
> team to do so in the AMF API and did the same here.

Ok, sure.


Some more random questions:

* How can we supply colour information to the codecs?  (VUI colour_primaries/transfer_characteristics/matrix_coefficients/chroma_sample_loc_type.)

* Does timing SEI (buffering_period/pic_timing) get generated for bitrate-targeted modes?  Is there any way to control that?

* When you say this will be used on Linux, is that going to be in Mesa (integrated with VAAPI as the Windows code is with D3D?) or will it be something in the proprietary drivers?

Thanks,

- Mark

