[FFmpeg-devel] [RFC] swscale modernization proposal

Niklas Haas ffmpeg at haasn.xyz
Sat Jun 22 16:13:34 EEST 2024


Hey,

As some of you know, I got contracted (by STF 2024) to work on improving
swscale, over the course of the next couple of months. I want to share my
current plans and gather feedback + measure sentiment.

## Problem statement

The two issues I'd like to focus on for now are:

1. Lack of support for a lot of modern formats and conversions (HDR, ICtCp,
   IPTc2, BT.2020-CL, XYZ, YCgCo, Dolby Vision, ...)
2. Complicated context management, with cascaded contexts, threading, stateful
   configuration, multi-step init procedures, etc.; and related bugs

In order to make these feasible, some amount of internal re-organization of
duties inside swscale is prudent.

## Proposed approach

The first step is to create a new API, which will (tentatively) live in
<libswscale/avscale.h>. This API will initially start off as a near-copy of the
current swscale public API, but with the major difference that I want it to be
state-free and only access metadata in terms of AVFrame properties. So there
will be no independent configuration of the input chroma location etc. like
there is currently, and no need to re-configure or re-init the context when
feeding it frames with different properties. The goal is for users to be able
to just feed it AVFrame pairs and have it internally cache expensive
pre-processing steps as needed. Finally, avscale_* should ultimately also
support hardware frames directly, in which case it will dispatch to some
equivalent of scale_vulkan/vaapi/cuda or possibly even libplacebo. (But I will
defer this to a future milestone)
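
To illustrate the intended usage model, a rough sketch (the avscale_* names
are taken from the attached draft header; everything else is illustrative
pseudocode, not a working example):

```
    AVScaleContext *ctx = avscale_alloc_context();
    if (!ctx)
        return AVERROR(ENOMEM);

    /* All metadata (formats, colorspace, chroma location, ...) comes from
     * the frames themselves; no separate configuration or re-init step. */
    int ret = avscale_frame(ctx, dst_frame, src_frame);

    /* Subsequent calls may freely mix frames with different properties;
     * the context re-caches expensive pre-processing internally. */

    avscale_free_context(&ctx);
```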

After this API is established, I want to start expanding the functionality in
the following manner:

### Phase 1

For basic operation, avscale_* will simply dispatch to a sequence of swscale_*
invocations. In the simplest case, it will invoke swscale directly, with
minimal overhead. In more advanced cases, it might resolve to a *sequence* of
swscale operations, with other operations (e.g. colorspace conversions a la
vf_colorspace) mixed in.
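
As a toy sketch of what such a dispatch could look like internally (all names
here are hypothetical, not actual swscale symbols): a row is run through an
ordered chain of passes, where each pass may stand for a swscale invocation or
an auxiliary conversion:

```c
/* Hypothetical pass type: transforms one row segment in place. */
typedef void (*pass_fn)(float *seg, int n);

/* Two toy passes standing in for e.g. a transfer function and a scale step. */
static void to_linear(float *seg, int n)
{
    for (int i = 0; i < n; i++)
        seg[i] = seg[i] * seg[i]; /* toy gamma 2.0 linearization */
}

static void scale_half(float *seg, int n)
{
    for (int i = 0; i < n; i++)
        seg[i] *= 0.5f;
}

/* Run a sequence of passes over one row, the way avscale_* might chain
 * swscale and vf_colorspace-like operations. */
static void run_chain(const pass_fn *passes, int n_passes, float *row, int width)
{
    for (int p = 0; p < n_passes; p++)
        passes[p](row, width);
}
```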

This will allow us to gain new functionality in a minimally invasive way, and
will let API users start porting to the new API. This will also serve as a good
"selling point" for the new API, allowing us to hopefully break up the legacy
swscale API afterwards.

### Phase 2

After this is working, I want to cleanly separate swscale into two distinct
components:

1. vertical/horizontal scaling
2. input/output conversions

Right now, these operations both live inside the main SwsContext, even though
they are conceptually orthogonal. Input handling is done entirely by the
abstract callbacks lumToYV12 etc., while output conversion is currently
"merged" with vertical scaling (yuv2planeX etc.).

I want to cleanly separate these components so they can live inside independent
contexts, and be considered as semantically distinct steps. (In particular,
there should ideally be no more "unscaled special converters", instead this can
be seen as a special case where there simply is no vertical/horizontal scaling
step)

The idea is for the colorspace conversion layer to sit in between the
input/output converters and the horizontal/vertical scalers. This all would be
orchestrated by the avscale_* abstraction.
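
To make the proposed separation concrete, a self-contained toy sketch (names
and signatures are invented for illustration, not the actual internals): each
stage is an independent callback, and "unscaled" operation becomes the
degenerate case where the scaling stage is a plain copy:

```c
/* Hypothetical stages: input conversion, colorspace conversion,
 * horizontal scaling, output conversion. */
typedef struct Pipeline {
    void (*unpack)(float *dst, const unsigned char *src, int n);
    void (*csc)(float *buf, int n);
    void (*hscale)(float *dst, const float *src, int n);
    void (*pack)(unsigned char *dst, const float *src, int n);
} Pipeline;

/* Process one row through all stages, using small temporaries. */
static void run_row(const Pipeline *p, unsigned char *dst,
                    const unsigned char *src, int width)
{
    float a[256], b[256]; /* assumes width <= 256 for this sketch */
    p->unpack(a, src, width);
    p->csc(a, width);
    p->hscale(b, a, width);
    p->pack(dst, b, width);
}

/* Toy stage implementations for demonstration. */
static void unpack_id(float *dst, const unsigned char *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}

static void csc_offset(float *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] += 16.0f; /* stand-in for a real matrix/LUT */
}

static void copy_scale(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i]; /* the "unscaled" special case: no resampling */
}

static void pack_clip(unsigned char *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++) {
        float v = src[i] < 0.0f ? 0.0f : src[i];
        dst[i] = v > 255.0f ? 255 : (unsigned char) v;
    }
}
```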

## Implementation details

To avoid performance loss from separating "merged" functions into their
constituents, care needs to be taken such that all intermediate data, in
addition to all involved look-up tables, will fit comfortably inside the L1
cache. The approach I propose, which is also (afaict) used by zscale, is to
loop over line segments, applying each operation in sequence, on a small
temporary buffer.

e.g.

static void hscale_row(pixel *dst, const pixel *src, int img_width)
{
    const int SIZE = 256; // or some other small-ish figure, possibly a design
                          // constant of the API so that SIMD implementations
                          // can be appropriately unrolled

    pixel tmp[SIZE];
    for (int i = 0; i < img_width; i += SIZE) {
        const int pixels = FFMIN(SIZE, img_width - i);

        { /* inside the read input callback */
            unpack_input(tmp, src, pixels);
            // the amount of separation here will depend on the performance
            apply_matrix3x3(tmp, yuv2rgb, pixels);
            apply_lut3x1d(tmp, gamma_lut, pixels);
            ...
        }

        hscale(dst, tmp, filter, pixels);

        src += pixels;
        dst += scale_factor(pixels);
    }
}

This function can then output rows into a ring buffer for use inside the
vertical scaler, after which the same procedure happens (in reverse) for the
final output pass.
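
The ring buffer interaction could look roughly like this (a self-contained
toy; RowRing, TAPS and friends are invented names): the horizontal scaler
pushes finished rows into a small ring, and the vertical filter computes a
weighted sum over the window of the most recent TAPS rows:

```c
#define TAPS 3  /* vertical filter length, for illustration */
#define W    4  /* row width, for illustration */

/* Ring buffer holding the last TAPS horizontally-scaled rows. */
typedef struct RowRing {
    float rows[TAPS][W];
    int   next; /* slot that the next produced row overwrites */
} RowRing;

static void ring_push(RowRing *r, const float *row)
{
    for (int x = 0; x < W; x++)
        r->rows[r->next][x] = row[x];
    r->next = (r->next + 1) % TAPS;
}

/* Vertical pass: weighted sum over the TAPS rows in the ring,
 * oldest row first. */
static void vscale(const RowRing *r, const float w[TAPS], float *out)
{
    for (int x = 0; x < W; x++) {
        float acc = 0.0f;
        for (int t = 0; t < TAPS; t++)
            acc += w[t] * r->rows[(r->next + t) % TAPS][x];
        out[x] = acc;
    }
}
```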

We may also want to limit the maximum row size handled by the horizontal
scaler, to allow arbitrarily large input images.

## Comments / feedback?

Does the above approach seem reasonable? How do people feel about introducing
a new API vs. trying to hammer the existing API into the shape I want it to be?

I've attached an example of what <avscale.h> could end up looking like. If
there is broad agreement on this design, I will move on to an implementation.
-------------- next part --------------
/*
 * Copyright (C) 2024 Niklas Haas
 *
 * This file is part of FFmpeg.
 *
 * FFmpeg is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 *
 * FFmpeg is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with FFmpeg; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

#ifndef SWSCALE_AVSCALE_H
#define SWSCALE_AVSCALE_H

/**
 * @file
 * @ingroup libsws
 * Higher-level wrapper around libswscale + related libraries, which is
 * capable of handling more advanced colorspace transformations.
 */

#include "libavutil/frame.h"
#include "libavutil/log.h"

/**
 * Main external API structure. New fields can be added to the end with
 * minor version bumps. Removal, reordering and changes to existing fields
 * require a major version bump. sizeof(AVScaleContext) is not part of the ABI.
 */
typedef struct AVScaleContext {
    const AVClass *av_class;

    /**
     * Private context used for internal data.
     */
    struct AVScaleInternal *internal;

    /**
     * Private data of the user, can be used to carry app specific stuff.
     */
    void *opaque;

    /**
     * Bitmask of AV_SCALE_* flags.
     */
    int64_t flags;

    /**
     * Bitmask of SWS_* flags.
     */
    int sws_flags;

    /**
     * Bitmask of SWS_DITHER_* flags.
     */
    int sws_dither;
} AVScaleContext;

enum {
    /**
     * Skip linearization, speeds up downscaling at the cost of quality.
     */
    AV_SCALE_NOLINEAR   = 1 << 0,
    /**
     * Force full internal conversion to RGB. May improve the quality of
     * certain operations, at the cost of performance.
     */
    AV_SCALE_FULL_RGB   = 1 << 1,
    /**
     * Perform perceptual conversion between colorspaces instead of clipping.
     */
    AV_SCALE_PERCEPTUAL = 1 << 2,

    // ...
};

/**
 * Allocate an AVScaleContext and set its fields to default values. The
 * resulting struct should be freed with avscale_free_context().
 */
AVScaleContext *avscale_alloc_context(void);

/**
 * Free the scaling context and everything associated with it, and write NULL
 * to the provided pointer.
 */
void avscale_free_context(AVScaleContext **ctx);

/**
 * Get the AVClass for AVScaleContext. It can be used in combination with
 * AV_OPT_SEARCH_FAKE_OBJ for examining options.
 *
 * @see av_opt_find().
 */
const AVClass *avscale_get_class(void);

/**
 * Scale source data from `src` and write the output to `dst`.
 *
 * @param ctx   The scaling context.
 * @param dst   The destination frame.
 *
 *              The data buffers may either be already allocated by the caller
 *              or left clear, in which case they will be allocated by the
 *              scaler. The latter may have performance advantages - e.g. in
 *              certain cases some output planes may be references to input
 *              planes, rather than copies.
 *
 * @param src   The source frame. If the data buffers are set to NULL, then
 *              this function performs no conversion. It will instead merely
 *              initialize internal state that *would* be required to perform
 *              the operation, as well as returning the correct error code for
 *              unsupported frame combinations.
 *
 * @return 0 on success, a negative AVERROR code on failure.
 */
int avscale_frame(AVScaleContext *ctx, AVFrame *dst, const AVFrame *src);

#endif /* SWSCALE_AVSCALE_H */

