[FFmpeg-devel] [PATCH] lavfi: add concat filter.

Stefano Sabatini stefasab at gmail.com
Thu Jul 19 20:14:46 CEST 2012


On date Wednesday 2012-07-18 00:48:55 +0200, Nicolas George encoded:
> 
> Signed-off-by: Nicolas George <nicolas.george at normalesup.org>
> ---
>  Changelog                |    1 +
>  doc/filters.texi         |   73 +++++++++
>  libavfilter/Makefile     |    1 +
>  libavfilter/allfilters.c |    1 +
>  libavfilter/f_concat.c   |  409 ++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 485 insertions(+)
>  create mode 100644 libavfilter/f_concat.c
> 
> 
> Note: do not try it with -filter_complex yet, or be sure to only use very
> short files that can be decompressed completely in memory. Also, it will not
> work without the "ffmpeg: probe buffersinks once more after EOF" patch.
> 
> 
> diff --git a/Changelog b/Changelog
> index e7b8a02..073b851 100644
> --- a/Changelog
> +++ b/Changelog
> @@ -29,6 +29,7 @@ version next:
>  - new option: -progress
>  - 3GPP Timed Text decoder
>  - GeoTIFF decoder support
> +- concat filter
>  
>  
>  version 0.11:
> diff --git a/doc/filters.texi b/doc/filters.texi
> index 4a6c092..cd2e7a8 100644
> --- a/doc/filters.texi
> +++ b/doc/filters.texi
> @@ -3988,6 +3988,79 @@ tools.
>  
>  Below is a description of the currently available transmedia filters.
>  
> +@section concat
> +
> +Concatenate audio and video streams, joining them together one after the
> +other.
> +

> +The filter works on segments of synchronized video and audio streams. All
> +segments must have the same number of streams of each type, and that will
> +also be the number of streams at output.
> +
> +The filter accepts the following named parameters:
> +@table @option
> +
> +@item s
> +Set the number of segments. Default is 2.

Maybe "n", so there is no conflict when Subtitles will be added (if
ever). Also having long aliases may help script readability.
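
For example, an alias can be added simply by duplicating the entry with
the same offset and default (just a sketch of the idea):

    { "n",        "specify the number of segments", OFFSET(nb_seg),
      AV_OPT_TYPE_INT, { .dbl = 2 }, 2, INT_MAX },
    { "segments", "specify the number of segments", OFFSET(nb_seg),
      AV_OPT_TYPE_INT, { .dbl = 2 }, 2, INT_MAX },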

> +
> +@item v
> +Set the number of video streams. Default is 1.

output video streams, which is also the number of video streams in each
segment.

> +
> +@item a
> +Set the number of audio streams. Default is 0.

output audio streams, which is also the number of audio streams in each
segment.

> +
> +@end table
> +
> +The filter has @var{v}+@var{a} outputs: first @var{v} video outputs, then
> +@var{a} audio outputs.
> +
> +There are @var{s}×(@var{v}+@var{a}) inputs: first the inputs for the first
> +segment, in the same order as the outputs, then the inputs for the second
> +segment, etc.
> +
> +Related streams do not always have exactly the same duration, for various
> +reasons including codec frame size or sloppy authoring. For that reason,
> +related synchronized streams (e.g. a video and its audio track) should be
> +concatenated at once. The concat filter will use the duration of the longest
> +stream in each segment (except the last one), and if necessary pad shorter
> +audio streams with silence.
> +
> +For this filter to work correctly, all segments must start at timestamp 0.
> +
> +All corresponding streams must have the same parameters in all segments; the
> +filtering system will automatically select a common pixel format for video
> +streams, and a common sample format, sample rate and channel layout for
> +audio streams, but other settings, such as resolution, must be converted
> +explicitly by the user.
> +
> +Different frame rates are acceptable but will result in variable frame rate
> +at output; be sure to configure the output file to handle it.
> +
> +Examples:
> +@itemize
> +@item
> +Concatenate an opening, an episode and an ending, all in bilingual version
> +(video in stream 0, audio in streams 1 and 2):
> +@example
> +ffmpeg -i opening.mkv -i episode.mkv -i ending.mkv -filter_complex \
> +  '[0:0] [0:1] [0:2] [1:0] [1:1] [1:2] [2:0] [2:1] [2:2]
> +   concat=s=3:v=1:a=2 [v] [a1] [a2]' \
> +  -map '[v]' -map '[a1]' -map '[a2]' output.mkv
> +@end example
> +
> +@item
> +Concatenate two parts, handling audio and video separately, using the
> +(a)movie sources, and adjusting the resolution:
> +@example
> +movie=part1.mp4, scale=512:288 [v1] ; amovie=part1.mp4 [a1] ;
> +movie=part2.mp4, scale=512:288 [v2] ; amovie=part2.mp4 [a2] ;
> +[v1] [v2] concat [outv] ; [a1] [a2] concat=v=0:a=1 [outa]
> +@end example

Why not:
movie=part1.mp4, scale=512:288 [v1] ; amovie=part1.mp4 [a1] ;
movie=part2.mp4, scale=512:288 [v2] ; amovie=part2.mp4 [a2] ;
[v1][a1] [v2][a2] concat=v=1:a=1 [outv][outa]
?

> +Note that a desync will happen at the stitch if the audio and video streams
> +do not have exactly the same duration in the first file.
> +
> +@end itemize
> +
>  @section showwaves
>  
>  Convert input audio to a video output, representing the samples waves.
> diff --git a/libavfilter/Makefile b/libavfilter/Makefile
> index b094f59..642a105 100644
> --- a/libavfilter/Makefile
> +++ b/libavfilter/Makefile
> @@ -197,6 +197,7 @@ OBJS-$(CONFIG_MP_FILTER) += libmpcodecs/vf_yvu9.o
>  OBJS-$(CONFIG_MP_FILTER) += libmpcodecs/pullup.o
>  
>  # transmedia filters
> +OBJS-$(CONFIG_CONCAT_FILTER)                 += f_concat.o
>  OBJS-$(CONFIG_SHOWWAVES_FILTER)              += avf_showwaves.o

maybe I should rename avf_showwaves -> f_showwaves (or the other way
around, f_concat -> avf_concat).

>  
>  TOOLS     = graph2dot
> diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
> index 706405e..cae4c99 100644
> --- a/libavfilter/allfilters.c
> +++ b/libavfilter/allfilters.c
> @@ -134,6 +134,7 @@ void avfilter_register_all(void)
>      REGISTER_FILTER (NULLSINK,    nullsink,    vsink);
>  
>      /* transmedia filters */
> +    REGISTER_FILTER (CONCAT,      concat,      avf);
>      REGISTER_FILTER (SHOWWAVES,   showwaves,   avf);
>  
>      /* those filters are part of public or internal API => registered
> diff --git a/libavfilter/f_concat.c b/libavfilter/f_concat.c
> new file mode 100644
> index 0000000..7a8912f
> --- /dev/null
> +++ b/libavfilter/f_concat.c
> @@ -0,0 +1,409 @@
> +/*
> + * Copyright (c) 2012 Nicolas George
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public License
> + * along with FFmpeg; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +/**
> + * @file
> + * concat audio-video filter
> + */
> +
> +#include "libavutil/avassert.h"
> +#include "libavutil/opt.h"
> +#include "avfilter.h"
> +#define FF_BUFQUEUE_SIZE 256
> +#include "bufferqueue.h"
> +#include "internal.h"
> +#include "video.h"
> +#include "audio.h"
> +
> +#define TYPE_ALL 2
> +
> +typedef struct {
> +    const AVClass *class;

> +    unsigned nb_streams[TYPE_ALL];

Nit: this is ambiguous, what about nb_segment_streams (or simply add a
///< doxy comment)?

> +    unsigned nb_seg;

nit: nb_segments

> +    unsigned cur_idx;
> +    int64_t delta_ts;

please explain in doxy what these fields mean

> +    unsigned nb_in_active;

comment: number of active input segments, or number of active inputs?

Also, I suppose a segment is considered "active" as long as none of its
corresponding streams has terminated yet, right?
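
Something along these lines (my guesses at the semantics, please correct
if wrong):

    unsigned nb_streams[TYPE_ALL]; ///< number of output streams of each type (also per segment)
    unsigned nb_segments;          ///< number of segments ("s" option)
    unsigned cur_idx;              ///< index of the first input of the current segment
    int64_t delta_ts;              ///< timestamp offset applied to the frames of the current segment
    unsigned nb_in_active;         ///< number of active inputs in the current segment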

> +    struct concat_in {
> +        int64_t pts;
> +        int64_t nb_frames;
> +        unsigned eof;
> +        struct FFBufQueue queue;
> +    } *in;
> +} ConcatContext;
> +
> +#define OFFSET(x) offsetof(ConcatContext, x)
> +
> +static const AVOption concat_options[] = {
> +    { "s", "specify the number of segments", OFFSET(nb_seg),
> +      AV_OPT_TYPE_INT, { .dbl = 2 }, 2, INT_MAX },
> +    { "v", "specify the number of video streams",
> +      OFFSET(nb_streams[AVMEDIA_TYPE_VIDEO]),
> +      AV_OPT_TYPE_INT, { .dbl = 1 }, 1, INT_MAX },
> +    { "a", "specify the number of audio streams",
> +      OFFSET(nb_streams[AVMEDIA_TYPE_AUDIO]),
> +      AV_OPT_TYPE_INT, { .dbl = 0 }, 0, INT_MAX },
> +    { 0 }
> +};

Nit+: one line per option?
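
i.e. (sketch):

static const AVOption concat_options[] = {
    { "s", "specify the number of segments",      OFFSET(nb_seg),                          AV_OPT_TYPE_INT, { .dbl = 2 }, 2, INT_MAX },
    { "v", "specify the number of video streams", OFFSET(nb_streams[AVMEDIA_TYPE_VIDEO]), AV_OPT_TYPE_INT, { .dbl = 1 }, 1, INT_MAX },
    { "a", "specify the number of audio streams", OFFSET(nb_streams[AVMEDIA_TYPE_AUDIO]), AV_OPT_TYPE_INT, { .dbl = 0 }, 0, INT_MAX },
    { 0 }
};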

> +
> +AVFILTER_DEFINE_CLASS(concat);
> +
> +static av_cold void uninit(AVFilterContext *ctx)

I'd find it more logical to have init before uninit.

> +{
> +    ConcatContext *cat = ctx->priv;
> +    unsigned i;
> +
> +    for (i = 0; i < ctx->nb_inputs; i++) {
> +        av_freep(&ctx->input_pads[i].name);
> +        ff_bufqueue_discard_all(&cat->in[i].queue);
> +    }
> +    for (i = 0; i < ctx->nb_outputs; i++)
> +        av_freep(&ctx->output_pads[i].name);
> +    av_free(cat->in);
> +}
> +
> +static int query_formats(AVFilterContext *ctx)
> +{
> +    ConcatContext *cat = ctx->priv;
> +    unsigned type, nb_str, idx0 = 0, idx, str, seg;

Nit++: "str" is usually associated to strings, I suggest to
nb_str->nb_streams and str->stream_idx and move them to the internal
loop where they're used.

> +    AVFilterFormats *formats, *rates;
> +    AVFilterChannelLayouts *layouts;
> +
> +    for (type = 0; type < TYPE_ALL; type++) {
> +        nb_str = cat->nb_streams[type];
> +        for (str = 0; str < nb_str; str++) {
> +            idx = idx0;

maybe add a comment here along the lines of: set the output stream formats

> +            formats = ff_all_formats(AVMEDIA_TYPE_VIDEO);
> +            if (!formats)
> +                return AVERROR(ENOMEM);
> +            ff_formats_ref(formats, &ctx->outputs[idx]->in_formats);

Uhm... why is ff_all_formats(AVMEDIA_TYPE_VIDEO) not under an if
(VIDEO) block? As it stands, the audio streams also get the list of
video formats.
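
A possible fix, relying on TYPE_ALL iterating exactly over
AVMEDIA_TYPE_VIDEO (0) and AVMEDIA_TYPE_AUDIO (1), so "type" can be
passed directly (sketch, untested):

            formats = ff_all_formats(type);
            if (!formats)
                return AVERROR(ENOMEM);
            ff_formats_ref(formats, &ctx->outputs[idx]->in_formats);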

> +            if (type == AVMEDIA_TYPE_AUDIO) {
> +                rates = ff_all_samplerates();
> +                layouts = ff_all_channel_layouts();
> +                if (!rates || !layouts)
> +                    return AVERROR(ENOMEM);

leak if rates && !layouts
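
One way to avoid it is to reference each list as soon as it is
allocated, so a later allocation failure cannot leak an unreferenced
list (sketch, untested):

                rates = ff_all_samplerates();
                if (!rates)
                    return AVERROR(ENOMEM);
                ff_formats_ref(rates, &ctx->outputs[idx]->in_samplerates);
                layouts = ff_all_channel_layouts();
                if (!layouts)
                    return AVERROR(ENOMEM);
                ff_channel_layouts_ref(layouts, &ctx->outputs[idx]->in_channel_layouts);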

> +                ff_formats_ref(rates, &ctx->outputs[idx]->in_samplerates);
> +                ff_channel_layouts_ref(layouts, &ctx->outputs[idx]->in_channel_layouts);
> +            }

comment: set the formats for each corresponding segment input

> +            for (seg = 0; seg < cat->nb_seg; seg++) {
> +                ff_formats_ref(formats, &ctx->inputs[idx]->out_formats);
> +                if (type == AVMEDIA_TYPE_AUDIO) {
> +                    ff_formats_ref(rates, &ctx->inputs[idx]->out_samplerates);
> +                    ff_channel_layouts_ref(layouts, &ctx->inputs[idx]->out_channel_layouts);
> +                }
> +                idx += ctx->nb_outputs;
> +            }
> +            idx0++;
> +        }
> +    }
> +    return 0;
> +}
> +
> +static unsigned find_link(AVFilterLink *link, AVFilterLink **links, unsigned n)
> +{
> +    unsigned i;
> +
> +    for (i = 0; i < n; i++)
> +        if (link == links[i])
> +            return i;
> +    av_assert0(!"link found");
> +}
> +

> +static int config_output(AVFilterLink *outlink)
> +{
> +    AVFilterContext *ctx = outlink->src;
> +    ConcatContext *cat   = ctx->priv;
> +    unsigned out_no = find_link(outlink, ctx->outputs, ctx->nb_outputs);
> +    unsigned in_no  = out_no, seg;
> +    AVFilterLink *inlink = ctx->inputs[in_no];
> +
> +    /* enhancement: find a common one */
> +    outlink->time_base           = AV_TIME_BASE_Q;
> +    outlink->w                   = inlink->w;
> +    outlink->h                   = inlink->h;
> +    outlink->sample_aspect_ratio = inlink->sample_aspect_ratio;
> +    outlink->format              = inlink->format;
> +    for (seg = 1; seg < cat->nb_seg; seg++) {
> +        inlink = ctx->inputs[in_no += ctx->nb_outputs];

> +        /* possible enhancement: unsafe mode, do not check */

what would be the advantage of such a mode?

> +        /* possible enhancement: be more specific */

Yes please do (see below)

> +        if (outlink->w                       != inlink->w                       ||
> +            outlink->h                       != inlink->h                       ||
> +            outlink->sample_aspect_ratio.num != inlink->sample_aspect_ratio.num ||
> +            outlink->sample_aspect_ratio.den != inlink->sample_aspect_ratio.den) {

> +            av_log(ctx, AV_LOG_ERROR, "Input %s has different parameters.\n",
> +                   ctx->input_pads[in_no].name);

I suggest: "Input link %s parameters size:%dx%d sar:%d/%d do not match output link %s parameters size:%dx%d sar:%d/%d\n"

Nit++: no need for the final dot, inconsistent
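
Expanded, that would be something like (sketch):

            av_log(ctx, AV_LOG_ERROR,
                   "Input link %s parameters (size %dx%d, SAR %d:%d) do not "
                   "match the corresponding output link %s parameters "
                   "(%dx%d, SAR %d:%d)\n",
                   ctx->input_pads[in_no].name, inlink->w, inlink->h,
                   inlink->sample_aspect_ratio.num,
                   inlink->sample_aspect_ratio.den,
                   ctx->output_pads[out_no].name, outlink->w, outlink->h,
                   outlink->sample_aspect_ratio.num,
                   outlink->sample_aspect_ratio.den);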

> +            return AVERROR(EINVAL);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static void push_frame(AVFilterContext *ctx, unsigned in_no,
> +                       AVFilterBufferRef *buf)
> +{
> +    ConcatContext *cat = ctx->priv;
> +    unsigned out_no = in_no % ctx->nb_outputs;
> +    AVFilterLink * inlink = ctx-> inputs[ in_no];
> +    AVFilterLink *outlink = ctx->outputs[out_no];
> +    struct concat_in *in = &cat->in[in_no];
> +
> +    buf->pts = av_rescale_q(buf->pts, inlink->time_base, outlink->time_base);
> +    in->pts = buf->pts;
> +    in->nb_frames++;
> +    /* add duration to input PTS */

> +    if (inlink->sample_rate)

when can it happen that sample_rate is not set?

> +        /* use number of audio samples */
> +        in->pts += av_rescale_q(buf->audio->nb_samples,
> +                                (AVRational){ 1, inlink->sample_rate },
> +                                outlink->time_base);
> +    else if (in->nb_frames >= 2)
> +        /* use mean duration */
> +        in->pts = av_rescale(in->pts, in->nb_frames, in->nb_frames - 1);
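
(For the record: in->pts here is the pts of the last frame, so assuming
a constant frame duration, pts * nb_frames / (nb_frames - 1) extrapolates
one more frame: e.g. after three frames at pts 0, 40, 80 it predicts 120.)
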
> +
> +    buf->pts += cat->delta_ts;
> +    switch (buf->type) {
> +    case AVMEDIA_TYPE_VIDEO:
> +        ff_start_frame(outlink, buf);
> +        ff_draw_slice(outlink, 0, outlink->h, 1);
> +        ff_end_frame(outlink);
> +        break;
> +    case AVMEDIA_TYPE_AUDIO:
> +        ff_filter_samples(outlink, buf);
> +        break;
> +    }
> +}
> +

> +static void incoming_frame(AVFilterLink *inlink, AVFilterBufferRef *buf)

I'm slightly against the use of a nominal phrase for a function name,
what about "process_frame()"?

> +{
> +    AVFilterContext *ctx  = inlink->dst;
> +    ConcatContext *cat    = ctx->priv;
> +    unsigned in_no = find_link(inlink, ctx->inputs, ctx->nb_inputs);
> +
> +    if (in_no < cat->cur_idx) {
> +        av_log(ctx, AV_LOG_ERROR, "Frame after EOF on input %s\n",
> +               ctx->input_pads[in_no].name);
> +        avfilter_unref_buffer(buf);
> +    } else if (in_no >= cat->cur_idx + ctx->nb_outputs) {
> +        ff_bufqueue_add(ctx, &cat->in[in_no].queue, buf);
> +    } else {
> +        push_frame(ctx, in_no, buf);
> +    }
> +}
> +

> +static void start_frame(AVFilterLink *inlink, AVFilterBufferRef *buf)
> +{
> +}
> +
> +static void draw_slice(AVFilterLink *inlink, int y, int h, int dir)
> +{
> +}

Nit+++: func(...) { } 

saves some lines and looks more readable (to me).

> +
> +static void end_frame(AVFilterLink *inlink)
> +{
> +    incoming_frame(inlink, inlink->cur_buf);
> +}
> +
> +static int filter_samples(AVFilterLink *inlink, AVFilterBufferRef *buf)
> +{
> +    incoming_frame(inlink, buf);
> +    return 0; /* enhancement: handle error return */
> +}
> +
> +static void close_input(AVFilterContext *ctx, unsigned in_no)
> +{
> +    ConcatContext *cat = ctx->priv;
> +
> +    cat->in[in_no].eof = 1;
> +    cat->nb_in_active--;
> +    av_log(ctx, AV_LOG_VERBOSE, "EOF on %s, %d streams left in segment.\n",
> +           ctx->input_pads[in_no].name, cat->nb_in_active);
> +}
> +

> +static void find_next_delta_ts(AVFilterContext *ctx)
> +{
> +    ConcatContext *cat = ctx->priv;
> +    unsigned i = cat->cur_idx;
> +    unsigned imax = i + ctx->nb_outputs;
> +    int64_t pts;
> +
> +    pts = cat->in[i++].pts;
> +    for (; i < imax; i++)
> +        pts = FFMAX(pts, cat->in[i].pts);
> +    cat->delta_ts += pts;
> +}
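
(So delta_ts accumulates, segment after segment, the maximum end
timestamp among the inputs of each finished segment: e.g. if the first
segment ends with video at pts 10.0s and audio at 9.8s, delta_ts becomes
10.0s, the next segment's frames are shifted by that amount, and the
0.2s audio gap is filled by send_silence() below.)
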
> +
> +static void send_silence(AVFilterContext *ctx, unsigned in_no, unsigned out_no)
> +{
> +    ConcatContext *cat = ctx->priv;
> +    AVFilterLink *outlink = ctx->outputs[out_no];
> +    int64_t base_pts = cat->in[in_no].pts;
> +    int64_t nb_samples, sent = 0;
> +    int frame_size;
> +    AVRational stb = { 1, ctx->inputs[in_no]->sample_rate };

nit++: tb or time_base seems less ambiguous at first reading
(silence_tb, stream_tb?)

> +    AVFilterBufferRef *buf;
> +
> +    if (!stb.den)
> +        return;
> +    nb_samples = av_rescale_q(cat->delta_ts - base_pts,
> +                              outlink->time_base, stb);
> +    frame_size = FFMAX(9600, stb.den / 5); /* arbitrary */
> +    while (nb_samples) {
> +        frame_size = FFMIN(frame_size, nb_samples);
> +        buf = ff_get_audio_buffer(outlink, AV_PERM_WRITE, frame_size);
> +        if (!buf)
> +            return;
> +        buf->pts = base_pts + av_rescale_q(sent, stb, outlink->time_base);
> +        ff_filter_samples(outlink, buf);
> +        sent       += frame_size;
> +        nb_samples -= frame_size;
> +    }
> +}
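
Nit: maybe document the magic numbers: stb.den / 5 is one fifth of a
second, and the 9600 floor is exactly that at 48 kHz (48000 / 5).
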
[...]
> +AVFilter avfilter_avf_concat = {
> +    .name          = "concat",

> +    .description   = NULL_IF_CONFIG_SMALL("concatenate audio and video streams."),

Concatenate ...

[...]
-- 
FFmpeg = Fast & Foolish Minimal Pacific Earthshaking God

