[FFmpeg-devel] [PATCH 1/5] lavc/htmlsubtitles: improve handling broken garbage

Michael Niedermayer michael at niedermayer.cc
Sun Jul 30 05:34:16 EEST 2017


On Sat, Jul 29, 2017 at 09:27:47PM +0200, Clément Bœsch wrote:
> This commit switches off forced correct nesting of tags and only keeps
> it for font tags. See long explanations in the code for the rationale.
> 
> This results in various FATE changes which I'll explain here:
> 
> - various swapping in font attributes, this is mostly noise due to the
>   old reverse stack way of printing them. The new one is more correct as
>   the last attribute takes over the previous ones.
> 
> - unrecognized tags disappears
> 
> - invalid tags that were previously displayed aren't anymore (instead,
>   we have a warning). This is better for the end user
> 
> The main benefit of this commit is to be more tolerant to error, leading
> to a better handling of badly nested tags or random wrong formatting for
> the end user.
> ---
>  libavcodec/htmlsubtitles.c       | 199 ++++++++++++++++++++++++++-------------
>  tests/ref/fate/sub-sami2         |   4 +-
>  tests/ref/fate/sub-srt           |  14 +--
>  tests/ref/fate/sub-srt-badsyntax |   4 +-
>  tests/ref/fate/sub-textenc       |  14 +--
>  tests/ref/fate/sub-webvttenc     |  14 +--
>  6 files changed, 157 insertions(+), 92 deletions(-)
> 
> diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
> index e1fb7bc3cf..69d855df21 100644
> --- a/libavcodec/htmlsubtitles.c
> +++ b/libavcodec/htmlsubtitles.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2010  Aurelien Jacobs <aurel at gnuage.org>
> + * Copyright (c) 2017  Clément Bœsch <u at pkh.me>
>   *
>   * This file is part of FFmpeg.
>   *
> @@ -18,6 +19,7 @@
>   * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
>   */
>  
> +#include "libavutil/avassert.h"
>  #include "libavutil/avstring.h"
>  #include "libavutil/common.h"
>  #include "libavutil/parseutils.h"
> @@ -31,19 +33,6 @@ static int html_color_parse(void *log_ctx, const char *str)
>      return rgba[0] | rgba[1] << 8 | rgba[2] << 16;
>  }
>  
> -enum {
> -    PARAM_UNKNOWN = -1,
> -    PARAM_SIZE,
> -    PARAM_COLOR,
> -    PARAM_FACE,
> -    PARAM_NUMBER
> -};
> -
> -typedef struct SrtStack {
> -    char tag[128];
> -    char param[PARAM_NUMBER][128];
> -} SrtStack;
> -
>  static void rstrip_spaces_buf(AVBPrint *buf)
>  {
>      if (av_bprint_is_complete(buf))
> @@ -75,17 +64,48 @@ static void handle_open_brace(AVBPrint *dst, const char **inp, int *an, int *clo
>      av_bprint_chars(dst, *in, 1);
>  }
>  
> +struct font_tag {
> +    char face[128];
> +    int size;
> +    uint32_t color;
> +};
> +
> +/*
> + * The general politic of the convert is to mask unsupported tags or formatting
> + * errors (but still alert the user/subtitles writer with an error/warning)
> + * without dropping any actual text content for the final user.
> + */
>  int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint *dst, const char *in)
>  {
> -    char *param, buffer[128], tmp[128];
> -    int len, tag_close, sptr = 1, line_start = 1, an = 0, end = 0;
> -    SrtStack stack[16];
> +    char *param, buffer[128];
> +    int len, tag_close, sptr = 0, line_start = 1, an = 0, end = 0;
>      int closing_brace_missing = 0;
> +    int i, likely_a_tag;
>  
> -    stack[0].tag[0] = 0;
> -    strcpy(stack[0].param[PARAM_SIZE],  "{\\fs}");
> -    strcpy(stack[0].param[PARAM_COLOR], "{\\c}");
> -    strcpy(stack[0].param[PARAM_FACE],  "{\\fn}");
> +    /*
> +     * state stack is only present for fonts since they are the only tags where
> +     * the state is not binary. Here is a typical use case:
> +     *
> +     *   <font color="red" size=10>
> +     *     red 10
> +     *     <font size=50> RED AND BIG </font>
> +     *     red 10 again
> +     *   </font>
> +     *
> +     * On the other hand, using the state system for all the tags should be
> +     * avoided because it breaks wrongly nested tags such as:
> +     *
> +     *   <b> foo <i> bar </b> bla </i>
> +     *
> +     * We don't want to break here; instead, we will treat all these tags as
> +     * binary state markers. Basically, "<b>" will activate bold, and "</b>"
> +     * will deactivate it, whatever the current state.
> +     *
> +     * This will also prevents cases where we have a random closing tag
> +     * remaining after the opening one was dropped. Yes, this happens and we
> +     * still don't want to print a "</b>" at the end of the dialog event.
> +     */
> +    struct font_tag stack[16] = {0};

this seems to produce a compiler warning:

./libavcodec/htmlsubtitles.c: In function ‘ff_htmlmarkup_to_ass’:
./libavcodec/htmlsubtitles.c:112:12: warning: missing braces around initializer [-Wmissing-braces]

gcc (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

During times of universal deceit, telling the truth becomes a
revolutionary act. -- George Orwell
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20170730/3c0564f4/attachment.sig>


More information about the ffmpeg-devel mailing list