[FFmpeg-devel] [PATCH] movtextdec: fix handling of UTF-8 subtitles

wm4 nfxjfg at googlemail.com
Sat Mar 24 18:42:04 EET 2018


On Sat, 24 Mar 2018 17:05:41 +0100
Carl Eugen Hoyos <ceffmpeg at gmail.com> wrote:

> 2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg at googlemail.com>:
> > Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> > ASCII characters) were not handled correctly. The spec mandates that
> > styling start/end ranges are in "characters". It's not quite clear what
> > a "character" is supposed to be, but maybe they mean unicode codepoints.
> >
> > FFmpeg's decoder treated the style ranges as byte idexes, which could
> > lead to UTF-8 sequences being broken, and the common code dropping the
> > whole subtitle line.
> >
> > Change this and count the codepoint instead. This also means that even
> > if this is somehow wrong, the decoder won't break UTF-8 sequences
> > anymore. The sample which led me to investigate this now appears to work
> > correctly.  
> 
> Could you confirm that this is also what QT does?

I can't test with QT. VLC seems to behave like with this patch applied.

> Or is it impossible that the patch breaks something?

Could probably break movtext subtitles generated by ffmpeg (I didn't
fix the movtext encoder, and it seems to have the same bug). But these
will most likely be broken on other players too. Tough the worst case
is just that the styles get shifted.

> [...]
> 
> > @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char
> > *text, const char *text_end,
> >              }
> >          }
> >
> > -        switch (*text) {
> > -        case '\r':
> > -            break;
> > -        case '\n':
> > -            av_bprintf(buf, "\\N");
> > -            break;
> > -        default:
> > -            av_bprint_chars(buf, *text, 1);
> > -            break;
> > +        len = get_utf8_length_at(text, text_end);
> > +        if (len < 1) {
> > +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in
> > subtitle\n");
> > +            len = 1;
> > +        }
> > +        for (i = 0; i < len; i++) {  
> 
> > +            switch (*text) {
> > +            case '\r':
> > +                break;
> > +            case '\n':
> > +                av_bprintf(buf, "\\N");
> > +                break;
> > +            default:
> > +                av_bprint_chars(buf, *text, 1);
> > +                break;  
> 
> Imo, the reindentation is not ok but this isn't my code.

Why not?


More information about the ffmpeg-devel mailing list