[FFmpeg-devel] [PATCH] movtextdec: fix handling of UTF-8 subtitles

Carl Eugen Hoyos ceffmpeg at gmail.com
Sat Mar 24 18:50:53 EET 2018


2018-03-24 17:42 GMT+01:00, wm4 <nfxjfg at googlemail.com>:
> On Sat, 24 Mar 2018 17:05:41 +0100
> Carl Eugen Hoyos <ceffmpeg at gmail.com> wrote:
>
>> 2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg at googlemail.com>:
>> > Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
>> > ASCII characters) were not handled correctly. The spec mandates that
>> > styling start/end ranges are in "characters". It's not quite clear what
>> > a "character" is supposed to be, but maybe they mean unicode codepoints.
>> >
>> > FFmpeg's decoder treated the style ranges as byte idexes, which could
>> > lead to UTF-8 sequences being broken, and the common code dropping the
>> > whole subtitle line.
>> >
>> > Change this and count the codepoint instead. This also means that even
>> > if this is somehow wrong, the decoder won't break UTF-8 sequences
>> > anymore. The sample which led me to investigate this now appears to work
>> > correctly.
>>
>> Could you confirm that this is also what QT does?
>
> I can't test with QT. VLC seems to behave like with this patch applied.
>
>> Or is it impossible that the patch breaks something?
>
> Could probably break movtext subtitles generated by ffmpeg (I didn't
> fix the movtext encoder, and it seems to have the same bug). But these
> will most likely be broken on other players too. Tough the worst case
> is just that the styles get shifted.

Thank you.

>> [...]
>>
>> > @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char
>> > *text, const char *text_end,
>> >              }
>> >          }
>> >
>> > -        switch (*text) {
>> > -        case '\r':
>> > -            break;
>> > -        case '\n':
>> > -            av_bprintf(buf, "\\N");
>> > -            break;
>> > -        default:
>> > -            av_bprint_chars(buf, *text, 1);
>> > -            break;
>> > +        len = get_utf8_length_at(text, text_end);
>> > +        if (len < 1) {
>> > +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in
>> > subtitle\n");
>> > +            len = 1;
>> > +        }
>> > +        for (i = 0; i < len; i++) {
>>
>> > +            switch (*text) {
>> > +            case '\r':
>> > +                break;
>> > +            case '\n':
>> > +                av_bprintf(buf, "\\N");
>> > +                break;
>> > +            default:
>> > +                av_bprint_chars(buf, *text, 1);
>> > +                break;
>>
>> Imo, the reindentation is not ok but this isn't my code.
>
> Why not?

Because the patch is much easier to read without it.

Carl Eugen


More information about the ffmpeg-devel mailing list