[FFmpeg-trac] #6021(avcodec:new): tx3g / mov_text subtitles are not encoded correctly in some specific cases

Wed Dec 14 08:25:49 EET 2016

#6021: tx3g / mov_text subtitles are not encoded correctly in some specific cases
-------------------------------------+-------------------------------------
             Reporter:  erikbs       |                     Type:  defect
               Status:  new          |                 Priority:  normal
            Component:  avcodec      |                  Version:  3.2.1
             Keywords:  utf8         |               Blocked By:
  mov_text ttxt tx3g subtitles mp4   |  Reproduced by developer:  0
             Blocking:               |
Analyzed by developer:  0            |
-------------------------------------+-------------------------------------
 Consider the following command:
 {{{
 ffmpeg -i input.mp4 -i input.srt -c:a copy -c:v copy -c:s mov_text
 output.mp4
 }}}
 which converts SRT subtitles into 3GPP timed text (TTXT) and embeds them
 in an MP4 container. Until recently, ffmpeg ignored any formatting/styling
 in the SRT file and converted only the raw text and timestamps. The
 resulting files played without problems.

 In SRT files, the start and end points of the text to be formatted are
 determined by tags, e.g. <i> and </i>. In TTXT subtitles, the start and
 end points are instead saved as numbers. Currently ffmpeg measures these
 values in bytes, but it looks like they should be measured in characters
 instead. For example, the Chinese character 我 consists of three bytes,
 but is considered a single character.
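
 The difference between the two measures can be shown with a minimal
 sketch. The helper name ''utf8_char_count'' is made up for illustration
 and is not an ffmpeg function. It assumes valid UTF-8 input and treats
 "character" as "Unicode code point", which is my reading of what the
 players expect:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of ffmpeg.
 * Counts code points in a NUL-terminated UTF-8 string by counting every
 * byte that is NOT a continuation byte (bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

 For the string 我 (bytes E6 88 91), strlen() reports 3 while
 ''utf8_char_count'' reports 1.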

 The problem arises when I try to convert an SRT where a line contains
 multibyte characters and a formatted string, and there are fewer than X
 characters between the formatted string and the end of the line, where X
 is the difference between the length of the line in bytes and its length
 in characters. Take for example this SRT line:
 {{{
 1
 00:00:01,000 --> 00:00:02,000
 The character 我 consists of three bytes
 <i>this string will cause problems</i>
 }}}
 Measured in characters (counting from zero and including the line break),
 the formatted string starts at position 40 and ends at position 71 (i.e.
 the character at position 71 is not part of the string). Measured in bytes
 (excluding the tags of course), the string
 starts at position 42 and ends at position 73, which are the values ffmpeg
 stores inside the output file. When I open this file in QuickTime Player
 on Mac OS X, it seems to expect that these numbers be measured in
 characters. Since 我 is counted as one character, it proceeds to read off
 the end of the line (i.e. past the 73rd byte), resulting in an instant,
 brutal crash. VLC, which appears to handle errors better, either tries to
 correct the error or just ignores the formatting when invalid data is
 found.
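
 The numbers above can be reproduced with a small sketch that converts a
 byte offset into a character offset. The names ''sample_text'' and
 ''utf8_byte_to_char_offset'' are hypothetical, and the text is assumed to
 be valid UTF-8 with a single line feed between the two lines:

```c
#include <assert.h>
#include <stddef.h>

/* The subtitle text from the example with the <i> tags removed.
 * "\xE6\x88\x91" is the UTF-8 encoding of the character U+6211. */
const char sample_text[] =
    "The character \xE6\x88\x91 consists of three bytes\n"
    "this string will cause problems";

/* Hypothetical helper, not part of ffmpeg.
 * Returns how many code points start before byte_off, i.e. the character
 * offset that corresponds to the given byte offset. Assumes valid UTF-8
 * and that byte_off falls on a code point boundary. */
size_t utf8_byte_to_char_offset(const char *text, size_t byte_off)
{
    size_t chars = 0;

    assert(text != NULL);
    for (size_t i = 0; i < byte_off && text[i] != '\0'; i++) {
        /* Continuation bytes (10xxxxxx) do not start a new character. */
        if (((unsigned char)text[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

 With this text, byte offsets 42 and 73 map to character offsets 40 and
 71, matching the values MP4Box writes.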

 I used MP4Box to convert the SRT to TTXT and to extract the TTXT from the
 MP4 generated by ffmpeg using
 {{{
 mp4box -ttxt input.srt # convert SRT to TTXT
 mp4box -ttxt 3 output.mp4 # extract the third stream (subtitles)
 }}}
 When I compared the output files, it immediately became clear that MP4Box
 counts characters while ffmpeg counts bytes. During testing I was able to
 confirm that VLC counts in the same way as MP4Box and QuickTime: in
 characters, not bytes. It should also be mentioned that the standalone
 TTXT files, which are XML files, contain the properties ''fromChar'' and
 ''toChar'', further indicating that we should count characters and not
 bytes. When stored inside an MPEG4 container, the TTXT files are
 “compressed” into some binary format I do not fully understand instead of
 using XML style tags. By replacing the correct bytes in the file produced
 by ffmpeg (byte count --> character count) using a hex editor, the file
 played correctly in VLC and QuickTime (with the correct letters
 italicized), and I also got MP4Box to extract an SRT that looked correct
 (without the hex editing, the SRT generated by MP4Box from the file
 produced by ffmpeg had the tags in the wrong place).

 The erroneous data seems to be written by the function ''encode_styl'' in
 the file ''libavcodec/movtextenc.c'', at lines 108-109 to be precise. Here
 the raw byte positions are written to the file. They are passed to the
 function through a ''MovTextContext'' struct, which has a member called
 ''style_attributes'', an array where each element corresponds to a
 formatted string. Each element in this array is another struct, with
 members such as ''style_start'' and ''style_end''. At the moment I have no
 idea where these values are produced, but the ''encode_styl'' function
 writes them to the file.
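
 For orientation, here is a rough sketch of how one style entry might be
 serialized, based on my understanding of the style record layout in 3GPP
 TS 26.245 (12 bytes, big-endian). The struct and function names are
 invented for illustration and are not the ones used in
 ''movtextenc.c''. The point is that the first two fields are meant to hold
 character offsets:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one style record (illustrative names). */
struct style_record {
    uint16_t start_char;  /* offset of the first styled character */
    uint16_t end_char;    /* offset one past the last styled character */
    uint16_t font_id;
    uint8_t  face_flags;  /* 0x01 bold, 0x02 italic, 0x04 underline */
    uint8_t  font_size;
    uint8_t  rgba[4];     /* text colour */
};

#define STYLE_RECORD_SIZE 12

/* Writes one record into out (which must hold at least 12 bytes) in
 * big-endian byte order and returns the number of bytes written. */
size_t write_style_record(uint8_t *out, const struct style_record *r)
{
    assert(out != NULL && r != NULL);

    out[0]  = (uint8_t)(r->start_char >> 8);
    out[1]  = (uint8_t)(r->start_char & 0xFF);
    out[2]  = (uint8_t)(r->end_char >> 8);
    out[3]  = (uint8_t)(r->end_char & 0xFF);
    out[4]  = (uint8_t)(r->font_id >> 8);
    out[5]  = (uint8_t)(r->font_id & 0xFF);
    out[6]  = r->face_flags;
    out[7]  = r->font_size;
    out[8]  = r->rgba[0];
    out[9]  = r->rgba[1];
    out[10] = r->rgba[2];
    out[11] = r->rgba[3];
    return STYLE_RECORD_SIZE;
}
```

 For the example above, a correct record would carry 40 and 71 in the
 first two fields, not 42 and 73.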

 '''To correct the bug''', the code that counts bytes and stores the
 resulting values in the struct that is eventually passed to
 ''encode_styl'' should be changed so that it counts characters instead. I
 guess ffmpeg already has code for counting UTF-8 characters, but if it
 does not, then
 [http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html this
 article] presents various ways of doing it.

--
Ticket URL: <https://trac.ffmpeg.org/ticket/6021>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker