[FFmpeg-devel] [PATCH] lavc: support subtitles charset conversion.

Wed Jan 2 11:20:13 CET 2013

On Tridi 13 Nivôse, year CCXXI, Clement Boesch wrote:
> I was considering the UTF-16 to UTF-8 as part of the demuxing: all the
> text decoders are currently designed to deal with text only (and I don't
> think that's a good idea to change this); they deal with a simple ASCII
> text string, because that's pretty straightforward.
> 
> We must IMO just make the demuxers output ASCII-compliant charset in all
> the cases, which will be sent to the different text decoders, and then
> converted to UTF-8 if necessary (like proposed in the patch).

That will work perfectly for subtitles coming from text files in strange
encodings, but it will not work when the subtitles come from a muxed file,
since we agree that special cases in real demuxers must be avoided.

I do not have an example currently, but finding a format that can store text
subtitles and has a metadata field for the encoding seems quite likely.
Matroska mandates that subtitles are in UTF-8, but I am pretty sure someone
somewhere produced Matroska files with UTF-16 text subtitles in them, and if
someone reports them, we will want to support them.

The way I see it, recoding may need to happen either before the demuxer,
inside the demuxer, between the demuxer and the decoder, inside the decoder
or after the decoder. And probably each of these cases is necessary in at
least one situation: we need an API that can handle them all.

Since your patch is about lavc, we do not have to worry about the demuxer
part, and only the before-decoder, inside-decoder, after-decoder parts have
to be handled.

A simple additional flag may be just enough:

    char *text_encoding;
    unsigned char text_encoding_mode;

    enum {
        AV_TEXT_ENCODING_MODE_DEFAULT, ///< let lavc decide
        AV_TEXT_ENCODING_MODE_MANUAL,  ///< the decoder does the work
        AV_TEXT_ENCODING_MODE_DONE,    ///< the demuxer did the work
        AV_TEXT_ENCODING_MODE_PRE,     ///< lavc must recode the packet
        AV_TEXT_ENCODING_MODE_POST,    ///< lavc must recode the decoded text
    };

Your patch already implements POST; implementing PRE the same way would be
pretty trivial, it is just a matter of copying an AVPacket instead of an
AVSubtitleRect; the other cases do not need implementing at all.
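
As a rough illustration of why only PRE and POST need generic code, the
dispatch on the flag could look like the sketch below. The mode names are the
ones listed above; the sketch_* names are hypothetical, and mapping DEFAULT to
POST is only an assumption of this sketch (the proposal just says "let lavc
decide").

```c
/* Mode names as listed in the proposal above; the numeric values are
 * whatever the enum assigns and carry no meaning. */
enum {
    AV_TEXT_ENCODING_MODE_DEFAULT, /* let lavc decide */
    AV_TEXT_ENCODING_MODE_MANUAL,  /* the decoder does the work */
    AV_TEXT_ENCODING_MODE_DONE,    /* the demuxer did the work */
    AV_TEXT_ENCODING_MODE_PRE,     /* lavc must recode the packet */
    AV_TEXT_ENCODING_MODE_POST,    /* lavc must recode the decoded text */
};

/* Hypothetical: where the generic lavc code would have to recode. */
enum sketch_recode_stage {
    SKETCH_RECODE_NOWHERE,       /* decoder or demuxer handles the encoding */
    SKETCH_RECODE_BEFORE_DECODE, /* recode a copy of the AVPacket data */
    SKETCH_RECODE_AFTER_DECODE,  /* recode the text of each AVSubtitleRect */
};

static enum sketch_recode_stage sketch_stage_for_mode(unsigned char mode)
{
    switch (mode) {
    case AV_TEXT_ENCODING_MODE_PRE:
        return SKETCH_RECODE_BEFORE_DECODE;
    case AV_TEXT_ENCODING_MODE_POST:
        return SKETCH_RECODE_AFTER_DECODE;
    case AV_TEXT_ENCODING_MODE_MANUAL:
    case AV_TEXT_ENCODING_MODE_DONE:
        return SKETCH_RECODE_NOWHERE;
    default:
        /* DEFAULT (or unknown): assumption of this sketch, behave as POST. */
        return SKETCH_RECODE_AFTER_DECODE;
    }
}
```

MANUAL and DONE both end up in the "nowhere" branch, which is the sense in
which they need no implementation in the generic code.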

> Now to make life easy for demuxers, we need to propose a few helpers to
> transform UTF-16 input into UTF-8. The main problem I see currently is the
> format detection of such encoding; it might require some tweaking in the
> probing. Any idea welcome.

I suggest this:

/**
 * Try to detect a memory buffer text encoding and convert it to UTF-8.
 *
 * @param[out] ret        text in UTF-8 with 0-terminator
 * @param[in]  in         text in unknown encoding
 * @param[in]  in_size    size of in
 * @param[in]  encodings  comma-separated list of encodings to try (or NULL)
 * @param[out] encoding   detected encoding
 * @param[out] remaining  size of in that could not be recoded
 * @return  score of the detection, or <0 error code
 */
int ff_recode_detect_buffer(char **ret, const char *in, size_t in_size,
                            const char *encodings,
                            char **encoding, size_t *remaining);

A similar ff_read_detect_recode_stream() taking an aviobuf would be helpful
too.

But that can come later.
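
To give an idea of the shape of the buffer helper, here is a much reduced
sketch. It is not the proposed function: all sketch_* names are hypothetical,
the list of candidate encodings is dropped, UTF-16 is only recognized by its
byte order mark, the UTF-8 check is structural only (no overlong rejection),
and the scores 100 and 50 are arbitrary. A real version would presumably go
through iconv for the candidate list.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Loose UTF-8 check: sequence structure only. NUL bytes are rejected
 * because in a text buffer they hint at UTF-16. */
static int sketch_is_utf8(const unsigned char *p, size_t size)
{
    size_t i = 0, k, n;

    while (i < size) {
        unsigned char c = p[i];
        if (c == 0)                  return 0;
        else if (c < 0x80)           n = 0;
        else if ((c & 0xE0) == 0xC0) n = 1;
        else if ((c & 0xF0) == 0xE0) n = 2;
        else if ((c & 0xF8) == 0xF0) n = 3;
        else                         return 0;
        if (n >= size - i)           return 0; /* truncated sequence */
        for (k = 1; k <= n; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;
        i += n + 1;
    }
    return 1;
}

/* Write code point cp to out as UTF-8, return the number of bytes. */
static size_t sketch_put_utf8(char *out, uint32_t cp)
{
    static const unsigned char lead[] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
    size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4, k;

    for (k = n - 1; k > 0; k--, cp >>= 6)
        out[k] = (char)(0x80 | (cp & 0x3F));
    out[0] = (char)(lead[n] | cp);
    return n;
}

/* Convert BOM-less UTF-16 to a 0-terminated UTF-8 string; stop at the first
 * invalid or truncated unit and report the unconverted byte count. */
static char *sketch_utf16_to_utf8(const unsigned char *p, size_t size,
                                  int be, size_t *remaining)
{
    /* A 2-byte unit yields at most 3 bytes, a 4-byte pair exactly 4. */
    char *out = malloc(size / 2 * 3 + 1);
    size_t i = 0, o = 0;

    if (!out)
        return NULL;
    while (size - i >= 2) {
        uint32_t cp = (uint32_t)(be ? p[i] << 8 | p[i + 1] : p[i + 1] << 8 | p[i]);
        size_t used = 2;
        if (cp >= 0xDC00 && cp <= 0xDFFF)
            break;                           /* lone low surrogate */
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            uint32_t lo;
            if (size - i < 4)
                break;                       /* truncated pair */
            lo = (uint32_t)(be ? p[i + 2] << 8 | p[i + 3] : p[i + 3] << 8 | p[i + 2]);
            if (lo < 0xDC00 || lo > 0xDFFF)
                break;                       /* unpaired high surrogate */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            used = 4;
        }
        o += sketch_put_utf8(out + o, cp);
        i += used;
    }
    out[o] = '\0';
    *remaining = size - i;
    return out;
}

/* Reduced, hypothetical variant of the helper: returns an arbitrary
 * positive score on success, <0 on failure. */
static int sketch_recode_detect_buffer(char **ret, const char *in, size_t in_size,
                                       const char **encoding, size_t *remaining)
{
    const unsigned char *p = (const unsigned char *)in;

    if (in_size >= 2 && ((p[0] == 0xFF && p[1] == 0xFE) ||
                         (p[0] == 0xFE && p[1] == 0xFF))) {
        int be = p[0] == 0xFE;
        *encoding = be ? "UTF-16BE" : "UTF-16LE";
        *ret = sketch_utf16_to_utf8(p + 2, in_size - 2, be, remaining);
        return *ret ? 100 : -1;
    }
    if (sketch_is_utf8(p, in_size)) {
        if (!(*ret = malloc(in_size + 1)))
            return -1;
        memcpy(*ret, in, in_size);
        (*ret)[in_size] = '\0';
        *encoding = "UTF-8";
        *remaining = 0;
        return 50;
    }
    return -1;
}
```

The caller frees *ret. A non-zero *remaining after a UTF-16 conversion means
the tail of the buffer was truncated or invalid, which is the case the
remaining parameter in the prototype above is meant to report.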

Regards,

-- 
  Nicolas George