[FFmpeg-devel] [PATCH] lavc: support subtitles charset conversion.
nicolas.george at normalesup.org
Wed Jan 2 11:20:13 CET 2013
On Tridi 13 Nivôse, year CCXXI, Clement Boesch wrote:
> I was considering the UTF-16 to UTF-8 as part of the demuxing: all the
> text decoders are currently designed to deal with text only (and I don't
> think that's a good idea to change this); they deal with a simple ASCII
> text string, because that's pretty straightforward.
> We must IMO just make the demuxers output ASCII-compliant charset in all
> the cases, which will be sent to the different text decoders, and then
> converted to UTF-8 if necessary (like proposed in the patch).
That will work perfectly for subtitles coming from text files in strange
encodings, but it will not work when the subtitles come from a muxed file,
since we agree that special cases in real demuxers must be avoided.
I do not have an example currently, but finding a format that can store text
subtitles and has a metadata field for the encoding seems quite likely.
Matroska mandates that subtitles are in UTF-8, but I am pretty sure someone
somewhere produced Matroska files with UTF-16 text subtitles in them, and if
someone reports them, we will want to support them.
The way I see it, recoding may need to happen either before the demuxer,
inside the demuxer, between the demuxer and the decoder, inside the decoder
or after the decoder. And probably any of these cases can be necessary in at
least one situation: we need an API that can handle all.
Since your patch is about lavc, we do not have to worry about the demuxer
part, and only the before-decoder, inside-decoder, after-decoder parts have
to be handled.
A simple additional flag may be just enough:

    unsigned char text_encoding_mode;

    enum {
        AV_TEXT_ENCODING_MODE_DEFAULT, ///< let lavc decide
        AV_TEXT_ENCODING_MODE_MANUAL,  ///< the decoder does the work
        AV_TEXT_ENCODING_MODE_DONE,    ///< the demuxer did the work
        AV_TEXT_ENCODING_MODE_PRE,     ///< lavc must recode the packet
        AV_TEXT_ENCODING_MODE_POST,    ///< lavc must recode the decoded text
    };

Your patch already implements POST; implementing PRE the same way would be
pretty trivial, it is just a matter of copying an AVPacket instead of an
AVSubtitleRect; the other cases do not need implementing at all.
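
To make the split between the five modes concrete, here is a minimal sketch of how generic code could map the flag to the stage at which it has to recode. All names are hypothetical stand-ins for the flag proposed above, not an existing lavc API, and the decoder_default parameter is an assumption about what "let lavc decide" could mean (a per-decoder preference).

```c
#include <assert.h>

/* Hypothetical mirror of the proposed flag; none of these names exist in lavc. */
enum text_encoding_mode {
    TEXT_ENCODING_MODE_DEFAULT, /* let the library decide */
    TEXT_ENCODING_MODE_MANUAL,  /* the decoder does the work */
    TEXT_ENCODING_MODE_DONE,    /* the demuxer did the work */
    TEXT_ENCODING_MODE_PRE,     /* recode the packet before decoding */
    TEXT_ENCODING_MODE_POST,    /* recode the decoded text */
};

/* Stage at which the generic layer itself has to recode. */
enum recode_stage {
    RECODE_NONE,
    RECODE_PACKET,
    RECODE_DECODED_TEXT,
};

/*
 * Map a mode to the stage at which the generic code must recode.
 * decoder_default is an assumption: it stands for whatever a given
 * decoder would prefer when the caller leaves the mode at DEFAULT.
 */
static enum recode_stage recode_stage_for_mode(enum text_encoding_mode mode,
                                               enum text_encoding_mode decoder_default)
{
    if (mode == TEXT_ENCODING_MODE_DEFAULT) {
        /* A default that is itself DEFAULT would be meaningless. */
        assert(decoder_default != TEXT_ENCODING_MODE_DEFAULT);
        mode = decoder_default;
    }

    switch (mode) {
    case TEXT_ENCODING_MODE_PRE:
        return RECODE_PACKET;
    case TEXT_ENCODING_MODE_POST:
        return RECODE_DECODED_TEXT;
    default:
        /* MANUAL and DONE: someone else already handles the encoding. */
        return RECODE_NONE;
    }
}
```

Only PRE and POST lead to any work in the generic layer, which matches the remark that the other cases do not need implementing.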
> Now to make life easy for demuxers, we need to propose a few helpers to
> transform UTF-16 input into UTF-8. The main problem I see currently is the
> format detection of such encoding; it might require some tweaking in the
> probing. Any idea welcome.
I suggest this:

    /**
     * Try to detect a memory buffer text encoding and convert it to UTF-8.
     *
     * @param[out] ret        text in UTF-8 with 0-terminator
     * @param[in]  in         text in unknown encoding
     * @param[in]  in_size    size of in
     * @param[in]  encodings  comma-separated list of encodings to try (or NULL)
     * @param[out] encoding   detected encoding
     * @param[out] remaining  size of in that could not be recoded
     * @return score of the detection, or <0 error code
     */
    int ff_recode_detect_buffer(char **ret, const char *in, size_t in_size,
                                const char *encodings,
                                char **encoding, size_t *remaining);

A similar ff_read_detect_recode_stream() taking an aviobuf would be helpful,
but that can come later.
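
For the detection half only, a rough sketch of what a scoring heuristic could look like: check for a byte-order mark, then fall back to a structural UTF-8 check. The function name, the score values and the signature are illustrative assumptions, not part of the proposal above. It does no recoding, does not distinguish the UTF-32LE mark from the UTF-16LE one, and does not reject overlong or surrogate UTF-8 sequences.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Guess the encoding of a buffer (hypothetical helper, illustrative scores).
 * Returns a score, higher meaning more confident, and stores a static
 * encoding name in *encoding, or NULL when nothing matched (score 0).
 */
static int guess_text_encoding(const unsigned char *in, size_t in_size,
                               const char **encoding)
{
    size_t i = 0, k, extra;
    int seen_multibyte = 0;

    assert(encoding != NULL);

    /* A byte-order mark is the strongest hint available. */
    if (in_size >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF) {
        *encoding = "UTF-8";
        return 100;
    }
    if (in_size >= 2 && in[0] == 0xFF && in[1] == 0xFE) {
        *encoding = "UTF-16LE";
        return 100;
    }
    if (in_size >= 2 && in[0] == 0xFE && in[1] == 0xFF) {
        *encoding = "UTF-16BE";
        return 100;
    }

    /* No mark: check that lead and continuation bytes are laid out like UTF-8. */
    while (i < in_size) {
        unsigned char c = in[i];

        if (c == 0)
            goto unknown;           /* NUL byte: probably not 8-bit text */
        else if (c < 0x80)
            extra = 0;              /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;
        else
            goto unknown;           /* byte can never start a UTF-8 sequence */

        if (extra > in_size - i - 1)
            goto unknown;           /* sequence truncated by end of buffer */
        for (k = 1; k <= extra; k++)
            if ((in[i + k] & 0xC0) != 0x80)
                goto unknown;       /* not a continuation byte */
        if (extra)
            seen_multibyte = 1;
        i += extra + 1;
    }

    *encoding = "UTF-8";
    /* Pure ASCII is compatible with many encodings, hence the low score. */
    return seen_multibyte ? 75 : 25;

unknown:
    *encoding = NULL;
    return 0;
}
```

A real helper would then try the caller's list of candidate encodings on buffers that score 0 here, and hand the winning one to the recoding step.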