[FFmpeg-devel] [PATCH] lavc: support subtitles charset conversion.
ubitux at gmail.com
Thu Jan 3 09:08:58 CET 2013
On Wed, Jan 02, 2013 at 11:20:13AM +0100, Nicolas George wrote:
> On Tridi 13 Nivôse, year CCXXI, Clement Boesch wrote:
> > I was considering the UTF-16 to UTF-8 conversion as part of the demuxing:
> > all the text decoders are currently designed to deal with text only (and I
> > don't think it's a good idea to change this); they deal with a simple
> > ASCII text string, because that's pretty straightforward.
> > We must IMO just make the demuxers output an ASCII-compatible charset in
> > all cases, which will be sent to the different text decoders, and then
> > converted to UTF-8 if necessary (as proposed in the patch).
> That will work perfectly for subtitles coming from text files in strange
> encodings, but it will not work when the subtitles come from a muxed file,
> since we agree that special cases in real demuxers must be avoided.
> I do not have an example currently, but finding a format that can store text
> subtitles and has a metadata field for the encoding seems quite likely.
> Matroska mandates that subtitles are in UTF-8, but I am pretty sure someone
> somewhere produced Matroska files with UTF-16 text subtitles in them, and if
> someone reports them, we will want to support them.
> The way I see it, recoding may need to happen either before the demuxer,
> inside the demuxer, between the demuxer and the decoder, inside the decoder
> or after the decoder. And probably any of these cases can be necessary in at
> least one situation: we need an API that can handle all.
> Since your patch is about lavc, we do not have to worry about the demuxer
> part, and only the before-decoder, inside-decoder, after-decoder parts have
> to be handled.
> A simple additional flag may be just enough:
> char *text_encoding;
> unsigned char text_encoding_mode;
> AV_TEXT_ENCODING_MODE_DEFAULT, //< let lavc decide
Detection based on what?
> AV_TEXT_ENCODING_MODE_MANUAL, //< the decoder does the work
Internally to the decoder, using the helper you're talking about below?
> AV_TEXT_ENCODING_MODE_DONE, //< the demuxer did the work
Internally to the demuxer, using the helper you're talking about below?
> AV_TEXT_ENCODING_MODE_PRE, //< lavc must recode the packet
Since lavc is not really supposed to modify the AVPacket (AFAIK), this
might be a bit painful (buf copy before decoding callback). Maybe it would
belong in a post-demux, but that may be a bit problematic for the stream
> AV_TEXT_ENCODING_MODE_POST, //< lavc must recode the decoded text
That sounds like the perfect place ;)
Except that the decoded text doesn't carry a buffer size, so it can only
handle conversions from ASCII-compatible charsets.
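
To illustrate the buffer size point, here is a minimal sketch, assuming the
decoded text reaches the recoding step as a plain 0-terminated char * with no
length next to it. The helper name text_len_seen_by_post() is made up for
this example and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" in UTF-16LE: each ASCII character is followed by a 0x00 byte,
 * and the terminator is two 0x00 bytes. */
static const char utf16le_hi[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

/* The same text in UTF-8 (identical to ASCII here). */
static const char utf8_hi[] = "Hi";

/* Hypothetical helper: the length a recoder sees when all it is given
 * is a 0-terminated string and no explicit size. */
static size_t text_len_seen_by_post(const char *text)
{
    assert(text != NULL);
    return strlen(text);
}
```

With the UTF-8 string the helper sees both bytes; with the UTF-16LE string it
stops at the first 0x00 and sees a single byte out of the four bytes of text,
so the rest never reaches the converter.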
> Your patch already implements POST; implementing PRE the same way would be
> pretty trivial, it is just a matter of copying an AVPacket instead of an
> AVSubtitleRect; the other cases do not need implementing at all.
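
For reference, the mode list above could be read roughly as follows. This is
only a sketch of the proposal, not existing lavc API: the RecodeStage enum and
recode_stage_for_mode() are hypothetical names, and mapping DEFAULT to the
POST behaviour is an assumption made here for the sake of the example.

```c
#include <assert.h>

/* Modes as proposed in the quoted mail (not an existing lavc enum). */
enum TextEncodingMode {
    AV_TEXT_ENCODING_MODE_DEFAULT, /* let lavc decide */
    AV_TEXT_ENCODING_MODE_MANUAL,  /* the decoder does the work */
    AV_TEXT_ENCODING_MODE_DONE,    /* the demuxer did the work */
    AV_TEXT_ENCODING_MODE_PRE,     /* lavc must recode the packet */
    AV_TEXT_ENCODING_MODE_POST,    /* lavc must recode the decoded text */
};

/* Hypothetical: what the generic lavc code has to recode, if anything. */
enum RecodeStage {
    RECODE_STAGE_NONE,   /* nothing to do in the generic code */
    RECODE_STAGE_PACKET, /* recode a copy of the packet before decoding */
    RECODE_STAGE_TEXT,   /* recode the decoded text after decoding */
};

static enum RecodeStage recode_stage_for_mode(enum TextEncodingMode mode)
{
    switch (mode) {
    case AV_TEXT_ENCODING_MODE_MANUAL:
    case AV_TEXT_ENCODING_MODE_DONE:
        return RECODE_STAGE_NONE;
    case AV_TEXT_ENCODING_MODE_PRE:
        return RECODE_STAGE_PACKET;
    case AV_TEXT_ENCODING_MODE_POST:
        return RECODE_STAGE_TEXT;
    case AV_TEXT_ENCODING_MODE_DEFAULT:
        /* Assumption: fall back to POST, the only case already implemented. */
        return RECODE_STAGE_TEXT;
    }
    assert(!"unknown text encoding mode");
    return RECODE_STAGE_NONE;
}
```

Only PRE and POST need code in the generic path, which matches the remark
that the other cases do not need implementing at all.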
> > Now to make life easy for demuxers, we need to propose a few helpers to
> > transform UTF-16 input into UTF-8. The main problem I see currently is the
> > format detection of such encoding; it might require some tweaking in the
> > probing. Any idea welcome.
> I suggest this:
> /**
>  * Try to detect a memory buffer text encoding and convert it to UTF-8.
>  * @param[out] ret       text in UTF-8 with 0-terminator
>  * @param[in]  in        text in unknown encoding
>  * @param[in]  in_size   size of in
>  * @param[in]  encodings comma-separated list of encodings to try (or NULL)
>  * @param[out] encoding  detected encoding
>  * @param[out] remaining size of in that could not be recoded
>  * @return score of the detection, or <0 error code
>  */
> int ff_recode_detect_buffer(char **ret, const char *in, size_t in_size,
> const char *encodings,
> char **encoding, size_t *remaining);
Note: inside the demuxer, you don't have access to the codec charset
(options are not yet populated). Inside the decoder that's possible.
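
To make the discussion a bit more concrete, a helper of that kind could look
roughly like the sketch below, built on POSIX iconv(). It is a simplified
stand-in and not the proposed ff_recode_detect_buffer(): the names
try_recode() and recode_detect_buffer() are hypothetical, the list of
encodings must not be NULL, there is no score and no "remaining" output, and
"detection" simply means that the first encoding converting the whole buffer
without error wins.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert in[0..in_size) from encoding `from` to UTF-8.
 * Returns a malloc'ed 0-terminated string, or NULL on any failure. */
static char *try_recode(const char *in, size_t in_size, const char *from)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;

    /* Assumption: 4 output bytes per input byte is enough; if it is not,
     * iconv() fails with E2BIG and the encoding is rejected. */
    size_t out_size = in_size * 4;
    char *out = malloc(out_size + 1);
    if (!out) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)in; /* iconv() takes a non-const input pointer */
    char *outp = out;
    size_t in_left = in_size, out_left = out_size;

    size_t r = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (r != (size_t)-1) /* flush any pending shift state */
        r = iconv(cd, NULL, NULL, &outp, &out_left);
    iconv_close(cd);

    if (r == (size_t)-1 || in_left != 0) {
        free(out);
        return NULL;
    }
    *outp = '\0';
    return out;
}

/* Try each name of the comma-separated list in order.
 * On success returns 0 and sets *ret and *encoding (both malloc'ed);
 * returns -1 if no encoding could convert the whole buffer. */
static int recode_detect_buffer(char **ret, const char *in, size_t in_size,
                                const char *encodings, char **encoding)
{
    assert(ret && in && encodings && encoding);
    *ret = NULL;
    *encoding = NULL;

    const char *p = encodings;
    while (*p) {
        size_t len = strcspn(p, ",");
        char *name = malloc(len + 1);
        if (!name)
            return -1;
        memcpy(name, p, len);
        name[len] = '\0';

        char *out = len ? try_recode(in, in_size, name) : NULL;
        if (out) {
            *ret = out;
            *encoding = name;
            return 0;
        }
        free(name);
        p += len;
        if (*p == ',')
            p++;
    }
    return -1;
}
```

Called with the four bytes "caf\xe9" and the list "UTF-8,ISO-8859-1", the
UTF-8 attempt is rejected because of the dangling 0xE9 byte and the
ISO-8859-1 attempt succeeds. Note that the order of the list matters a lot
here: a permissive single-byte charset placed first would accept almost any
input.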
> A similar ff_read_detect_recode_stream() taking an aviobuf would be helpful
> But that can come later.
I must say I have a hard time following what you actually want me to do.
Can you tell me more about what you want to expose to the user