[FFmpeg-devel] Microsoft Smooth Streaming
nicolas.george at normalesup.org
Wed Oct 26 10:50:46 CEST 2011
On quartidi 4 Brumaire, year CCXX, Marcus Nascimento wrote:
> Please, check the answers below.
That was more than perfect. Thanks.
> First of all, Microsoft Smooth Streaming basic idea is to encode the same
> video in multiple bitrates. The client can decide which bitrate to use. At
> any time it is possible to switch to another bitrate based on bandwidth
> availability and other measurements.
> Each encoding bitrate will originate an independent ISMV file (IIS Smooth
> Media Video, I suppose).
> The encoding focuses on the fragmented structure that ISOFF (ISO
> File Format - the MP4 file format) offers. Keyframes are generated regularly
> and equally spaced in all ISMV files (2s).
> This is more restrictive than regular encoding procedures that allow some
> flexibility on keyframe intervals (I believe so, since I'm not a specialist
> on that).
> Important to say that all fragments always start with a keyframe.
> Each ISOFF fragment is perfectly aligned between different bitrates (in
> terms of time, of course. Data size may vary drastically). That alignment
> allows the client to request different bitrates for one fragment and switch
> to another bitrate in the next fragment.
> The ISMV file format is called PIFF and is based on the ISOFF with a few
> additions. There are 3 uuid box types that are dedicated to DRM purposes (I
> won't touch them here). Thus the meaning of PIFF: Protected Interoperable
> File Format. The PIFF brand (ftyp box value) is "piff".
> More on PIFF format here: http://go.microsoft.com/?linkid=9682897
> The server side (in the MS implementation) is just an extension to the IIS
> called IIS Media Services.
> That is just a web service that accepts HTTP requests with a custom
> formatted URL.
> The base URL is something like http://domain.com/video.ism (note that it is
> not ISMV), which is never requested.
> By the time the client wants to play a video, it will request a Manifest
> file. The URL is <baseUrl>/Manifest.
For now, it sounds quite straightforward.
> The Manifest is just an XML file that provides some information regarding
> different streams and other information.
> Here is a basic example (modified parts of the original found here:
Do you know how much of the features of XML the manifest is allowed to use?
Writing a parser for well-balanced-tags-with-quoted-attributes is an easy
task, while supporting namespaces, external entities, processing
instructions, etc., is not.
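As an illustration of the easy case, here is a minimal Python sketch of a
parser that accepts only well-balanced tags with double-quoted attributes.
It is not FFmpeg code; the name parse_simple_xml is hypothetical, and
anything outside that subset (namespaces, comments, processing instructions,
an XML declaration) is rejected rather than handled.

```python
import re

# One token: an opening, closing or self-closing tag, or a run of text.
_TOKEN = re.compile(
    r'<(/?)([A-Za-z_][\w.-]*)'
    r'((?:\s+[A-Za-z_][\w.-]*\s*=\s*"[^"]*")*)\s*(/?)>'
    r'|([^<]+)'
)
_ATTR = re.compile(r'([A-Za-z_][\w.-]*)\s*=\s*"([^"]*)"')


def parse_simple_xml(text):
    """Parse balanced tags with quoted attributes into nested dicts.

    Character data is skipped and entities are not decoded. Namespaces,
    comments, CDATA and processing instructions raise ValueError.
    """
    root = {"tag": None, "attrs": {}, "children": []}
    stack = [root]
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unsupported syntax at offset %d" % pos)
        pos = m.end()
        closing, name, attrs, self_closing, chars = m.groups()
        if chars is not None:
            continue  # text between tags is ignored in this sketch
        if closing:
            if attrs or self_closing:
                raise ValueError("malformed closing tag </%s>" % name)
            if len(stack) == 1 or stack[-1]["tag"] != name:
                raise ValueError("unbalanced closing tag </%s>" % name)
            stack.pop()
            continue
        node = {"tag": name, "attrs": dict(_ATTR.findall(attrs)),
                "children": []}
        stack[-1]["children"].append(node)
        if not self_closing:
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed tag <%s>" % stack[-1]["tag"])
    return root["children"]
```

If real manifests turn out to need more than this subset, a full XML
library would be the safer choice.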
> We can see it says the version of the smooth stream media and the duration
> (this is measured in 1 / 10,000,000 seconds).
> Next we see the video section which says each quality level has 4 chunks
> (fragments), with 2 quality levels available. It also says the video
> dimensions and the URL format.
> Next it gives information about each bitrate with codec information and
> codec private data (I believe it is used to configure the codec is a opaque
> Next it lists each fragment size. The first fragment would be referenced as
> 0 (zero), and the others as a sum of previous fragments size. I'm not sure
> exactly what those values mean.
> Next we have the same structure for audio description.
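One possible reading of those values, consistent with the example URLs
further down, is that each listed value is a fragment duration in the same
1/10,000,000 s unit and that a fragment is referenced by the sum of the
durations before it. A minimal Python sketch under that assumption (the
durations are hypothetical, chosen so that the fourth fragment starts at
60060000 as in the example request):

```python
TICKS_PER_SECOND = 10_000_000  # manifest times are in 1/10,000,000 s


def fragment_start_times(durations):
    """Return each fragment's start time: 0, then running sums."""
    starts = []
    total = 0
    for duration in durations:
        starts.append(total)
        total += duration
    return starts


def ticks_to_seconds(ticks):
    """Convert manifest time units to seconds."""
    return ticks / TICKS_PER_SECOND


# Hypothetical manifest data: four fragments of 20020000 ticks each.
durations = [20020000] * 4
starts = fragment_start_times(durations)
# starts is [0, 20020000, 40040000, 60060000]
```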
> After getting the Manifest file, the client must decide which quality level
> is best suited for the device and its resources.
> It is not clear to me on what parameters it bases its decisions. I heard
> about size of the screen and its resolution, computing power, download
> bandwidth, etc.
I do not think you need to concern yourself with the heuristics for that:
that is for the application to decide, not the library implementing the
protocol. The library only needs to provide the information necessary to
make the decision.
Others may disagree, but I believe that if you manage to implement anything
at all (for example reading the first, or the best stream of each type, or
maybe reading all streams while honoring the discard flag), that would be a
very good starting point.
> As soon as the quality level is chosen, I suppose the decoder has to be
> configured in a suitable way, using the CodecPrivateData information
> The client then will start requesting fragments following the URL pattern
> given in the Manifest.
> To request the first fragment for the first quality level, it would follow
> the <baseUrl>/QualityLevels(0)/Fragments(video=0).
> To request the fourth fragment for the second quality level, it would follow
> the <baseUrl>/QualityLevels(1)/Fragments(video=60060000).
> It is still possible to request just the audio following the same idea. For
> instance: <baseUrl>/QualityLevels(0)/Fragments(audio=20201360).
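The requests described above can be sketched in Python as follows. The
function names are hypothetical, and the pattern is hard-coded with the
QualityLevels spelling from the audio example; a real client would take the
pattern from the Manifest instead.

```python
def manifest_url(base_url):
    """URL of the Manifest for a given base URL."""
    return base_url.rstrip("/") + "/Manifest"


def fragment_url(base_url, quality, stream_type, start_time):
    """URL of one fragment.

    quality is the value placed inside QualityLevels(...), stream_type is
    "video" or "audio", and start_time is in 1/10,000,000 s units.
    """
    return "%s/QualityLevels(%d)/Fragments(%s=%d)" % (
        base_url.rstrip("/"), quality, stream_type, start_time)


base = "http://domain.com/video.ism"
first_video = fragment_url(base, 0, "video", 0)
fourth_video = fragment_url(base, 1, "video", 60060000)
```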
> Each fragment received is arranged in PIFF wire format. In other words:
> Contains exactly one moof box and exactly one mdat box and nothing
> more (check MP4 specs for more info).
> Of course there are internal boxes to those if applicable. It may contain
> custom uuid boxes designed to allow DRM protection. Let's not consider them
> I'm not sure which information I can get from the moof boxes, but I assume
> it would be relevant for the demuxer only, since the codec would only work
> on the mdat contained opaque data. Correct me if I'm wrong, please.
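To make the "one moof, one mdat" layout concrete, here is a minimal Python
sketch that walks the top-level boxes of a downloaded fragment using the
generic ISO box header (32-bit big-endian size, four-character type, with
size 1 meaning a 64-bit size follows and size 0 meaning "to the end of the
data"). The function names are hypothetical, and the sketch does not descend
into the boxes nested inside moof.

```python
import struct


def iter_top_level_boxes(data):
    """Yield (type, payload) for each top-level box in data."""
    pos = 0
    while pos < len(data):
        if len(data) - pos < 8:
            raise ValueError("truncated box header")
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:  # 64-bit size stored right after the type
            if len(data) - pos < 16:
                raise ValueError("truncated 64-bit box header")
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:  # box runs to the end of the data
            size = len(data) - pos
        if size < header or pos + size > len(data):
            raise ValueError("bad box size")
        yield box_type.decode("latin-1"), data[pos + header:pos + size]
        pos += size


def split_fragment(data):
    """Return (moof_payload, mdat_payload), or raise if the layout differs."""
    boxes = list(iter_top_level_boxes(data))
    if sorted(box_type for box_type, _ in boxes) != ["mdat", "moof"]:
        raise ValueError("expected exactly one moof and one mdat")
    payloads = dict(boxes)
    return payloads["moof"], payloads["mdat"]
```

The moof payload is what a demuxer would inspect for sample sizes and
timing, while the mdat payload carries the coded data.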
> The client would apply some heuristics while requesting fragments and
> at some point it may decide to switch to another quality level. I suppose it
> would have to reconfigure the decoder and repeat it over and over until the
> end of that.
> I'm not sure how a decoder works, but I believe there is a way to configure
> that in order to receive future "injected" data.
> If you get all the way here, I really thank you!
> I wonder how to fit all this into the ffmpeg structure.
I will elaborate slightly on top of what Michael wrote.
The "standard" scheme for ffmpeg has three completely separate layers:
protocol -> demuxer -> codecs
The protocol takes a string (a URL of some kind) and outputs a stream of
bytes. The most basic protocol is the file protocol, which takes a file name
and just reads that file. Protocols can be nested (for example mmsh
internally uses http which internally uses TCP), but that is an
implementation detail that is not seen in the API (yet; there are plans to
do something for complex multistream protocols).
The demuxer reads a stream of bytes and then first populates a global data
structure, including one or several streams. Then it outputs a series of
packets. A packet is a sequence of bytes with a few simple pieces of
information attached: size, timestamp, and the stream it belongs to.
The codecs decode the packets. There is normally one codec per stream,
except if that stream is ignored. The codec initializes itself with the data
in the stream data structure, then accepts packets and possibly outputs
video frames, audio PCM data or anything else (subtitles).
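The three layers can be pictured with a toy pipeline in Python. Nothing here
is FFmpeg's API: the container format, the "decoder" and all names are
invented for illustration, and the step where a real demuxer first fills in
the stream descriptions is left out.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "stream_index timestamp data")


def toy_protocol(url, storage):
    """Protocol layer: a URL string in, a stream of byte chunks out."""
    data = storage[url]
    for i in range(0, len(data), 4):
        yield data[i:i + 4]


def toy_demuxer(chunks):
    """Demuxer layer: bytes in, packets out.

    Invented container: each packet is three header bytes (stream index,
    timestamp, payload size) followed by the payload.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while len(buf) >= 3 and len(buf) >= 3 + buf[2]:
            stream_index, timestamp, size = buf[0], buf[1], buf[2]
            yield Packet(stream_index, timestamp, buf[3:3 + size])
            buf = buf[3 + size:]


def toy_decoder(packet):
    """Codec layer: one packet in, decoded data out (here: upper-casing)."""
    return packet.data.decode("ascii").upper()


def play(url, storage):
    """Chain the layers: protocol -> demuxer -> codec."""
    return [(p.stream_index, p.timestamp, toy_decoder(p))
            for p in toy_demuxer(toy_protocol(url, storage))]
```

Note that the decoder only ever sees packets; it never learns how the bytes
were fetched or framed.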
AFAIK, in ffmpeg, the separation between demuxers and codecs has no real
exception. Which means that you should be able to completely ignore the
problem of codecs.
On the other hand protocols and demuxers sometimes need to work hand in hand.
In your particular case, the problem may be as simple as getting your
protocol handler to resynthesize proper ISOM headers and concatenate the
data to obtain a valid non-seekable ISOM stream.
At a later time, the ISOM demuxer could be adapted to be able to use the
seek-by-timestamp (read_seek) method that protocols can provide.
But those are just random thoughts, and I do not know enough of the ISOM
particulars to know if that is workable.
> I'm not that familiar with RTP but from what I've read in the past few
> minutes it sounds similar.
From what you described, RTP and SDP files are too simple to be of any use
> Yes. I've seen something about it. It looks suitable for the case.
> It may be my starting point for studying.
I believe that you can use the HTTP protocol handler directly as a backend,
like mmsh does.