[FFmpeg-user] How does the mkv demuxer really work?

Wed Oct 8 13:37:50 CEST 2014

Hi

I have a question that is perhaps in between ffmpeg-user and
ffmpeg-devel. Being in doubt, I am sending it to this ML.

First: a few words on the background and the problem:

I am trying to create a solution for realtime streaming of transcoded
AV files that supports seeking in the media.
That is, a web server that serves local videos but, upon request,
does not serve the video "as is": it transcodes the video and
streams it to the client via HTTP.
This would be very easy with ffmpeg; however, I also want to offer
the client the possibility to seek in the media.

So, the way I thought of doing this is:

0) transcoding using only h264 and a fixed keyframe interval (say, 250
frames).

1) using Matroska as the media container

2) generating a valid Matroska header that contains an artificial
position (offset) of the cues, since the server does not know the real
position until it finishes transcoding the whole file, which by design
never happens.

3) generating a valid but artificial cue block, containing <movie
length in secs / keyframe interval in secs> cue points, with correct
time points but artificial offsets (see below).

4) using a custom web server that is able to recognize the cue block
offset in the header, and serve the artificial cue block of point 3)

5) making the custom web server also interpret the artificial offset of
3) in order to extract the time offset from the byte offset of a seek
request.

6) making the custom web server start a new ffmpeg -ss <appropriate
time offset> upon each seek (= HTTP range request that is not a cue
block request)
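
Steps 3), 5) and 6) can be sketched roughly as follows. This is only an
illustration in Python, not an actual implementation: the names
build_cues, offset_to_time and ffmpeg_args, the stride of 10000 bytes
and the 10 s keyframe interval (250 frames at an assumed 25 fps) are all
made up for the example. The mapping ignores the size of the header, and
the ffmpeg command line is just one plausible invocation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CuePoint:
    time_s: float  # presentation time of the keyframe, in seconds
    offset: int    # artificial byte offset advertised for that keyframe


def build_cues(duration_s: float, interval_s: float,
               stride: int = 10000) -> List[CuePoint]:
    """Step 3: one cue point per keyframe interval, offsets = i * stride."""
    count = math.ceil(duration_s / interval_s)
    return [CuePoint(i * interval_s, i * stride) for i in range(count)]


def offset_to_time(offset: int, interval_s: float,
                   stride: int = 10000) -> float:
    """Step 5: snap a requested byte offset down to the nearest cue time."""
    return (offset // stride) * interval_s


def ffmpeg_args(source: str, start_s: float) -> List[str]:
    """Step 6: command line for a fresh transcode starting at start_s.

    Hypothetical choice of options: H.264 video, keyframe every 250
    frames, Matroska written to stdout.
    """
    return ["ffmpeg", "-ss", str(start_s), "-i", source,
            "-c:v", "libx264", "-g", "250",
            "-f", "matroska", "pipe:1"]
```

With a 100 s movie and a 10 s interval, build_cues(100, 10) yields 10
cue points, and a range request starting at byte 34567 maps back to
second 30.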

So, I started to prototype this tool (you can find my very first, ugly
attempt, which does not work, here:
https://github.com/paoletto/mediastreamer )

What I discovered is that the behavior of the ffmpeg matroska
demuxer used in players like mplayer or mpv-player is very predictable
when given a valid mkv file (that is, one with correct header and cues).

What it does (as seen from the requests received server side) is:
- start reading from the beginning.
- upon the first seek request, the player sends an HTTP range request
  for the cues offset and retrieves the cues.
- then, upon each seek, the player sends HTTP range requests to the
  server, with offsets that match cluster offsets (this is more or less
  true: I noticed that, for example, mpv-player requests exact cluster
  offsets, while mplayer requests offsets that are a little bit before
  the cluster boundaries, but still very close).
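
Server side, telling these three kinds of request apart comes down to
looking at the Range header. A minimal sketch in Python, assuming only
the open-ended or bounded "bytes=START-[END]" form is sent (suffix and
multi-range forms are reported as unsupported); parse_range, classify
and the cues_offset parameter are hypothetical names for this example.

```python
import re
from typing import Optional, Tuple

# Matches "bytes=START-" and "bytes=START-END" only.
_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def parse_range(header: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (start, end) from a Range header, end being None if open."""
    match = _RANGE_RE.match(header.strip())
    if match is None:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else None
    return start, end


def classify(header: Optional[str], cues_offset: int) -> str:
    """Label a request as 'start', 'cues', 'seek' or 'unsupported'."""
    if header is None:
        return "start"          # plain GET: read from the beginning
    parsed = parse_range(header)
    if parsed is None:
        return "unsupported"    # form not handled by this sketch
    start, _end = parsed
    if start == 0:
        return "start"
    if start == cues_offset:
        return "cues"           # the player is fetching the cue block
    return "seek"               # anything else is treated as a seek
```

Note that the exact-match test for the cues offset would not catch a
player that asks for an offset slightly before the one it was given, as
mplayer appears to do for clusters.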

Now, this all looks fine, but when I feed it my artificial cue block,
containing cue points with offsets that are, say, multiples of 10000
(or whatever other amount I tried), then the requests that arrive at
the server make no sense, as if the demuxer in the player completely
ignores the cues and starts firing range requests to heuristically find
the positions.
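
One detail that may matter when inventing offsets: in the Matroska
specification, CueClusterPosition is a Segment Position, i.e. it is
counted from the first byte of the Segment element's data, not from the
start of the file. Whether this accounts for the behavior above is not
established here; the Python sketch below only shows the conversion.
segment_data_start is a hypothetical value that would have to be taken
from the generated header.

```python
def cue_to_file_offset(segment_data_start: int,
                       cue_cluster_position: int) -> int:
    """Absolute file offset a player would request for a given cue."""
    return segment_data_start + cue_cluster_position


def file_offset_to_cue(segment_data_start: int, file_offset: int) -> int:
    """Inverse: Segment-relative position for an absolute file offset."""
    relative = file_offset - segment_data_start
    if relative < 0:
        raise ValueError("offset lies before the Segment data")
    return relative
```

For example, if the Segment data started at byte 52, a cue position of
10000 would correspond to a range request starting at byte 10052.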

My question is then: why this different behavior? And how does the
matroska demuxer really work?

Thanks, and sorry for the long message.

PS: I suspect that, even if the demuxer behavior can be tamed, to get
the whole idea working we (I) would possibly need to run a modified
version of ffmpeg server side since, from what I understand, the
demuxer needs correct time offsets in the clusters it receives.