[Libav-user] Video and audio timing / syncing

Brad O'Hearne brado at bighillsoftware.com
Mon Apr 1 04:50:06 CEST 2013

On Mar 31, 2013, at 6:32 PM, Kalileo <kalileo at universalx.net> wrote:

Kalileo -- thanks for the reply. I'm not sure if you've read this thread and everything I've written, but based on the questions it appears you may have missed a post or two, so please forgive me if there's a rehash here.

> There's a lot of half-theory in your questions, and i get confused about your intentions. Do you want to solve a problem (of video/audio not being in sync) or do you want to redesign the dts/pts concept of video/audio syncing?

> Didn't you say that it's _not_ in sync now? So obviously you've to correct one side, not do the same modification on both sides.
> I do not understand why you need to make this so complicated. It is so easy, same PTS = to be played at the same time.

I'll do my best to distill this all down as simply as possible. 

The goal: capture video and audio from QTKit, pass the decompressed video and audio sample buffers to FFmpeg for encoding to FLV, and output the encoded frames (ultimately to a network stream, but in my present prototype app, a file). This is a live capture-encode-stream use case in which the video is broadcast and played by multiple users in as near real time as possible. Latency and delay need to be minimized, and eliminated to the degree possible.

I have finally determined, through many hours of testing, that the problem here is NOT the pts and dts values I am assigning. Those values are 100% accurate -- every video and audio sample buffer received from QuickTime (QTSampleBuffer) delivers its exact presentation time, decode time, and duration. When I plug these values into the AVPacket pts and dts fields, video and audio are perfectly synced provided that -- and here's the crux of the issue -- the time_base.den value matches EXACTLY the *actual* frame rate of the captured frames being returned. If the actual frame rate differs from the frame rate indicated in time_base.den, the video does not play properly.

In my specific use case, I had configured a minimum frame rate of 24 fps on my QTDecompressedVideoCaptureOutput, and so, expecting that frame rate, I configured my codec context time_base.den to be 24 as well. What happened, however, is that despite being configured to output 24 fps, it actually output fewer, and when that happened, even though the pts and dts values were the exact ones delivered on the sample buffers, the video played much faster than it should, while the audio was still perfect. So I manually went through my console log, counted how many frames per second were actually being received from capture (15), and hard-coded 15 as the time_base.den value. I reran my code with no other changes, and the video and audio were synced perfectly. The problem is the nature of the time_base and however it is being used internally in encoding.
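To put numbers on the symptom, here is a minimal sketch of the arithmetic. It assumes (and this is only an assumption about what happens, not something verified against the avcodec internals) that the video pts effectively advances by one tick per frame in a 1/time_base.den time base. The function names are hypothetical and are not part of FFmpeg.

```c
#include <assert.h>
#include <stdint.h>

/* Presentation time, in seconds, of a pts expressed in units of
 * tb_num/tb_den seconds (the codec context time_base). */
double pts_to_seconds(int64_t pts, int tb_num, int tb_den)
{
    assert(tb_den > 0);
    return (double)pts * tb_num / tb_den;
}

/* Apparent playback speed-up when frames actually arrive at actual_fps
 * but each frame advances pts by one tick of a 1/assumed_fps time base.
 * ASSUMPTION: one pts tick per frame; this is illustrative only. */
double playback_speed_factor(int assumed_fps, int actual_fps)
{
    assert(assumed_fps > 0 && actual_fps > 0);
    return (double)assumed_fps / actual_fps;
}
```

Under that assumption, one second of capture at 15 fps produces 15 ticks, which a 1/24 time base presents in 15/24 = 0.625 seconds, so the video would run 24/15 = 1.6 times too fast while the audio stays correct. With time_base.den set to 15, the same 15 ticks span exactly one second.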

Here is the present problem in a single statement: the encoding process requires that the time_base.den value on the codec context be set *prior to encoding* to a fixed fps, but if actual fps varies from the time_base.den fps, the video doesn't play properly (and also any relative adjustment you try to make to pts in time_base units will be off as well). That's it in a nutshell -- there's no guarantee that a capture source is going to deliver frames at the fixed fps in the time_base, and if it doesn't, timing is off.

I don't know how the various codecs work internally (mine is adpcm_swf), but just from pounding on them with tests from the outside, it appears that the time_base.den governs most everything. As stated, unfortunately it wants a fixed value for a variable unit (in a capture scenario), so even though I have presentation time, decode time, and duration, the disparity between the actual frame rate and the time_base.den throws everything off.

I am curious about the purpose and use of the AVPacket.duration value. I suspect it isn't being used at all. I cannot verify this at this point, but one possibility is that QuickTime is accomplishing a 30 fps frame rate by delivering 15 fps with each frame's duration doubled. I'd guess that if the codec context had a time_base oriented to time (such as milliseconds), a metric which does not fluctuate, and duration was considered, none of this would be a problem. Not knowing the internals of avcodec, however, I cannot say for sure.
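As a rough illustration of the time-oriented idea, the sketch below converts a capture timestamp given as value/timescale seconds into ticks of a 1/1000 (millisecond) time base, so the result does not depend on how many frames arrive per second. The CaptureTime struct and capture_time_to_ticks are hypothetical names used only for this example; in FFmpeg the equivalent conversion between two time bases is done with av_rescale_q(). Whether a given encoder accepts a millisecond time_base is codec-dependent and is not established here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a capture timestamp: value / timescale
 * seconds (similar in spirit to a timeValue/timeScale pair). */
typedef struct {
    int64_t value;
    int32_t timescale;
} CaptureTime;

/* Convert a capture timestamp to ticks of a 1/dst_den time base,
 * rounding to the nearest tick. Non-negative timestamps only; the
 * multiplication can overflow for extremely large values. */
int64_t capture_time_to_ticks(CaptureTime t, int32_t dst_den)
{
    assert(t.timescale > 0 && dst_den > 0);
    assert(t.value >= 0);
    return (t.value * dst_den + t.timescale / 2) / t.timescale;
}
```

For example, frames captured at 15 fps with a timescale of 15 map to 0, 67, 133, ... milliseconds, and the same function gives correct values if the capture rate drifts, because each pts comes from the frame's own timestamp rather than from a frame count.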

But what QuickTime does is a separate issue on the QuickTime side of things, and it doesn't change the problem in FFmpeg (we are going to be doing the same thing on a Windows box soon, so it will be the same there with Windows hardware). With differing cameras, computers, etc., the capture frame rate cannot be assumed to be fixed (nor is it known up front), so having to specify an accurate fixed fps for time_base.den is problematic, unless there's another way to rectify the problem.

I hope that helps clear up any confusion. 


