[Libav-user] Video and audio timing / syncing

Thu Apr 4 06:03:20 CEST 2013

On Mar 31, 2013, at 11:36 PM, Brad O'Hearne <brado at bighillsoftware.com> wrote:
> Presuming there's no unknowns about changing the time_base.den on the fly throughout encoding, problem solved.

> Throughout the weeks of Googling and reading endless source code, forum / mailing list posts, blogs, etc. on this, I had picked up the impression that time_base.den was to be set once prior to encoding and not mucked with thereafter. However, I just used the duration to calculate the frame rate and now I'm setting the time_base.den prior to pts and dts for every frame. Works great. 

For the sake of those who might follow with a similar use case, and as a basis for a suggestion, I need to add a footnote to my previous email on the resolution. As it turned out, the testing that produced the "Works great" conclusion above didn't vary the frame rate enough to expose a real problem with this approach. By chance, I came across a manipulation of the video camera that cut the frame rate in half. When time_base.den was changed on the fly to match the new frame rate, the subsequent pts values, now expressed in the new time_base units, produced inaccurate timing, the "non-monotonically increasing..." error for pts and dts, and the out-of-sync audio and video problem again.
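To make the failure concrete, here is a minimal sketch in plain C (no FFmpeg calls). It assumes pts is derived from elapsed capture time multiplied by the current time_base.den; the helper name pts_in_ticks and the sample timings are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: express an elapsed capture time (milliseconds)
 * as a tick count in a 1/den time base, truncating toward zero. */
static int64_t pts_in_ticks(int64_t elapsed_ms, int den)
{
    return elapsed_ms * den / 1000;
}
```

With den = 30, a frame captured at 2000 ms gets pts 60. If the camera then drops to 15 fps and den is switched to 15, the next frame at 2067 ms gets pts 31, which is lower than the previous pts of 60 even though it is later in time. Keeping den at 30 gives pts 62 for that same frame, which still increases.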

As it turns out, my original impression that time_base should not be changed on the fly was correct. The time_base should not be changed to match a variable frame rate; it should be set up front and remain constant for the entire encoding process. Given the current definition of time_base, the proper way to handle a variable frame rate is to do the following:

1. Set time_base.den to a value that you can assume the frame rate will never exceed. I set my time_base.den value to 30, as I didn't foresee ever receiving a frame rate higher than 30fps.

2. Use pts and dts values which increment by 1 for every frame. 

3. Use the presentation time and duration of the received sample buffer to calculate the current frame rate, and from that determine whether the current frame should be encoded/written 0 times, 1 time, or multiple times, based on how many frames the encoder is expecting at that specific pts. In other words, if the frame rate is bouncing around, a particular frame may need to be written only once (normal), multiple times (the frame rate has dropped), or not at all (the frame rate has increased).
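The three steps above can be sketched roughly as follows. This is not the code from the sample app, only one possible way to do the bookkeeping in plain C. It assumes the capture layer reports presentation time and duration in whole milliseconds; the names FrameClock and frames_to_write are hypothetical.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for a fixed 1/den time base. */
struct FrameClock {
    int     den;       /* fixed time_base.den, e.g. 30 */
    int64_t next_pts;  /* pts the encoder expects next; increments by 1 per frame */
};

/* Returns how many times the captured frame should be written:
 * 0 (drop), 1 (normal), or more (repeat), so that the output keeps
 * pace with the fixed time base. Assumes non-negative timestamps. */
static int frames_to_write(struct FrameClock *c, int64_t pts_ms, int64_t duration_ms)
{
    /* Tick at which this frame stops being shown, rounded to nearest. */
    int64_t end_tick = ((pts_ms + duration_ms) * c->den + 500) / 1000;
    int64_t count = end_tick - c->next_pts;

    if (count < 0)
        count = 0;          /* encoder is already ahead: drop this frame */
    c->next_pts += count;   /* each written copy consumes one pts */
    return (int)count;
}
```

The caller would encode the frame that many times, assigning consecutive pts values ending at next_pts - 1. With den = 30, a 33 ms frame yields 1, a 67 ms frame (the rate has halved) yields 2, and every other 17 ms frame (the rate has doubled) yields 0.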

That last step, which ironed out all timing issues, made clear to me some of the things I had seen in various examples on the Internet (though not the official FFmpeg examples) which spoke of "delayed frames". I'm not completely sure it was the exact same problem being addressed, but after having to do this myself, the general idea in play made sense -- bottom line, the implementer has to fabricate fixed fps out of variable fps.

As a suggestion, I would ask that the FFmpeg maintainers either consider adding this fps smoothing for variable fps inside of avcodec, or alternatively reconsider anchoring time_base to the potentially variable metric of frame rate, in favor of a fixed metric against which pts and dts can reliably and easily be converted. Frame rate is only truly fixed with either auto-generated frames (such as the FFmpeg examples) or when encoding a pre-existing file. But for live capture, frame rate is variable -- hardware / software / latency etc., not to mention that the capture mechanism in play (QTKit or otherwise) doesn't necessarily guarantee *any* particular frame rate.

I am not sure of the design reason for making time_base effectively frame-rate units, but as stated, frame rate is a potentially varying metric. I would think that a time_base anchored to a fixed metric (such as time itself, e.g. milliseconds, 1000 per second) would be a much more reliable and versatile design, as it would serve fixed and variable frame rate scenarios equally well. I found it a little strange that I was receiving sample buffers from the capture mechanism with *exact* decode time, presentation time, and duration, and yet, while logically this is completely sufficient info to set frame timings, gymnastics and compensation were required to accommodate a fixed frame rate, which, as stated, is basically fictional in a live-capture scenario.
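As an illustration of the idea of a time-anchored base, here is a small plain-C sketch of converting a timestamp between two rational time bases with rounding to nearest. FFmpeg's libavutil provides av_rescale_q for this kind of conversion; the stand-in below uses the hypothetical names Rational and rescale_ts, assumes non-negative timestamps, and ignores overflow for very large values.

```c
#include <stdint.h>

/* Hypothetical rational type: one tick lasts num/den seconds. */
struct Rational { int num; int den; };

/* Convert ts from time base 'from' to time base 'to', rounding to nearest.
 * Assumes ts >= 0 and that the intermediate product fits in 64 bits. */
static int64_t rescale_ts(int64_t ts, struct Rational from, struct Rational to)
{
    int64_t n = ts * from.num * to.den;
    int64_t d = (int64_t)from.den * to.num;
    return (n + d / 2) / d;
}
```

With a 1/1000 base, the capture presentation time in milliseconds could serve as pts directly, and it keeps increasing no matter how the frame rate moves. For example, 2067 ms rescaled from 1/1000 to 1/90000 is 186030, while 33 ms rescaled to 1/30 rounds to 1.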

If there's no alteration to the time_base design, then I would again encourage adding fps smoothing to avcodec. If even that is not possible or desirable, at least add the algorithm for doing so to the FFmpeg code examples.

While I have some code cleanup yet to do, I have updated the video streaming part of my sample app to include this handling, if anyone now or down the road can benefit: 

https://github.com/BigHillSoftware/QTFFmpeg

Cheers, 

Brad