[Libav-user] Video and audio timing / syncing

Mon Apr 1 01:31:33 CEST 2013

On Mar 31, 2013, at 1:25 AM, Alex Cohn <alexcohn at netvision.net.il> wrote:
> I am not sure when "duration" is taken into account, but you could
> simply set current->pts = prev->pts+2. Note that this was my original
> proposal.

Alex -- I considered that approach and essentially ran the same test, manipulating the pts values by hand. The problem with it is twofold:

- A "+2" pts increase will likely only work when the actual frame rate is half the expected frame rate that was used to set the time_base initially. That's not my use case (the closest approximation is 24 fps expected versus 15 fps on the particular computer / camera I'm testing with). And I think, in theory, the only video source that can feed an encoder fixed-fps video is one that is fully known and controllable prior to encoding (like an existing video file, or generated data as in the examples). More about this in a moment.

- If I'm going to muck with the video pts in this fashion, the audio pts also has to be mucked with to keep it in sync, and of course, audio samples are received at a different rate than the video. So that's a complexity there. 
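
The mismatch in the first point can be made concrete with a little arithmetic. The sketch below is plain C with no libav dependency; `pts_from_elapsed_us` is a hypothetical helper, and it assumes each captured frame comes with an elapsed capture time in microseconds. It is an illustration of the rounding involved, not a statement of how any encoder handles the result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a libav function): convert an elapsed capture
 * time in microseconds into a pts counted in ticks of 1/ticks_per_sec
 * seconds, rounding to the nearest tick. Assumes non-negative input and
 * values small enough that the multiplication does not overflow int64_t. */
static int64_t pts_from_elapsed_us(int64_t elapsed_us, int64_t ticks_per_sec)
{
    assert(elapsed_us >= 0 && ticks_per_sec > 0);
    return (elapsed_us * ticks_per_sec + 500000) / 1000000;
}
```

With a 1/24 time base, frames captured at 15 fps (66666 us apart) land on pts 0, 2, 3, 5, 6, 8: the step alternates between 2 and 1, so a constant "+2" only matches a source running at exactly 12 fps. Passing the same elapsed time with `ticks_per_sec` set to the audio sample rate (for example 44100) is one way both streams could be stamped from a single capture clock, assuming the audio stream's time base is 1/sample_rate.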

The bigger picture is the matter of having encoding determined by a frame rate-based time_base. As I alluded to above, unless you are just generating audio/video at runtime, or are reading from a file where the video obviously already exists and all the meta-data is available up front, I would think that any live capture video source is theoretically variable in nature, due to almost certain variances in hardware, computer, etc. 

That said, it raises the question of why the time base is expressed in terms of frame rate rather than in terms of time. One thing I encountered with some frequency in my Google journeys was blog posts and examples that set time_base.den to 1000, meaning milliseconds in a second rather than a frame rate. This makes more sense to me, but there must be a reason that time_base was pinned to frame rate (if anyone knows, I'd be interested to hear).
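
For reference, moving a timestamp between a frame-rate time base and a millisecond one is just a rational rescale. libavutil provides av_rescale_q() for this; the sketch below is a self-contained stand-in so the arithmetic is visible, where `Rational` and `rescale_q` are hypothetical names and only non-negative timestamps are handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a rational time base (seconds per tick). */
typedef struct {
    int64_t num;
    int64_t den;
} Rational;

/* Convert timestamp v from time base src to time base dst, rounding to
 * the nearest tick: v * (src.num / src.den) / (dst.num / dst.den).
 * Assumes v >= 0 and no int64_t overflow in the intermediate product. */
static int64_t rescale_q(int64_t v, Rational src, Rational dst)
{
    int64_t n = src.num * dst.den;
    int64_t d = src.den * dst.num;
    assert(v >= 0 && n > 0 && d > 0);
    return (v * n + d / 2) / d;
}
```

For example, pts 5 in a 1/24 base rescales to 208 in a 1/1000 base, and a capture timestamp of 1001 on a 30000-per-second scale rescales to 33. Whether a particular encoder accepts a 1/1000 time_base is a separate question that this sketch does not answer.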

I'm not sure if this was the intended design or not, but it seems peculiar to me that while I have sample buffers that provide the exact presentation time (with scale), decode time (with scale), and duration (with scale) for every single frame, this information does not appear to be enough on its own to encode the frames with proper timing, simply because the codec context needs a fixed frame rate configured up front before encoding begins. Somehow that just seems wrong, and I feel like I must be missing something simple -- having all of this information should be enough for encoding.

I'll ask this one again, because I would think this would be the way to iron out the discrepancy: what is the net effect of duration on timing? If I set an accurate pts and dts (which I am; I have the exact info for these), why does setting the exact duration of the frame not account for the other missing piece of the puzzle? Essentially that info equals sequence, start time and length -- that would seem complete. Any thoughts?
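
To spell out what "sequence, start time and length" means as data: a timeline is contiguous when each frame's pts plus its duration equals the next frame's pts. The sketch below only checks that property; `FrameTiming` and `first_discontinuity` are hypothetical names, and nothing here claims to describe how the encoder or muxer actually uses the duration field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-frame timing record, all values in one time base. */
typedef struct {
    int64_t pts;
    int64_t duration;
} FrameTiming;

/* Return the index of the first frame whose end (pts + duration) does not
 * coincide with the next frame's pts, or n if the timeline is contiguous. */
static size_t first_discontinuity(const FrameTiming *f, size_t n)
{
    assert(f != NULL || n == 0);
    for (size_t i = 0; i + 1 < n; i++) {
        if (f[i].pts + f[i].duration != f[i + 1].pts)
            return i;
    }
    return n;
}
```

In milliseconds, frames captured at roughly 15 fps with their measured durations (67, 66, 67) form a contiguous timeline, while the same pts values paired with a 42 ms duration derived from an assumed 24 fps leave a gap after the very first frame.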

Brad