[Libav-user] Suggestions on pts / dts

Mon Apr 15 20:09:48 CEST 2013

Hello, 

I've been on the FFmpeg-user and Libav-user mailing lists for over a year now. Judging from the problems I've had myself and those others have posted, I'd wager that, if stats were kept, by far most of the problems raised on both lists deal with timing issues related to pts / dts. General understanding and documentation of pts / dts on the coding side range from scant to non-existent. After literally weeks of testing pts and dts issues, I have a few comments and suggestions which will hopefully make the situation a little better for those trying to use FFmpeg and understand how it works. Note that I'm still in the dark about the finer points myself and may have misconceptions, but these comments come out of my direct experience solving specific problems with the libraries.

1. From testing I've done, it seems that the pts used in encoding is relative to a time_base derived from the pre-configured / desired frame rate. The pts used in writing/streaming, by contrast, is relative to a time_base measured in units of time: the stream's time_base.den was 1000 by default, so I assume it measures milliseconds. To sum up, when feeding frames to the encoder, pts indicates which frame/sample the frame contains (from which the timing is derived), while when feeding packets for writing, pts indicates an actual time increment. The encoding approach is fairly optimistic, since it assumes an exact frame rate and no gaps between frames. That is fine for processing existing videos, but not particularly good for live capture scenarios. As a suggestion, it would be better if the encoding pts worked more like the writing pts. Regardless, it would be very helpful to have documentation explaining how pts differs before encoding vs. before writing, and how to convert between the two. Code snippets and discussions around the Internet are all over the map on this, and they disagree with each other.
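The conversion between the two time bases is a plain rational rescale. Below is a minimal C sketch of that arithmetic, assuming a codec time_base of 1/30 and a stream time_base of 1/1000 as described above. `Rational` and `rescale_ts` are hypothetical stand-ins so the code compiles without FFmpeg headers. In libavutil the corresponding call is av_rescale_q(pts, codec_time_base, stream_time_base). Later releases also offer av_packet_rescale_ts() to convert a whole packet's timestamps at once. Both library functions also guard against overflow, which this sketch does not.

```c
#include <stdint.h>

/* Hypothetical stand-in for libavutil's AVRational (num/den seconds). */
typedef struct {
    int num;
    int den;
} Rational;

/* Rescale ts from time base `from` to time base `to`:
 *   result = ts * (from.num / from.den) / (to.num / to.den)
 * rounded to the nearest integer, ties away from zero, similar to
 * av_rescale_q(). Assumes positive num/den and no 64-bit overflow. */
static int64_t rescale_ts(int64_t ts, Rational from, Rational to)
{
    int64_t num = ts * from.num * (int64_t)to.den;
    int64_t den = (int64_t)from.den * to.num;

    if (num >= 0)
        return (num + den / 2) / den;
    return -((-num + den / 2) / den);
}
```

For example, frame 30 in a 1/30 time base becomes 1000 in a 1/1000 time base (one second), and frame 1 becomes 33 after rounding.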

2. When I call avcodec_encode_audio2 and it returns a packet, I haven't been able to figure out why that packet doesn't have its pts set properly. The encoder has all the information it needs to know what the pts of the returned packet should be. So, by way of suggestion, why not set the packet's pts in the encoder rather than pushing that onto the implementer? I suppose one could argue that you don't necessarily know what the returned packet will be used for. But will it ever *not* need the proper pts for writing anyway? And if there are such cases, why not add both an encoded pts and a write pts to the packet, since both are needed in different scenarios?
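Until the encoder does this for you, one common workaround is to count samples yourself. The sketch below assumes the audio codec time_base is 1/sample_rate, so each frame's pts is simply the number of samples sent before it. `AudioClock`, `audio_clock_stamp` and `samples_to_ms` are hypothetical names, not libavcodec API. The 1/1000 target mirrors the stream time_base mentioned in point 1.

```c
#include <stdint.h>

/* Hypothetical per-stream counter of samples handed to the audio encoder. */
typedef struct {
    int64_t next_pts; /* pts of the next frame, in units of 1/sample_rate */
} AudioClock;

/* Return the pts for a frame holding nb_samples samples (per channel),
 * assuming codec time_base = 1/sample_rate, then advance the clock. */
static int64_t audio_clock_stamp(AudioClock *clk, int nb_samples)
{
    int64_t pts = clk->next_pts;
    clk->next_pts += nb_samples;
    return pts;
}

/* Convert a sample-count pts to milliseconds (a 1/1000 time base),
 * rounding to nearest; assumes pts >= 0 and sample_rate > 0. */
static int64_t samples_to_ms(int64_t pts, int sample_rate)
{
    return (pts * 1000 + sample_rate / 2) / sample_rate;
}
```

With 1024-sample frames at 44100 Hz, the fourth frame gets pts 3072, which is about 70 ms.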

3. I would suggest either revising the muxing.c example or adding a muxing2.c example that addresses the actual pts / dts issues you run into when encoding and writing. The muxing.c example may even add to the confusion; it did in my case. This example leaves out a number of things that were required in my use case. If other posts on the mailing list are any indication, these things are common to others' use cases as well:

- Muxing.c does not set the time_base for audio. 

- Muxing.c never sets a pts for audio packets, either before encoding or before writing.

- Muxing.c does not set a pts for video packets prior to writing. It sets pts on the video frame *after* writing. Because this happens in a loop, that effectively means before encoding the next frame. But it never appears to deal with pts prior to writing.

- Muxing.c feeds the encoder a guaranteed constant frame rate (consistent with time_base.den). Again, that is fine for generated and pre-existing content. It doesn't work for capture scenarios, though, where frames may drop. Some past responses to my questions about this said that setting the proper pts would solve the problem regardless. In my testing, this was not the case. You can set the proper pts for a video frame, but you must also send the exact number of frames needed to maintain the frame rate configured in your codec context's time_base.den. If you don't, your pts will *not* correct the problem, and your video timing will be wrong. In other words, say you configure a time_base.den of 30 and then feed the encoder 15 fps. *Even if the pts values are accurate for each of the 15 frames* (i.e. they already account for the halved frame rate), your encoded video will end up playing at twice the speed (in half the time). The pts value does not solely determine timing and playback: your pts values can be accurate and your playback timing STILL off. The solution is to guarantee the frame rate configured in time_base.den. That requires compensating, when necessary, by re-encoding the last frame or by encoding the current frame multiple times. Muxing.c really needs to address this scenario; generating faux video and audio sidesteps this reality completely.

- Muxing.c does not address network streaming, which apparently has different nuances that affect encoding / writing. I cannot say exactly why, as I do not know the internals of the encoder or the stream-writing code. Still, writing a local file and streaming differ in the pts and dts values they require. I was able to encode and write perfect video-only, audio-only, and combined video/audio local files and play them back without a problem. Pushing the exact same data across a network stream, however, resulted in bad video, bad audio, or video / audio sync problems. The server-side logging of the system I used (Wowza 3.5.x) is what told me the timecodes being received weren't right, and it also fingered the encoder as the problem. This had no visible effect when I wrote a local video file and played it back, so I didn't even know the problem existed until I streamed across the network. Changing the pts / dts values solved the problem completely. If these values are bad, I'm not sure why they would be considered good when writing locally but bad when streaming. I would think pts / dts values are either good or bad no matter where they are written, and either scenario would generate similar errors.
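The duplicate-frame workaround from the constant-frame-rate bullet above can be sketched as follows. This is a minimal C sketch, assuming each captured frame carries a capture time in milliseconds since capture started and the codec time_base is 1/fps. `CfrFeeder` and `cfr_copies` are hypothetical names, not libavcodec API. For each captured frame, the function reports how many times to send that frame to the encoder and which pts the first copy gets. The goal is that exactly fps frames are encoded per second of capture time, with the current frame duplicated to fill any gap.

```c
#include <stdint.h>

/* Hypothetical state for feeding a constant frame rate to the encoder. */
typedef struct {
    int fps;            /* frame rate configured as codec time_base = 1/fps */
    int64_t next_index; /* frame slot (= codec pts) the encoder expects next */
} CfrFeeder;

/* Given a captured frame's time in ms since capture start (>= 0), return how
 * many times to encode it and store the pts of the first copy in *first_pts.
 * 0 means the frame landed in a slot already filled (drop it); more than 1
 * means earlier slots were missed, so the frame is duplicated to fill them.
 * Copy k (0-based) should be stamped with *first_pts + k. */
static int cfr_copies(CfrFeeder *f, int64_t capture_ms, int64_t *first_pts)
{
    /* Nearest frame slot for this capture time. */
    int64_t slot = (capture_ms * f->fps + 500) / 1000;

    *first_pts = f->next_index;
    if (slot < f->next_index)
        return 0;

    int copies = (int)(slot - f->next_index + 1);
    f->next_index = slot + 1;
    return copies;
}
```

Suppose 15 fps captures (at roughly 0, 67, 133 ms, ...) are fed into a 1/30 time base this way. Every frame after the first is then encoded twice, which keeps the encoder at 30 frames per second of capture time.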

I also encountered one additional problem. The av_write_frame call worked fine for audio only and for video only, but it crashed the stream when both audio and video were written. I had to switch to av_interleaved_write_frame. Perhaps this is expected behavior, but I was unaware of it and of the reason for it. If it is expected, it would be good to document.
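For what it's worth, the libavformat documentation describes av_write_frame() as passing packets directly to the muxer and leaving correct interleaving to the caller. By contrast, av_interleaved_write_frame() buffers packets internally so that they are written in increasing dts order across streams. The toy C sketch below only illustrates that ordering rule for two streams with different time bases; it is not libavformat's implementation. `Rational`, `compare_ts` and `interleave_order` are hypothetical names. `compare_ts` mimics what libavutil's av_compare_ts() computes.

```c
#include <stdint.h>

/* Hypothetical stand-in for libavutil's AVRational. */
typedef struct {
    int num;
    int den;
} Rational;

/* Compare ts_a (in tb_a) with ts_b (in tb_b) exactly by cross-multiplying:
 * returns -1, 0 or 1. Assumes positive time bases and no 64-bit overflow. */
static int compare_ts(int64_t ts_a, Rational tb_a, int64_t ts_b, Rational tb_b)
{
    int64_t a = ts_a * tb_a.num * (int64_t)tb_b.den;
    int64_t b = ts_b * tb_b.num * (int64_t)tb_a.den;
    return (a > b) - (a < b);
}

/* Merge two streams' packets (represented only by their dts) into one write
 * order, always emitting the packet with the earlier dts first; ties go to
 * stream A. out[k] is 0 for a packet from stream A, 1 for stream B.
 * Returns the number of entries written (n_a + n_b). */
static int interleave_order(const int64_t *dts_a, int n_a, Rational tb_a,
                            const int64_t *dts_b, int n_b, Rational tb_b,
                            int *out)
{
    int i = 0, j = 0, k = 0;

    while (i < n_a || j < n_b) {
        if (j >= n_b ||
            (i < n_a && compare_ts(dts_a[i], tb_a, dts_b[j], tb_b) <= 0)) {
            out[k++] = 0;
            i++;
        } else {
            out[k++] = 1;
            j++;
        }
    }
    return k;
}
```

Take video at 1/30 and 1024-sample audio frames at 1/44100. The resulting order comes out as video, audio, audio, video, audio, video, audio, rather than all of one stream followed by the other.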

So to sum up, FWIW, I'm basing this on the issues I see appearing on the mailing list, my weeks of working through this, and all the differing opinions I've come across on what it all means and how to handle it. I'd suggest documenting pts / dts and their specific usage and intent with AVFrame and AVPacket, both prior to encoding and prior to writing, and providing an example that covers this material. I think it might help a lot of people understand what FFmpeg is doing with pts / dts and get their code working.

Thanks for all of your help....

Cheers,

Brad