[FFmpeg-user] IP camera recording via RTSP: audio/video desync (dropped frames?)

Sat Jul 16 10:21:21 EEST 2022

After another day of searching for solutions, I finally managed to come 
up with one. Remember I said that "-use_wallclock_as_timestamps 1" fixes 
the sync issues but causes the video to stutter? So I attempted to fix 
the stuttering with the following filter:

> setts='max(floor(PTS/X)*X,if(N,PREV_OUTPTS+X))'

Note the X: it has to be substituted with a constant depending on the 
use case (recording/streaming), and/or the stream the filter is being 
applied to.

Explanation: this filter expects input timestamps to be generated from 
the wallclock time but not necessarily spread out evenly. To fix 
stuttering, it snaps each timestamp down to a multiple of X, where X is 
the duration of one frame in the units the timestamps are measured in 
(for 25 FPS and millisecond timestamps this is 0.04*1000=40): PTS/X is 
the fractional frame number, floor() rounds it down to an integer, and 
multiplying by X turns it back into a timestamp. It also ensures that 
the timestamps are always increasing - if the adjusted value is found to 
be less than the previous output value plus one frame, then this sum is 
used as the output instead. For the very first packet (N=0) the if() 
evaluates to 0, so the rounded timestamp is used as is.
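
To illustrate the behaviour described above, here is a minimal Python 
sketch that emulates the expression packet by packet. It is not FFmpeg 
code: the function name and the sample timestamps are made up for the 
example, and integer timestamps are assumed.

```python
def emulate_setts(pts_list, x):
    """Emulate max(floor(PTS/X)*X, if(N, PREV_OUTPTS+X)) for integer timestamps."""
    out = []
    for n, pts in enumerate(pts_list):
        # floor(PTS/X)*X: snap down to the nearest multiple of X
        snapped = (pts // x) * x
        # if(N, PREV_OUTPTS+X): 0 for the first packet, previous output + X afterwards
        lower_bound = out[-1] + x if n else 0
        out.append(max(snapped, lower_bound))
    return out

# Hypothetical wallclock timestamps in ms: roughly 40 ms apart, with jitter and a burst
jittery = [1000, 1043, 1079, 1131, 1133, 1135, 1240]
print(emulate_setts(jittery, 40))
# prints [1000, 1040, 1080, 1120, 1160, 1200, 1240]
```

The burst of three packets around 1131..1135 is spread out over 
consecutive 40 ms slots by the PREV_OUTPTS+X term, while isolated jitter 
is removed by the rounding alone.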

There is also a catch here - the camera can sometimes alter the frame 
rate, which causes the formula to produce weird results. Rectifying this 
is possible by assuming a constant frame rate for the input stream ("-r 
25").

Also note that for smooth playback, the filter has to be applied to both 
audio and video. When recording segments without re-encoding audio, X 
should be set to 40 for both streams (assuming 25 FPS). Streaming, 
however, is another matter. When RTMP streaming via "-f flv" (I use 
fifo+flv), X has to be set to 1 for the video stream (I think in this 
case "floor" can be dropped since all values seem to be integers, but 
I'm not entirely sure). Audio is another beast entirely: X should be 320 
because the original sampling frequency is 8000 Hz mono, meaning 
8000/25=320 samples per video frame. Since the audio is re-encoded for 
streaming, the expression also has to be applied as an "asetpts" filter 
via "-af" instead of "setts" via "-bsf:a", so that it operates on the 
source data.
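
The three values of X above all follow the same rule: the number of 
timestamp units that fit into one video frame. A small Python sketch of 
that arithmetic (the helper name is hypothetical, and the units per 
second for each case are taken from my observations above, not from the 
FFmpeg documentation):

```python
from fractions import Fraction

def frame_step(units_per_second, fps):
    """Return X: the duration of one video frame in timestamp units."""
    step = Fraction(units_per_second) / Fraction(fps)
    if step.denominator != 1:
        # The expression relies on X being a whole number of units
        raise ValueError(f"{units_per_second} units/s is not divisible by {fps} FPS")
    return int(step)

print(frame_step(1000, 25))  # millisecond timestamps (mkv segments): prints 40
print(frame_step(8000, 25))  # 8000 Hz audio, one unit per sample: prints 320
print(frame_step(25, 25))    # timestamps counted in frames (flv video): prints 1
```

For a camera with a different frame rate or audio sampling frequency, 
the constants in the filters would have to be recomputed accordingly.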

The final command-line is as follows (both recording and streaming):

> ffmpeg -nostdin -flags low_delay -fflags +nobuffer+discardcorrupt \
> -rtsp_transport tcp -timeout 3000000 -use_wallclock_as_timestamps 1 \
> -r 25 -i rtsp://login:password@ip.ad.dre.ss:554/url \
> -map 0:v -c:v copy \
> -bsf:v setts='max(floor(PTS/40)*40,if(N,PREV_OUTPTS+40))' \
> -map 0:a -c:a copy \
> -bsf:a setts='max(floor(PTS/40)*40,if(N,PREV_OUTPTS+40))' \
> -f segment -strftime 1 -reset_timestamps 1 -segment_atclocktime 1 \
> -segment_time 600 "%Y-%m-%dT%H-%M-%S.mkv" \
> -map 0:v -c:v copy -bsf:v setts='max(floor(PTS),if(N,PREV_OUTPTS+1))' \
> -map 0:a -c:a aac -ar 48000 -ac 2 -b:a 128k \
> -af asetpts='max(floor(PTS/320)*320,if(N,PREV_OUTPTS+320))' \
> -f fifo -fifo_format flv -drop_pkts_on_overflow 1 -attempt_recovery 1 \
> -recover_any_error 1 -format_opts flvflags=no_duration_filesize \
> rtmp://<STREAM_URL>

NB: the current version of FFmpeg in the FreeBSD ports collection 
(4.4.2) needs these two patches for the proposed solution to work:

https://github.com/FFmpeg/FFmpeg/commit/301d275301d72387732ccdc526babaf984ddafe5
https://github.com/FFmpeg/FFmpeg/commit/b0b3fce3c33352a87267b6ffa51da31d5162daff

The first patch fixes the expression parser erroring out, and the second 
one fixes PREV_OUTPTS always being equal to NOPTS. Also, on that 
version the "-timeout" option in the command above has to be replaced 
with "-stimeout".

I'm still not sure if this solution is the proper one. So far, it's been 
running for many hours, and the resulting video is smooth as butter, and 
without any gradually increasing audio/video lag. But it looks extremely 
overcomplicated, not to mention it took me several days of researching 
and analyzing the video files to implement. Also, I don't know where the 
timestamp drift actually occurs - most signs point to the camera, but 
there's also the fact that some sort of conversion takes place depending 
on the output (e.g. segment/mkv measures timestamps in 1/1000ths of a 
second, but flv measures them in frames), and it might be possible that 
there's a bug somewhere in there.

For simplicity though, let's assume there's no bug, and the fault occurs 
at the source. We know that the audio is always on time, so why not use 
the timestamps of the audio packets for the video too? E.g. for each 
incoming video frame, assign it the timestamp of the latest audio packet 
received (not the wallclock time). The problem is that "setts" filters 
cannot interact with each other, so it's not possible to use them for 
this purpose.
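
Just to make the idea concrete, here is a rough Python sketch of what 
such a cross-stream rule would do. This is purely a thought experiment 
on made-up data: the function name and the packet list are hypothetical, 
and nothing like this is available through "setts".

```python
def stamp_video_from_audio(packets):
    """packets: (kind, pts) tuples in arrival order, kind is "a" or "v".

    Returns the new video timestamps, each taken from the latest audio
    packet received before that video frame.
    """
    last_audio_pts = None
    video_out = []
    for kind, pts in packets:
        if kind == "a":
            last_audio_pts = pts
        elif last_audio_pts is None:
            # No audio seen yet: keep the frame's own timestamp
            video_out.append(pts)
        else:
            video_out.append(last_audio_pts)
    return video_out

# Hypothetical arrival order, timestamps in ms
arrival = [("a", 0), ("v", 7), ("a", 40), ("v", 52), ("v", 55), ("a", 80), ("v", 91)]
print(stamp_video_from_audio(arrival))
# prints [0, 40, 40, 80]
```

Note that two frames arriving between the same pair of audio packets end 
up with identical timestamps (40 and 40 here), so even this approach 
would still need something like the PREV_OUTPTS+X guard to keep the 
output strictly increasing.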

Well, even though I've managed to somehow deal with this problem, I'm 
still no expert. So further comments are still welcome. Until then, I 
hope the information provided in this thread will be useful to anybody 
who encounters a similar issue.

Thank you very much.

---
Kind regards,
Vladimir