[Libav-user] Fwd: encode video and audio from .MOV

Tue Nov 4 15:19:48 CET 2014

On Tue, 4 Nov 2014 15:05:14 +0100
Clément Champetier <cnt at mikrosimage.eu> wrote:

> Hi everyone.
> 
> I have a question about the FFmpeg library, and more precisely about *audio
> management* when video and audio frames do not map one-to-one.
> 
> First I will explain my situation.
> I work on an open source project, which is a C++ wrapper of FFmpeg, called
> avTranscoder (https://github.com/avTranscoder/avTranscoder). The goal is to
> provide a high level API to rewrap / transcode media easily.
> Currently we have an issue with one of our common cases: take a MOV file,
> transcode video and audio at the same time, and create our output media
> (MXF).
> 
> We decode, convert, and encode video and audio frames sequentially.
> *ffprobe* tells us about our input file:
> 
> *Stream #0:0(eng): Video: dnxhd (AVdn / 0x6E645641), yuv422p, 1920x1080,
> 121241 kb/s, 25 fps, 25 tbr, 25k tbn, 25k tbc*
> *Stream #0:1(eng): Audio: pcm_s24le (in24 / 0x34326E69), 48000 Hz, 2
> channels, s32, 2304 kb/s*
> 
> After a few calculations, we expect:
> 
> *Video packet = 1 video frame*
> *Audio packet = 1920 audio samples (sample_rate / fps)*

Why would you expect that? It's wrong. Usually, a packet size will
correspond to the native frame size of the codec.
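To make the mismatch concrete, here is a small sketch of the arithmetic, using the numbers from the ffprobe output above (48000 Hz, 25 fps) and the 1024-sample packets reported by ffprobe -show_packets. The function and struct names are made up for illustration and are not part of FFmpeg or avTranscoder.

```cpp
#include <cassert>
#include <cstdint>

// Result of dividing the audio sample rate by a rational video frame rate.
struct SamplesPerFrame {
    std::int64_t whole;      // full audio samples per video frame
    std::int64_t remainder;  // leftover numerator, over fps_num (0 if exact)
};

// Audio samples spanned by one video frame, with the frame rate given as
// fps_num / fps_den so that rates such as 30000/1001 stay exact.
inline SamplesPerFrame samples_per_video_frame(std::int64_t sample_rate,
                                               std::int64_t fps_num,
                                               std::int64_t fps_den)
{
    assert(sample_rate > 0 && fps_num > 0 && fps_den > 0);
    const std::int64_t total = sample_rate * fps_den;
    return { total / fps_num, total % fps_num };
}

// Number of demuxed packets of packet_samples samples each that must be
// read before samples_wanted samples are available (ceiling division).
inline std::int64_t packets_needed(std::int64_t samples_wanted,
                                   std::int64_t packet_samples)
{
    assert(samples_wanted >= 0 && packet_samples > 0);
    return (samples_wanted + packet_samples - 1) / packet_samples;
}
```

With these inputs, samples_per_video_frame(48000, 25, 1) gives 1920 samples with no remainder, while packets_needed(1920, 1024) gives 2: one video frame's worth of audio never lines up with a whole number of 1024-sample packets.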

> 
> Unfortunately, we have more audio packets than video packets (there are in
> fact *1024 audio samples* per packet, as I can see with *ffprobe
> -show_packets*), and our output is "frame wrapped" (implying as many audio
> packets as video packets). This requires managing buffer(s) to encode
> audio packets correctly.
> 
> So my question is: how does ffmpeg buffer video and audio data in such a
> case?
> After looking at the ffmpeg.c file, we believe the solution starts here:
> https://github.com/FFmpeg/FFmpeg/blob/master/ffmpeg.c#L713 Is that right?

You can't expect video and audio packets to be the same length.
Generally, audio and video are synced using timestamps, so audio and
video packets that should be displayed at the same (or overlapping)
time will have similar timestamps.
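As a rough illustration of comparing timestamps across streams: each stream counts time in its own time base, so the values have to be brought to a common scale first. libavutil has av_compare_ts() for this; the standalone sketch below only mimics the idea. The TimeBase struct and compare_ts() are hypothetical names. The video time base 1/25000 is taken from the "25k tbn" above, and the audio time base 1/48000 is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// One tick of a stream's clock lasts num / den seconds.
struct TimeBase {
    std::int64_t num;
    std::int64_t den;
};

// Returns -1, 0, or 1 when ts_a is earlier than, equal to, or later than
// ts_b, each timestamp being expressed in its own time base.
// Assumption: the cross-multiplied products fit in 64 bits; this sketch
// does not guard against overflow the way a production routine should.
inline int compare_ts(std::int64_t ts_a, TimeBase tb_a,
                      std::int64_t ts_b, TimeBase tb_b)
{
    assert(tb_a.num > 0 && tb_a.den > 0 && tb_b.num > 0 && tb_b.den > 0);
    // ts_a * num_a / den_a  versus  ts_b * num_b / den_b, cross-multiplied
    // so that everything stays in integer arithmetic.
    const std::int64_t lhs = ts_a * tb_a.num * tb_b.den;
    const std::int64_t rhs = ts_b * tb_b.num * tb_a.den;
    return (lhs > rhs) - (lhs < rhs);
}
```

For example, the second video frame (timestamp 1000 in 1/25000) falls after the audio packet starting at sample 1024 and before the one starting at sample 2048 (both in 1/48000), so compare_ts() returns 1 and -1 respectively. It returns 0 against an audio timestamp of 1920.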

I think libavformat has some code for correct interleaving
(av_interleaved_write_frame() seems to be doing that), and in general
you can probably write packets to the muxer as they come out of the
demuxer. If the interleaving is off, the muxer will probably buffer
data. I'm mostly guessing here, though.
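If the output really needs a fixed number of audio samples per video frame (the "frame wrapped" case from the question), one option is to put a sample FIFO between the decoder and the encoder. libavutil ships AVAudioFifo (av_audio_fifo_write(), av_audio_fifo_read()) for that purpose. The sketch below is a plain C++ stand-in for the same idea, not FFmpeg or avTranscoder code: the SampleFifo class is hypothetical, and it assumes interleaved 32-bit samples.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical helper: accumulates decoded audio of any packet size and
// hands it back in chunks of a caller-chosen size.
class SampleFifo {
public:
    explicit SampleFifo(std::size_t channels) : channels_(channels)
    {
        assert(channels > 0);
    }

    // Append interleaved samples; the count must be a multiple of the
    // channel count.
    void push(const std::vector<std::int32_t>& interleaved)
    {
        assert(interleaved.size() % channels_ == 0);
        buffer_.insert(buffer_.end(), interleaved.begin(), interleaved.end());
    }

    // Samples per channel currently buffered.
    std::size_t size() const { return buffer_.size() / channels_; }

    // Move exactly frame_samples samples per channel into out.
    // Returns false, leaving the FIFO untouched, if too few are buffered.
    bool pop(std::size_t frame_samples, std::vector<std::int32_t>& out)
    {
        const std::size_t n = frame_samples * channels_;
        if (buffer_.size() < n) {
            return false;
        }
        const auto last = buffer_.begin() + static_cast<std::ptrdiff_t>(n);
        out.assign(buffer_.begin(), last);
        buffer_.erase(buffer_.begin(), last);
        return true;
    }

private:
    std::size_t channels_;
    std::deque<std::int32_t> buffer_;
};
```

With stereo input, pushing one 1024-sample packet is not enough for pop(1920, out) to succeed. After a second packet it succeeds and leaves 128 samples per channel buffered for the next frame.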