[FFmpeg-user] Audio normalization using "volume" and "compand" filters

Wed Nov 27 19:47:42 CET 2013

Hello,

I have a couple of questions regarding the normalization of audio levels in
a file. Note that I am not an audio expert.

The idea is to bring the RMS level to a given value "out_rms", say -20 dBFS,
with a given maximum peak level "out_peak_max", say -1 dBFS. In a first pass,
I measure "in_rms" and "in_peak" using the audio filter "volumedetect".

My first thought is to use the audio filter "volume" if the input dynamics
is less than the output one (i.e. in_peak - in_rms < out_peak_max - out_rms).
I expect to shift the whole signal to out_rms without distortion and keep
out_peak < out_peak_max.

My second thought is to use the audio filter "compand" to adjust the volume
and compress the dynamics if the input dynamics is too large. I expect to
obtain out_rms for the mean level and out_peak = out_peak_max.
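
The choice between the two filters described above can be sketched in
Python. This is only an illustration of the rule as stated in this message;
the function name plan_normalization and its default targets are
hypothetical, and all levels are assumed to be in dBFS.

```python
def plan_normalization(in_rms, in_peak, out_rms=-20.0, out_peak_max=-1.0):
    """Pick a filter and the gain (dB) that moves in_rms to out_rms.

    Hypothetical helper: it only encodes the rule from the message,
    i.e. use "volume" when the input dynamics fits in the output dynamics.
    """
    gain_db = round(out_rms - in_rms, 1)
    in_dynamics = in_peak - in_rms
    out_dynamics = out_peak_max - out_rms
    if in_dynamics < out_dynamics:
        return "volume", gain_db   # a plain shift keeps the peak under the limit
    return "compand", gain_db      # dynamics too large: compression is needed


print(plan_normalization(-34.6, -14.5))
```

With the levels measured further down (in_rms = -34.6, in_peak = -14.5) the
input dynamics is 20.1 dB against 19 dB allowed, so the sketch selects
"compand" with a nominal gain of 14.6 dB.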

First problem: The filter "volume" works, but the actual adjustment is always
about 0.5 dB below the requested value. For -af volume=+5dB, I get an
actual +4.5 dB. For -af volume=-10dB, I get -10.5 dB, and so on. See the
details below.

Second problem: The usage of the filter "compand" is extremely obscure. Its
documentation (http://ffmpeg.org/ffmpeg-filters.html#compand) can hardly be
understood if you do not already know the meaning of each parameter. See
below for some tests I made without really understanding what the
parameters mean.

------------------------

More details on the -0.5dB offset with the filter "volume":
Let's take as input an MP3 file with low-level audio volume.

> ffmpeg -i test-audio-1.mp3 -af volumedetect -f null -y nul
ffmpeg version 2.1.1 Copyright (c) 2000-2013 the FFmpeg developers
  built on Nov 20 2013 21:13:48 with gcc 4.8.2 (GCC)
  configuration: --enable-gpl --enable-version3 --disable-w32threads --enable-avisynth --enable-bzlib --enable-fontconfig --enable-frei0r --enable-gnutls --enable-iconv --enable-libass --enable-libbluray --enable-libcaca --enable-libfreetype --enable-libgsm --enable-libilbc --enable-libmodplug --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-librtmp --enable-libschroedinger --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvo-aacenc --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libx264 --enable-libxavs --enable-libxvid --enable-zlib
  libavutil      52. 48.101 / 52. 48.101
  libavcodec     55. 39.101 / 55. 39.101
  libavformat    55. 19.104 / 55. 19.104
  libavdevice    55.  5.100 / 55.  5.100
  libavfilter     3. 90.100 /  3. 90.100
  libswscale      2.  5.101 /  2.  5.101
  libswresample   0. 17.104 /  0. 17.104
  libpostproc    52.  3.100 / 52.  3.100
Input #0, mp3, from 'test-audio-1.mp3':
  Metadata:
    encoder         : Lavf55.19.104
  Duration: 00:01:00.00, start: 0.000000, bitrate: 96 kb/s
    Stream #0:0: Audio: mp3, 48000 Hz, stereo, s16p, 96 kb/s
Output #0, null, to 'nul':
  Metadata:
    encoder         : Lavf55.19.104
    Stream #0:0: Audio: pcm_s16le, 48000 Hz, stereo, s16, 1536 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (mp3 -> pcm_s16le)
Press [q] to stop, [?] for help
size=N/A time=00:01:00.00 bitrate=N/A    
video:0kB audio:11248kB subtitle:0 global headers:0kB muxing overhead -100.000191%
[Parsed_volumedetect_0 @ 0000000002888ba0] n_samples: 5758942
[Parsed_volumedetect_0 @ 0000000002888ba0] mean_volume: -34.6 dB
[Parsed_volumedetect_0 @ 0000000002888ba0] max_volume: -14.5 dB
[Parsed_volumedetect_0 @ 0000000002888ba0] histogram_14db: 23
[Parsed_volumedetect_0 @ 0000000002888ba0] histogram_15db: 92
[Parsed_volumedetect_0 @ 0000000002888ba0] histogram_16db: 219
[Parsed_volumedetect_0 @ 0000000002888ba0] histogram_17db: 778
[Parsed_volumedetect_0 @ 0000000002888ba0] histogram_18db: 2862
[Parsed_volumedetect_0 @ 0000000002888ba0] histogram_19db: 7000

The RMS level is -34.6 dBFS and the peak level is -14.5 dBFS.
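
For scripting the first pass, the two values can be pulled out of the log
with a small Python sketch. It assumes the line format shown in the output
above ("mean_volume: ... dB", "max_volume: ... dB"); the function name
parse_volumedetect is hypothetical, and the log text is expected to have
been captured already (ffmpeg normally writes it to stderr).

```python
import re


def parse_volumedetect(log_text):
    """Return (mean_volume, max_volume) in dB from volumedetect log lines."""
    values = {}
    for key in ("mean_volume", "max_volume"):
        # Matches e.g. "mean_volume: -34.6 dB" (format assumed from the log above).
        match = re.search(rf"{key}:\s*(-?\d+(?:\.\d+)?) dB", log_text)
        if match is None:
            raise ValueError(f"{key} not found in log")
        values[key] = float(match.group(1))
    return values["mean_volume"], values["max_volume"]


sample = """\
[Parsed_volumedetect_0 @ 0000000002888ba0] mean_volume: -34.6 dB
[Parsed_volumedetect_0 @ 0000000002888ba0] max_volume: -14.5 dB
"""
in_rms, in_peak = parse_volumedetect(sample)
print(in_rms, in_peak)
```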

I apply the audio filter "volume" with the following command and check the
audio levels of the output file using "-af volumedetect" again.

  ffmpeg -i test-audio-1.mp3 -af "volume=...dB" -y test-audio-2.mp3
  ffmpeg -i test-audio-2.mp3 -af volumedetect -f null -y nul

Here are the results with various volume options:

Input file:
  mean_volume: -34.6 dB
  max_volume: -14.5 dB

Using -af "volume=+5dB":
  mean_volume: -30.1 dB => +4.5
  max_volume: -9.9 dB   => +4.6

Using -af "volume=+13.5dB":
  mean_volume: -21.6 dB => +13
  max_volume: -1.4 dB   => +13.1

Using -af "volume=-10dB":
  mean_volume: -45.1 dB => -10.5
  max_volume: -24.9 dB  => -10.4

So it seems that the actual adjustment is the requested value minus 0.5 dB
for the RMS level and minus 0.4 dB for the peak.
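
The offsets can be recomputed from the figures listed above with a few lines
of Python. This only redoes the arithmetic on the reported measurements; it
does not explain where the offset comes from. The names IN_MEAN, IN_MAX and
offsets are hypothetical.

```python
# Levels of the input file, as reported by volumedetect (dB).
IN_MEAN, IN_MAX = -34.6, -14.5

# requested gain (dB) -> (mean_volume, max_volume) measured on the output file
measurements = {
    5.0: (-30.1, -9.9),
    13.5: (-21.6, -1.4),
    -10.0: (-45.1, -24.9),
}


def offsets(requested_db, out_mean, out_max):
    """Return (mean offset, peak offset): actual change minus requested gain."""
    actual_mean = out_mean - IN_MEAN
    actual_max = out_max - IN_MAX
    return round(actual_mean - requested_db, 1), round(actual_max - requested_db, 1)


for requested, (mean, peak) in measurements.items():
    print(requested, offsets(requested, mean, peak))
```

All three tests give the same pair of offsets, (-0.5, -0.4).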

Is there any rational explanation for this?

How can we compute the right volume adjustment for bringing the input RMS
level to a precise value (apart from blindly applying a hardcoded +0.5 dB to
compensate for the observed behavior)?

------------------------

Some tests using the filter "compand".

Let's use the same input file with in_rms = -34.6 dBFS and in_peak = -14.5
dBFS. Let's try to compress the dynamics and adjust the level so that
out_rms = -10 dBFS and out_peak = -1 dBFS. This is too compressed, I know,
this is just to test the filter.

My understanding is that, to get a dynamics of peak - rms = 9 dB, I should
compress the signal around in_rms = -34.6 dBFS down to a peak of -25.6 dBFS
and then apply a gain of +24.6 dB. Am I right or completely off?
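
The arithmetic behind these numbers, as a short Python sketch. It only
reproduces the reasoning of this message and does not claim that this is how
"compand" should be configured; the function name compression_targets is
hypothetical.

```python
def compression_targets(in_rms, out_rms, out_peak):
    """Return (target dynamics, compressed peak, make-up gain), all in dB."""
    target_dynamics = out_peak - out_rms        # wanted peak - rms after compression
    compressed_peak = in_rms + target_dynamics  # peak level before the gain is applied
    gain_db = out_rms - in_rms                  # shift that moves in_rms to out_rms
    return round(target_dynamics, 1), round(compressed_peak, 1), round(gain_db, 1)


print(compression_targets(-34.6, -10.0, -1.0))
```

For in_rms = -34.6, out_rms = -10 and out_peak = -1 this gives a dynamics of
9 dB, a compressed peak of -25.6 dBFS and a gain of 24.6 dB.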

Using a mixture of recommended values in the documentation and what I think
I understand of the "compand" parameters, I tried the following filters with
the following results.

Input file
  mean_volume: -34.6 dB
  max_volume: -14.5 dB

-af "compand=attacks=0.3 0.3:decays=0.8 0.8:points=-90/-900 -70/-70 -34.6/-34.6 -14.5/-25.6 0/-25.6:soft-knee=0.01:gain=24.6:volume=0:delay=0.8"
  mean_volume: -12.6 dB
  max_volume: 0.0 dB

I thought that the points "-90/-900 -70/-70 -34.6/-34.6 -14.5/-25.6 0/-25.6"
would keep -34.6 dBFS at its level and compress -14.5 dBFS and higher to
-25.6 and then the signal would be shifted up to out_rms = -10 dBFS and
out_peak = -1 dBFS. RMS is not that far from -10 dBFS but there is clearly
some clipping.
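
What I expected from these points can be written as a static curve in
Python: straight lines between the listed input/output pairs, in dB. This is
an idealized reading and an assumption on my part; it ignores attacks,
decays, delay and soft-knee, so it is not a model of what the filter really
does. The function name static_curve is hypothetical.

```python
def static_curve(points_spec, level_db):
    """Linearly interpolate the output level (dB) for an input level (dB).

    points_spec uses the "in/out in/out ..." form from the command above.
    Simplification: no envelope follower and no soft knee.
    """
    points = sorted(
        tuple(float(v) for v in pair.split("/")) for pair in points_spec.split()
    )
    if not points[0][0] <= level_db <= points[-1][0]:
        raise ValueError("level outside the range covered by the points")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if level_db <= x1:
            return y0 + (y1 - y0) * (level_db - x0) / (x1 - x0)


POINTS = "-90/-900 -70/-70 -34.6/-34.6 -14.5/-25.6 0/-25.6"
for level in (-34.6, -14.5, 0.0):
    print(level, round(static_curve(POINTS, level), 1))
```

Under this reading -34.6 dBFS stays at -34.6, while -14.5 dBFS and 0 dBFS
both map to -25.6 before the gain is added.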

Same test without the additional gain:

-af "compand=attacks=0.3 0.3:decays=0.8 0.8:points=-90/-900 -70/-70 -34.6/-34.6 -14.5/-25.6 0/-25.6:soft-knee=0.01:gain=0:volume=0:delay=0.8"
  mean_volume: -36.7 dB
  max_volume: -17.1 dB

The RMS has been lowered and the dynamics is 19.6 dB instead of 9 dB.

Could someone please explain how to use "compand" and its parameters?

More precisely, how can we compress an input audio with the characteristics
in_rms and in_peak into a given target out_rms and out_peak?

Thanks for your help.
-Thierry