[FFmpeg-user] Advice on using silence removal

Sat Aug 21 21:48:32 EEST 2021

On Fri, Aug 20, 2021 at 9:47 AM Alex R <ralienpp at gmail.com> wrote:

> Hi everyone,
>
> I am attempting to leverage ffmpeg in a project that involves recording
> short audio clips. So far I have gotten some mixed results and I'd like to
> tap into your collective knowledge to ensure my approach is sound.
>
> Context:
> - a person records an audio clip of themselves pronouncing a word (imagine
> that you read aloud a flash-card that says "tree" or "helicopter")
> - the recording is usually made on a mobile phone
>
> The clip contains some silence at both ends, because there is a delay
> between the moment the user presses the record button, the moment they
> pronounce their word, and the moment they press "stop". Depending on the
> device, there may also be an audible click at the beginning.
>
> My objective is to trim the silence at both ends and apply fade-in/out to
> soften the clicks, if any.
>
> The challenge is that ffmpeg's silenceremove filter needs a threshold
> value, but:
> - each user is in their own environment, with different levels of ambient
> noise
> - each device is unique in terms of sensitivity
>
> Thus, I can achieve my desired result with one specific clip through trial
> and error, tinkering with thresholds until I get what I need. But I cannot
> figure out how to detect these thresholds automatically, such that I can
> replicate the result with a broad range of users, environments and
> recording devices.
>
> Note that there is no expectation to produce perfect results that match the
> quality of an audio recording studio; I'm more in the "rough, but good
> enough for practical purposes" territory.
>
> Having read the documentation and various forums, I put together this
> pipeline (actual commands in the appendix):
>
> 1. run volumedetect to see what the maximum level is
> 1a. parse ffmpeg's log output (written to stderr) to extract `max_volume`
> 2. normalize the audio by applying a gain of -`max_volume`, so that the
> peak reaches 0 dBFS
> 3. apply silenceremove with <empirically determined threshold>
> 3a. for the beginning of the file
> 3b. reverse the stream and run another silenceremove for the beginning
> (which is actually the end)
> 3c. reverse it back and save the output
>
>
>
> What I read in the forums gave me the impression that we need step 2 so
> that at step 3 we could say the threshold is 0. However, that is not the
> case: I still had to find a reasonable threshold via trial and error.
>
> After I found a value that produces a good result, I assumed that it might
> be good enough for practical purposes and it would be OK to simply hardcode
> it into my code as a magic number. However, on the next day I attempted to
> replicate the results using the same recording device in the same room -
> but this time ffmpeg would tell me the filtered stream is empty, nothing to
> write. The environment wasn't 100% identical, since I'm not doing this in a
> controlled lab, but most of the variables are the same, though perhaps the
> windows were open and it was a different time of the day, so the baseline
> noise level outside was somewhat different.
>
> Clearly, my approach is not robust. I'd like to understand whether there
> is any low-hanging fruit that I can try, or if I'm not on the right
> track.
>
> I imagine that the solution I need would somehow determine the silence
> threshold relative to the rest of the file, instead of using a "one fits
> all" value. However, I did not find such filters or analyzers in ffmpeg.
>
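
One way to get a threshold "relative to the rest of the file" is to measure the clip before filtering it. The following is only a sketch of one possible heuristic, not something proposed in this thread: decode the clip to PCM, compute an RMS level per short window, and place the threshold between the quiet windows and the loud ones. Python, all function names, the percentile choices (10 and 95), the midpoint position, and the -96 dBFS clamp are assumptions made for illustration.

```python
import array
import math
import subprocess
import sys


def decode_mono_s16(path, rate=16000):
    """Decode `path` to mono signed 16-bit PCM via ffmpeg and return the samples."""
    raw = subprocess.run(
        ["ffmpeg", "-hide_banner", "-nostdin", "-i", path,
         "-f", "s16le", "-ac", "1", "-ar", str(rate), "-"],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, check=True,
    ).stdout
    samples = array.array("h")
    samples.frombytes(raw[: len(raw) - len(raw) % 2])
    if sys.byteorder == "big":
        samples.byteswap()  # the s16le stream is little-endian
    return samples


def window_levels_db(samples, window):
    """RMS level in dBFS of each consecutive, non-overlapping window."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / window)
        # Clamp digital silence to -96 dBFS (assumed floor) to avoid log10(0).
        levels.append(max(-96.0, 20 * math.log10(rms / 32768.0)) if rms else -96.0)
    return levels


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def relative_threshold_db(levels_db, floor_pct=10, speech_pct=95, position=0.5):
    """Heuristic: put the threshold part-way between quiet and loud windows."""
    ordered = sorted(levels_db)
    floor = percentile(ordered, floor_pct)    # estimate of the ambient noise
    speech = percentile(ordered, speech_pct)  # estimate of the spoken word
    return floor + position * (speech - floor)
```

With a 16 kHz decode, `window_levels_db(decode_mono_s16("input.mp3"), 320)` gives one level per 20 ms, and the result of `relative_threshold_db(...)` could be formatted as, for example, `start_threshold=-35dB`. The levels are measured on the un-normalized clip, so the threshold applies to that clip; after a gain of G dB it would have to be shifted by G as well. The window RMS here should be on roughly the same scale as silenceremove's own detection, but that is worth verifying on real recordings.
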
>
> Your guidance will be greatly appreciated,
> Alex
>
>
>
>
> Appendix, pipeline commands
>
> 1. ffmpeg -i input.mp3 -af "volumedetect" -f null /dev/null
> here I parse the log output (ffmpeg writes it to stderr), looking for
> something like
> "[Parsed_volumedetect_0 @ 0x559dbe815f00] max_volume: -15.9 dB"
>
> 2. ffmpeg -i input.mp3 -af "volume=15.9dB" out2-normalized.mp3
>
> 3. ffmpeg -i out2-normalized.mp3 -af silenceremove=start_periods=1:start_duration=0:start_threshold=-6dB:start_silence=0.5,areverse,silenceremove=start_periods=1:start_duration=0:start_threshold=-6dB:start_silence=0.5,afade=t=in:st=0:d=0.3,areverse,afade=t=in:st=0:d=0.3 out3-trimmed.mp3
>
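
Steps 1 and 2 of the appendix can be scripted. Below is a minimal sketch, assuming Python (the thread does not say what language the project uses) and assuming the volumedetect report looks like the line quoted above. ffmpeg writes its log to stderr, so that is the stream captured here. The function names are hypothetical.

```python
import re
import subprocess

# Matches e.g. "[Parsed_volumedetect_0 @ 0x...] max_volume: -15.9 dB"
_VOLUME_RE = re.compile(r"(mean_volume|max_volume):\s*(-?\d+(?:\.\d+)?) dB")


def parse_volumedetect(log_text):
    """Return the mean_volume/max_volume values (in dB) found in ffmpeg log text."""
    return {name: float(value) for name, value in _VOLUME_RE.findall(log_text)}


def measure_volume(path):
    """Run volumedetect on `path` and parse the levels from ffmpeg's stderr."""
    proc = subprocess.run(
        ["ffmpeg", "-hide_banner", "-nostdin", "-i", path,
         "-af", "volumedetect", "-f", "null", "-"],
        stderr=subprocess.PIPE, text=True, check=True,
    )
    return parse_volumedetect(proc.stderr)


def normalization_gain_db(max_volume_db):
    """Gain that brings the loudest peak to 0 dBFS; never attenuates."""
    return max(0.0, -max_volume_db)
```

For the log line quoted in step 1, `normalization_gain_db(-15.9)` gives `15.9`, which is the value passed to `volume=15.9dB` in step 2.
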

Use the window option too, and set detection to peak if rms (the default
value) is not working as expected.
There is not much that can be done if the silence level (in dBFS) is
variable and changes a lot.
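
As an illustration of the two options mentioned above, here is a small Python sketch that builds the filter chain from the appendix with `detection` and `window` added. The helper names are hypothetical, Python is an assumption, and the `window=0.02` default is taken from my reading of the silenceremove documentation, so check it against your ffmpeg version. The threshold still has to come from somewhere; this only makes the options easy to vary.

```python
def silenceremove_args(threshold_db, keep_silence=0.5, window=0.02, detection="peak"):
    """One silenceremove instance that trims leading silence only."""
    if detection not in ("peak", "rms"):
        raise ValueError("detection must be 'peak' or 'rms'")
    return (
        "silenceremove=start_periods=1:start_duration=0"
        f":start_threshold={threshold_db:g}dB:start_silence={keep_silence:g}"
        f":detection={detection}:window={window:g}"
    )


def trim_both_ends(threshold_db, fade=0.3, **options):
    """Trim the head, reverse, trim the former tail, fade, reverse back, fade."""
    trim = silenceremove_args(threshold_db, **options)
    # A fade-in applied to the reversed stream acts as a fade-out on the result.
    fade_in = f"afade=t=in:st=0:d={fade:g}"
    return ",".join([trim, "areverse", trim, fade_in, "areverse", fade_in])
```

The returned string can be passed as the `-af` argument, e.g. `ffmpeg -i out2-normalized.mp3 -af "<chain>" out3-trimmed.mp3`, and `trim_both_ends(-6, detection="rms")` reproduces the original chain with the extra options spelled out.
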

>
>
> An example of an input file is available at
> railean.net/files/public-temp/in-fresh.mp3, after normalization you can
> hear some church bells in the distance. I'm totally fine with them
> remaining audible in the result, as long as the leading and trailing
> silence is removed.