[FFmpeg-trac] #10458(undetermined:new): MPEG4 demuxing: last sample's duration ignored (was: MPEG4 AAC decoding: end padding not trimmed)
FFmpeg
trac at avcodec.org
Tue Jul 11 14:48:45 EEST 2023
#10458: MPEG4 demuxing: last sample's duration ignored
-------------------------------------+-------------------------------------
Reporter: John Regan | Owner: (none)
Type: defect | Status: new
Priority: normal | Component: undetermined
Version: unspecified | Resolution:
Keywords: | Blocked By:
Blocking: | Reproduced by developer: 0
Analyzed by developer: 0 |
-------------------------------------+-------------------------------------
Changes (by John Regan):
* summary: MPEG4 AAC decoding: end padding not trimmed => MPEG4 demuxing:
last sample's duration ignored
Old description:
> It seems like ffmpeg is properly removing the front padding from m4a
> files, but doesn't account for the end padding added to round frames up
> to 1024 samples.
>
> How to reproduce:
> {{{
> % ffmpeg -f lavfi -i anullsrc=r=48000:d=2 source.wav
>
> # verify the created audio file as exactly 96000 samples
> % soxi -s source.wav
> 96000
>
> # encode to aac
> % ffmpeg -i source.wav -c:a aac encoded.m4a
>
> # decode back to wav
> % ffmpeg -i encoded.m4a destination.wav
>
> # observe the sample count != 96000
> % soxi -s destination.wav
> 96256
> }}}
>
> Using boxdumper from l-smash, I can verify that ffmpeg correctly added an
> edit list box:
>
> {{{
> [edts: Edit Box]
> position = 845
> size = 36
> [elst: Edit List Box]
> position = 853
> size = 28
> version = 0
> flags = 0x000000
> entry_count = 1
> entry[0]
> segment_duration = 2000
> media_time = 1024
> media_rate = 1.000000
> }}}
>
> Additionally, there's a media header box with a duration set to 97024 -
> and subtracting the 1024 from the edit list box yields 96000:
>
> {{{
> [mdhd: Media Header Box]
> position = 889
> size = 32
> version = 0
> flags = 0x000000
> creation_time = UTC 1904/01/01, 00:00:00
> modification_time = UTC 1904/01/01, 00:00:00
> timescale = 48000
> duration = 97024 (00:00:02.021)
> language = und
> pre_defined = 0x0000
> }}}
>
> and the Decoding Time to Sample Box also adds up to 97024 - 94 samples at
> 1024 frames and 1 sample at 768 frames.
>
> {{{
> [stts: Decoding Time to Sample Box]
> position = 1140
> size = 32
> version = 0
> flags = 0x000000
> entry_count = 2
> entry[0]
> sample_count = 94
> sample_delta = 1024
> entry[1]
> sample_count = 1
> sample_delta = 768
> }}}
>
> ffmpeg version info:
> {{{
> ffmpeg version n6.0 Copyright (c) 2000-2023 the FFmpeg developers
> built with gcc 13.1.1 (GCC) 20230429
> configuration: --prefix=/usr --disable-debug --disable-static
> --disable-stripping --enable-amf --enable-avisynth --enable-cuda-llvm
> --enable-lto --enable-fontconfig --enable-gmp --enable-gnutls --enable-
> gpl --enable-ladspa --enable-libaom --enable-libass --enable-libbluray
> --enable-libbs2b --enable-libdav1d --enable-libdrm --enable-libfreetype
> --enable-libfribidi --enable-libgsm --enable-libiec61883 --enable-libjack
> --enable-libjxl --enable-libmfx --enable-libmodplug --enable-libmp3lame
> --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-
> libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse
> --enable-librav1e --enable-librsvg --enable-libsoxr --enable-libspeex
> --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora
> --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis
> --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265
> --enable-libxcb --enable-libxml2 --enable-libxvid --enable-libzimg
> --enable-nvdec --enable-nvenc --enable-opencl --enable-opengl --enable-
> shared --enable-version3 --enable-vulkan
> libavutil 58. 2.100 / 58. 2.100
> libavcodec 60. 3.100 / 60. 3.100
> libavformat 60. 3.100 / 60. 3.100
> libavdevice 60. 1.100 / 60. 1.100
> libavfilter 9. 3.100 / 9. 3.100
> libswscale 7. 1.100 / 7. 1.100
> libswresample 4. 10.100 / 4. 10.100
> libpostproc 57. 1.100 / 57. 1.100
> }}}
New description:
It seems like ffmpeg properly removes the front padding (the encoder
delay) from audio in mp4 files, but doesn't trim the end padding that
encoders add to round the final audio frame up to the codec's frame length.
The mp4 file signals this end padding by giving the final sample a shorter
duration, either in the Decoding Time to Sample Box or, for fragmented
mp4s, in the sample duration of the Track Fragment Run Box.
How to reproduce:
{{{
% ffmpeg -f lavfi -i anullsrc=r=48000:d=2 source.wav
# verify the created audio file has exactly 96000 samples
% soxi -s source.wav
96000
# encode to aac
% ffmpeg -i source.wav -c:a aac encoded.m4a
# decode back to wav
% ffmpeg -i encoded.m4a destination.wav
# observe the sample count != 96000
% soxi -s destination.wav
96256
}}}
Using boxdumper from l-smash, I can verify that ffmpeg correctly added an
edit list box that lists total media duration, as well as the samples to
trim from the beginning of the audio (the encoder delay):
{{{
[edts: Edit Box]
position = 845
size = 36
[elst: Edit List Box]
position = 853
size = 28
version = 0
flags = 0x000000
entry_count = 1
entry[0]
segment_duration = 2000
media_time = 1024
media_rate = 1.000000
}}}
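As a rough cross-check of the edit list, the sketch below converts
segment_duration into audio frames. It assumes a movie timescale of 1000
(the mvhd box is not shown in this ticket) and uses the media timescale of
48000 reported by the mdhd box in the old description. The function name is
made up for illustration.

```python
from fractions import Fraction

def segment_duration_in_media_units(segment_duration, movie_timescale, media_timescale):
    """Convert an elst segment_duration (movie timescale) to media timescale units."""
    return Fraction(segment_duration, movie_timescale) * media_timescale

# Assumption: the movie timescale is 1000; it does not appear in the dumps here.
frames = segment_duration_in_media_units(2000, 1000, 48000)
print(frames)  # 96000
```

Under that assumption the edit list describes exactly 2 seconds, i.e. the
96000 frames of source.wav.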
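The Decoding Time to Sample Box specifies that the final sample has a
duration of 768 frames instead of 1024 (here a "sample" is an mp4 sample,
i.e. one encoded packet, and a "frame" is one PCM frame). Doing the math:
(94 samples * 1024 frames) + 768 frames = 97024 frames. Subtract the 1024
frames of encoder delay given by the Edit List Box and you should have
96000 frames, matching source.wav. The decoded file instead has 96256,
which is 96000 plus the 256 frames of padding in the final sample
(1024 - 768).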
{{{
[stts: Decoding Time to Sample Box]
position = 1140
size = 32
version = 0
flags = 0x000000
entry_count = 2
entry[0]
sample_count = 94
sample_delta = 1024
entry[1]
sample_count = 1
sample_delta = 768
}}}
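The arithmetic can be sketched in a few lines of Python. This is only an
illustration of the calculation, not of how the demuxer works; the function
names are hypothetical, and the second function models the guess that the
final sample is treated as a full-length frame.

```python
def expected_frames(stts_entries, media_time):
    """Sum all stts durations, then drop the encoder delay from the edit list."""
    total = sum(count * delta for count, delta in stts_entries)
    return total - media_time

def frames_if_last_duration_ignored(stts_entries, media_time, frame_length):
    """Treat every sample, including the last one, as a full codec frame."""
    sample_count = sum(count for count, _ in stts_entries)
    return sample_count * frame_length - media_time

# (sample_count, sample_delta) pairs copied from the stts dump
aac_stts = [(94, 1024), (1, 768)]

print(expected_frames(aac_stts, 1024))                        # 96000
print(frames_if_last_duration_ignored(aac_stts, 1024, 1024))  # 96256
```

The second value matches what soxi reports for destination.wav.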
I think the issue may be that the MP4 demuxer does not signal the final
packet's duration. The same thing happens with other codecs, for example
mp3:
{{{
# using the same source.wav as above that's 96000 samples:
% ffmpeg -i source.wav -c:a libmp3lame encoded-in-mp3.mp4
% ffmpeg -i encoded-in-mp3.mp4 decoded-from-mp3.wav
% soxi -s decoded-from-mp3.wav
96815
}}}
Here's the edts box and stts from encoded-in-mp3.mp4:
{{{
[edts: Edit Box]
position = 32900
size = 36
[elst: Edit List Box]
position = 32908
size = 28
version = 0
flags = 0x000000
entry_count = 1
entry[0]
segment_duration = 2000
media_time = 1105
media_rate = 1.000000
[stts: Decoding Time to Sample Box]
position = 33205
size = 32
version = 0
flags = 0x000000
entry_count = 2
entry[0]
sample_count = 84
sample_delta = 1152
entry[1]
sample_count = 1
sample_delta = 337
}}}
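So again doing some math: (84 samples * 1152 frames) + 337 frames = 97105
frames. Subtracting the 1105 frames from the edit list gives 96000 frames,
but the decoded file has 96815. The extra 815 frames are the padding in
the final sample (1152 - 337).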
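For both files, the surplus in the decoded output can be compared with the
padding implied by the last stts entry. The sketch below uses only numbers
from this report; the frame lengths are taken from the first stts entry of
each file, and the helper name is hypothetical.

```python
SOURCE_FRAMES = 96000  # frames in source.wav

cases = {
    # codec: (frame_length, last_sample_delta, decoded_frames)
    "aac": (1024, 768, 96256),
    "mp3": (1152, 337, 96815),
}

def untrimmed_padding(frame_length, last_delta):
    """Frames of padding in the final sample that should have been dropped."""
    return frame_length - last_delta

for codec, (frame_length, last_delta, decoded) in cases.items():
    surplus = decoded - SOURCE_FRAMES
    print(codec, surplus, untrimmed_padding(frame_length, last_delta))
# aac 256 256
# mp3 815 815
```

In both cases the surplus equals the padding in the final sample, which is
consistent with the last sample's duration being ignored.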
Another example with opus:
{{{
% ffmpeg -i source.wav -c:a libopus encoded-in-opus.mp4
% ffmpeg -i encoded-in-opus.mp4 decoded-from-opus.wav
% soxi -s decoded-from-opus.wav
96648
}}}
The same issue occurs with a fragmented mp4, which doesn't use the Decoding
Time to Sample Box and instead relies on either the Track Fragment Header
Box or the Track Fragment Run Box for sample duration signaling.
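For reference, here is a minimal sketch of how per-sample durations in a
fragment are resolved as I understand ISO/IEC 14496-12: per-sample values
in the Track Fragment Run Box take priority, then the default in the Track
Fragment Header Box, then the default in the Track Extends Box. The class,
the function name and the example values are hypothetical; no fragmented
file was dumped for this ticket.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrackRun:
    # Simplified view of a trun box: only the duration-related fields.
    sample_count: int
    sample_durations: Optional[List[int]] = None  # set only if the trun carries them

def resolve_durations(trun, tfhd_default, trex_default):
    """Return one duration per sample: trun values, else tfhd default, else trex default."""
    if trun.sample_durations is not None:
        return list(trun.sample_durations)
    default = tfhd_default if tfhd_default is not None else trex_default
    return [default] * trun.sample_count

# Hypothetical final fragment: the short last sample needs a per-sample duration.
last_run = TrackRun(sample_count=3, sample_durations=[1024, 1024, 768])
print(resolve_durations(last_run, tfhd_default=1024, trex_default=1024))
# [1024, 1024, 768]

# Hypothetical earlier fragment: every sample uses the tfhd default.
full_run = TrackRun(sample_count=3)
print(resolve_durations(full_run, tfhd_default=1024, trex_default=None))
# [1024, 1024, 1024]
```

Either way the short duration of the final sample is available from the
container, the same as with the stts entry in the non-fragmented case.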
This does not seem to apply to codecs that carry their own duration
signaling, like FLAC in mp4.
ffmpeg version info:
{{{
ffmpeg version n6.0 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 13.1.1 (GCC) 20230429
configuration: --prefix=/usr --disable-debug --disable-static
--disable-stripping --enable-amf --enable-avisynth --enable-cuda-llvm
--enable-lto --enable-fontconfig --enable-gmp --enable-gnutls --enable-gpl
--enable-ladspa --enable-libaom --enable-libass --enable-libbluray
--enable-libbs2b --enable-libdav1d --enable-libdrm --enable-libfreetype
--enable-libfribidi --enable-libgsm --enable-libiec61883 --enable-libjack
--enable-libjxl --enable-libmfx --enable-libmodplug --enable-libmp3lame
--enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg
--enable-libopenmpt --enable-libopus --enable-libpulse --enable-librav1e
--enable-librsvg --enable-libsoxr --enable-libspeex --enable-libsrt
--enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libv4l2
--enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx
--enable-libwebp --enable-libx264 --enable-libx265 --enable-libxcb
--enable-libxml2 --enable-libxvid --enable-libzimg --enable-nvdec
--enable-nvenc --enable-opencl --enable-opengl --enable-shared
--enable-version3 --enable-vulkan
libavutil 58. 2.100 / 58. 2.100
libavcodec 60. 3.100 / 60. 3.100
libavformat 60. 3.100 / 60. 3.100
libavdevice 60. 1.100 / 60. 1.100
libavfilter 9. 3.100 / 9. 3.100
libswscale 7. 1.100 / 7. 1.100
libswresample 4. 10.100 / 4. 10.100
libpostproc 57. 1.100 / 57. 1.100
}}}
--
Comment:
Discovered this isn't limited to just AAC - I think it may apply to any
codec that relies on the mp4 file to signal the last sample's duration
(tested with the native aac encoder, libmp3lame, and libopus). I've
updated the bug title and description accordingly.
--
Ticket URL: <https://trac.ffmpeg.org/ticket/10458#comment:2>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker