

48KHz[x1.001]48KHz is the audio found in cinema-at-24'fps & cinema-at-24'fps-soft & cinema-at-30'fps. It has been upsampled by x1.001. For example, for cinema-at-24'fps audio:
                   <-----------------41.666[6..] seconds------------------>                             ...shooting time
 cinema pictures: (#1_____________________)(#2_ .. (#1000__________________)                            ...24pps
    cinema audio:  <-----2000000 samples play in 41.666[6..] seconds------>                             ...shooting audio

                   <-----------------41.708[3..] seconds------------------------------------------->    ...playing time
  encoded frames: [#1_____________________][#2_ .. [#999___________________][#1000__________________]   ...24'fps
   encoded audio:  <-----2002000 samples play in 41.708[3..] seconds------------------------------->    ...x1.001 upsampled facsimile of shooting audio [note 1]
If 2000000 audio samples are upsampled to 2002000 samples (as though shifted from 48.048KHz to 48KHz), what happens to tonality? Tonality goes flat by just 1.7 cents [note 2] (very little, but apparently enough to make musicians unhappy). The diagrams above show why. Video has been slowed down (for example, from 41.666 seconds to 41.708 seconds) simply by extending PTSs by x1.001 (for example, from 3750 ticks per frame to 3753.75 ticks per frame). Images move 0.1% slower, mouths work 0.1% slower. It's simple and easy. Audio is more complicated because it has additional requirements. First, audio PTSs (i.e. the packet times) are also extended; otherwise, audio buffers would eventually overflow, provoking momentary but fairly regular audio skips. Second, the actual audio (i.e. the sound) has been lengthened (for example, from 41.666 seconds to 41.708 seconds, and therefore pitched flat); otherwise, the audio would be out of sync: it would lead the video by 3.6 seconds per hour of play. Third, the lengthened audio has been metadata tagged "48KHz"; otherwise, the audio might not play at all. Some people find it confusing that a video slowdown requires upsampled audio, but the preceding explanation hopefully helps.

[note 1] Most tools that can dial audio PTSs forward or back in time by an x-factor (for example, MKVToolNix) will automatically resample the audio to match.

[note 2] For musicians: x1.001 upsampled audio is flat by 1.7 cents (i.e. x = -0.017 semitones -- the solution of 2^(x/12) = 1/1.001).
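
The figures quoted above (1.7 cents flat, 3.6 seconds per hour) can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic; the function names are illustrative and do not belong to any tool mentioned here.

```python
import math

def pitch_shift_cents(length_factor):
    """Pitch change, in cents, when audio is resampled to length_factor
    times its original length and played at the unchanged 48KHz rate.
    Negative means flat, positive means sharp."""
    return 1200 * math.log2(1 / length_factor)

def sync_drift_seconds_per_hour(length_factor):
    """Seconds by which audio that was NOT resampled would drift from the
    retimed video per hour (positive: audio leads, negative: audio lags)."""
    return 3600 * (length_factor - 1)

print(round(pitch_shift_cents(1.001), 2))            # -1.73
print(round(sync_drift_seconds_per_hour(1.001), 2))  # 3.6
```

Dividing the cents figure by 100 gives the semitone value used in [note 2].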

48KHz[x/1.001]48KHz reverses 48KHz[x1.001]48KHz. It is applied by cinema-from-24'fps & cinema-from-30'fps-soft & cinema-from-30'fps-hard to recover a facsimile of the shooting audio.
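Implementation example: ffmpeg -i SOURCE -filter_complex "[0:v]settb=expr=1/24,setpts=expr=N[v];[0:a]atempo=1.001[a]" -map "[v]" -map "[a]" -codec:v hevc -codec:a ac3 -r 24 TARGET
The settb/setpts pair retimes the pictures to 24pps; -r 24 alone, as an output option, would duplicate frames instead of speeding the video up. Note that atempo shortens the audio but preserves its pitch, so the 1.7 cents of flatness remain; to restore pitch as well, asetrate=48048,aresample=48000 can replace atempo=1.001.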
                   <-----------------41.708[3..] seconds------------------------------------------->    ...playing time
  encoded frames: [#1_____________________][#2_ .. [#999___________________][#1000__________________]   ...24'fps
   encoded audio:  <-----2002000 samples play in 41.708[3..] seconds------------------------------->    ...upsampled audio

                   <-----------------41.666[6..] seconds------------------>                             ...shooting time
 cinema pictures: (#1_____________________)(#2_ .. (#1000__________________)                            ...24pps
    cinema audio:  <-----2000000 samples play in 41.666[6..] seconds------>                             ...x/1.001 downsampled facsimile of upsampled audio [note 1]
[note 1] Most tools that can dial audio PTSs forward or back in time by an x-factor (for example, MKVToolNix) will automatically resample the audio to match.
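
When the original pitch should be recovered along with the original duration, one option is ffmpeg's asetrate filter followed by aresample. The Python sketch below only computes the rate to hand to asetrate and assembles the filter string. The helper name is hypothetical, and the wiring is an assumption about how one might do this, not a description of the method used above.

```python
from fractions import Fraction

def reversal_filter(length_factor, tagged_rate=48000):
    """Build an ffmpeg audio filter string that undoes a resample by
    length_factor. The samples are first reinterpreted at
    tagged_rate * length_factor (restoring duration and pitch), then
    resampled back to tagged_rate."""
    true_rate = Fraction(tagged_rate) * Fraction(length_factor)
    if true_rate.denominator != 1:
        # asetrate takes an integer sample rate
        raise ValueError("rate is not a whole number of Hz")
    return f"asetrate={true_rate.numerator},aresample={tagged_rate}"

print(reversal_filter(Fraction(1001, 1000)))  # asetrate=48048,aresample=48000
print(reversal_filter(Fraction(96, 100)))     # asetrate=46080,aresample=48000
```

Exact fractions are used so that 48000 x 1.001 comes out as exactly 48048 rather than a float that needs rounding.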

48KHz[x0.96]48KHz is the audio found in cinema-at-25fps-forced. It has been downsampled by x0.96.
                   <----------------------41.666[6..] seconds-------------------------------------->    ...shooting time
 cinema pictures: (#1_____________________)(#2_ .. (#999___________________)(#1000__________________)   ...24pps
    cinema audio:  <----------2000000 samples play in 41.666[6..] seconds-------------------------->    ...shooting audio

                   <----------------------40 seconds----------------------->                            ...playing time
  encoded frames: [#1____________________][#2__ .. _][#1000_________________]                           ...25fps
   encoded audio:  <----------1920000 samples play in 40 seconds----------->                            ...x0.96 downsampled facsimile of shooting audio [note 1]
If 2000000 audio samples are downsampled to 1920000 samples (as though shifted from 46.08KHz to 48KHz), what happens to tonality? It is well known and easily heard that tonality goes sharp by 70.7 cents [note 2] (about seven tenths of a semitone, from C to nearly C# for example). The diagrams above show why. Video has been sped up (for example, from 41.666 seconds to 40 seconds) simply by reducing PTSs by x0.96 (for example, from 3750 ticks per frame to 3600 ticks per frame). Images move 4% faster, mouths work 4% faster. It's simple and easy. Audio is more complicated because it has additional requirements. First, audio PTSs (i.e. the packet times) are also reduced; otherwise, audio buffers would underflow, provoking momentary but fairly regular audio dropouts. Second, the actual audio (i.e. the sound) has been shortened (for example, from 41.666 seconds to 40 seconds, and therefore pitched sharp); otherwise, the audio would be out of sync: it would lag the video by 2 minutes 24 seconds per hour of play. Third, the shortened audio has been metadata tagged "48KHz"; otherwise, the audio might not play at all. Some people find it confusing that a video speedup requires downsampled audio, but the preceding explanation hopefully helps.

[note 1] Most tools that can dial audio PTSs forward or back in time by an x-factor (for example, MKVToolNix) will automatically resample the audio to match.

[note 2] For musicians: x0.96 downsampled audio is sharp by 70.7 cents (i.e. x = 0.707 semitones -- the solution of 2^(x/12) = 1/0.96).
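
The timing figures in this section follow from exact arithmetic on the PTS clock. The sketch below is a minimal check, assuming a 90000 ticks-per-second clock (implied by 3750 ticks per frame at 24pps) and the 1000-picture example from the diagrams; all names are illustrative.

```python
from fractions import Fraction

PTS_CLOCK = 90000      # assumed ticks per second (3750 ticks/frame at 24pps)
SAMPLE_RATE = 48000    # tagged audio rate in Hz
PICTURES = 1000        # pictures in the diagram example

def retimed(length_factor, shooting_pps=24):
    """Return (ticks per frame, duration in seconds, audio sample count)
    after every PTS has been multiplied by length_factor."""
    ticks = Fraction(PTS_CLOCK, shooting_pps) * length_factor
    duration = ticks * PICTURES / PTS_CLOCK
    samples = duration * SAMPLE_RATE
    return ticks, duration, samples

ticks, duration, samples = retimed(Fraction(96, 100))
print(ticks, duration, samples)  # 3600 40 1920000
```

Passing Fraction(1001, 1000) instead reproduces the 3753.75 ticks per frame and 2002000 samples of the x1.001 case.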

48KHz[x/0.96]48KHz reverses 48KHz[x0.96]48KHz. It is applied by cinema-from-25fps-forced to recover a facsimile of the shooting audio.
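Implementation example: ffmpeg -i SOURCE -filter_complex "[0:v]settb=expr=1/24,setpts=expr=N[v];[0:a]atempo=0.96[a]" -map "[v]" -map "[a]" -codec:v hevc -codec:a ac3 -dn -r 24 TARGET
Note that atempo lengthens the audio but preserves its pitch, so the audio stays 70.7 cents sharp; to restore pitch as well, asetrate=46080,aresample=48000 can replace atempo=0.96.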
                   <----------------------40 seconds----------------------->                            ...playing time
  encoded frames: [#1____________________][#2__ .. _][#1000_________________]                           ...25fps
   encoded audio:  <----------1920000 samples play in 40 seconds----------->                            ...downsampled audio

                   <----------------------41.666[6..] seconds-------------------------------------->    ...shooting time
 cinema pictures: (#1_____________________)(#2_ .. (#999___________________)(#1000__________________)   ...24pps
    cinema audio:  <----------2000000 samples play in 41.666[6..] seconds-------------------------->    ...x/0.96 upsampled facsimile of downsampled audio [note 1]
[note 1] Most tools that can dial audio PTSs forward or back in time by an x-factor (for example, MKVToolNix) will automatically resample the audio to match.