
Video Notation, A Video Lingua Franca,
by Mark Filipak, February 2022.

I humbly submit this notational schema for comment. I've worked on it for about a year. It has evolved from complex and capable of simple things to simple and seemingly capable of anything.

The project has taken nearly a year because it took me nearly a year to learn how to stop trying to force the notation into some preconceived form and, instead, to follow the path to where the notation was trying to take me.

My motivation: I found that even video pros could not adequately describe video structures in discussions. As though writing in a private language, they related what they did to videos in terms of the DaVinci or Final Cut controls they used but could not explain what the controls actually did or how the video structures were actually manipulated.

In a nutshell, the notation is a structure definition for video. There's an import structure and an export structure. How the import is transformed to make the export is up to the implementation. But it's a short walk from a notation that's descriptive to a notation that's prescriptive, and I hope that an implementation will follow in the future.

The notation supports objects (frames, fields, pictures, halfpictures, scans) within strides, within streams. Operations on the objects are purely object-oriented because they are implied by the objects' relationships, not by explicit functions. For example, "fps[pps]" implies a relationship between two objects: fps (a frame-per-second stream) on the left, and pps (a picture-per-second stream) on the right. Due to their proximity and the brackets around "pps" (which suggest framing), "fps[pps]" implies decoding. Likewise, "[pps]fps" implies encoding.

The overall notational approach is to start with videos as they come out of cameras, as pps or sps (scan-per-second) streams; then to notationally describe how the videos were manipulated to put them on DVDs and Blurays ("Encoding of DVD & Bluray Content"); then to notationally describe how to manipulate the DVD and Bluray videos to recover the original camera shots ("Recovering The Camera Shots"); and finally to notationally describe how to manipulate those recovered camera shots to get what you want.

The notational system is quite simple. It employs just 62 tokens:
A-Z   a-z   0-9
4 smart delimiters:
[   ]   (   )
3 operators:
..   __   §
and 2 aliases:
^   $
There are a few rules and a few conventions but I believe them all to be sensible and more or less intuitive.

Unfortunately, I've not figured out how to get readers a little bit 'wet', then a little more 'wet', then a little more, etc., so what follows is a 'dive' into the deep end of the 'pool'. I hope you find the 'swim' refreshing.

About the diagrams: The diagrams shown are not part of the notation, but they make it easier (for me, and hopefully for readers) to visualize what the notations 'say'. What follows is for people who haven't experienced timing diagrams.

Imagine you type "HELLO" in 2.5 seconds and that you take a screen shot with each keystroke.
    ______________     ______________     ______________     ______________     ______________
   |H             |   |HE            |   |HEL           |   |HELL          |   |HELLO         |
   |              |   |              |   |              |   |              |   |              |
   |              |   |              |   |              |   |              |   |              |
   |              |   |              |   |              |   |              |   |              |
   |______________|   |______________|   |______________|   |______________|   |______________|
    screen shot #1     screen shot #2     screen shot #3     screen shot #4     screen shot #5
Imagine that you then make a movie of the screen shots. The following diagram
[H___________][HE__________][HEL_________][HELL________][HELLO_______]
uses the symbols [H___________] [HE__________] [HEL_________] [HELL________] [HELLO_______] to represent the 5 screen shots as frames. Time runs left to right. Underscore characters are added to the symbols to equalize their durations and to enhance their presentation.
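For readers who want to generate such symbols themselves, here is a minimal Python sketch of the padding described above. It is not part of the notation; the function name render_frames and its width parameter are hypothetical, and the width of 12 characters is simply what the "HELLO" diagram happens to use.

```python
def render_frames(labels, width=12):
    """Render each label as a bracketed symbol padded with underscores to `width`.

    Assumption: every symbol gets the same width so that equal widths
    stand for equal durations, as in the diagram above.
    """
    symbols = []
    for label in labels:
        if len(label) > width:
            raise ValueError(f"label {label!r} is wider than {width}")
        symbols.append("[" + label.ljust(width, "_") + "]")
    return "".join(symbols)


shots = ["H", "HE", "HEL", "HELL", "HELLO"]
print(render_frames(shots))
```

Using one fixed width for every symbol is what makes the left-to-right position readable as time.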
                 +----------------------------2.5 seconds----------------------------->
import pictures: (H___________)(HE__________)(HEL_________)(HELL________)(HELLO_______)   ...2pps

NOTATION: [2pps]2fps

                 +----------------------------2.5 seconds----------------------------->
  export frames: [H___________][HE__________][HEL_________][HELL________][HELLO_______]   ...2fps [note 1]
[note 1] The actual key closures occur at the [ moments rather than throughout the [____________] times. Those moments are denoted by PTSs (presentation time stamps) in metadata for each frame.
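As a rough illustration of note 1, the sketch below lists the moments at which each frame of the 2fps export would begin. It assumes a constant frame rate, a first frame at time 0, and PTSs expressed in seconds; real containers store PTSs in their own time base, and the function name frame_pts is hypothetical.

```python
from fractions import Fraction


def frame_pts(num_frames, fps):
    """Start time, in seconds, of each frame (assumes constant rate, first frame at 0)."""
    return [Fraction(i, fps) for i in range(num_frames)]


pts = frame_pts(5, 2)
print([float(t) for t in pts])  # [0.0, 0.5, 1.0, 1.5, 2.0]
```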
                 +---------------------------------2.5 seconds---------------------------------->
import pictures: (H_____________)(HE____________)(HEL___________)(HELL__________)(HELLO_________)   ...2pps [note 2]

NOTATION: [2pps]4fps

                 +-------------1.25 seconds------------->
  export frames: [H_____][HE____][HEL___][HELL__][HELLO_]                                           ...4fps [note 3]
[note 2] Notice that this diagram's '2.5 seconds' and the prior diagram's '2.5 seconds' look different. They are the same 2.5 seconds, but their seconds-per-texipix scales differ. Time scales are generally not shown in diagrams. Instead, per-second rates are denoted. If needed, texipix characters can be counted and the counts used to calculate seconds-per-texipix but seconds-per-texipix is usually unimportant.

[note 3] Because the camera shots were at 2pps, the frame rate is 4fps, there's one picture per frame, and only frames have PTSs, the movie is played back with a 2x speedup.
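The arithmetic behind note 3 can be checked with a short Python sketch. It assumes exactly one picture per frame, as in the diagrams; the function name playback is hypothetical and not part of the notation.

```python
from fractions import Fraction


def playback(num_pictures, capture_pps, playback_fps):
    """Return (capture seconds, playback seconds, speedup), assuming one picture per frame."""
    capture_seconds = Fraction(num_pictures, capture_pps)
    playback_seconds = Fraction(num_pictures, playback_fps)
    speedup = Fraction(playback_fps, capture_pps)
    return capture_seconds, playback_seconds, speedup


captured, played, speedup = playback(5, 2, 4)
print(float(captured), float(played), float(speedup))  # 2.5 1.25 2.0
```

The 5 pictures shot over 2.5 seconds at 2pps occupy 1.25 seconds at 4fps, which is the 2x speedup.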

About strides: Video streams are processed in units called 'strides'. A stride is the shortest repeating token pattern that correctly characterizes the structure of a particular stream. For example, "2-3 pulldown" is often seen in the literature but that notation is incorrect. The pulldown pattern should be "2-3-2-3" as demonstrated by what follows.
import pictures: (A+a_______________)(B+b_______________)(C+c_______________)(D+d_______________)

NOTATION: (AaBb)(AaBbB)   ...2-3 pulldown

                 +--------------1st stride-------------->+--------------2nd stride-------------->
import halfpics: (A_______)(a_______)(B_______)(b_______)(C_______)(c_______)(D_______)(d_______)
export halfpics: (A_____)(a_____)(B_____)(b_____)(B_____)(C_____)(c_____)(D_____)(d_____)(D_____)
export pictures: (A+a___________)(B+b___________)(B+C___________)(c+D___________)(d+D___________)
(AaBb)(AaBbB) produces two 5-halfpic strides, but the 2nd stride is wrong because the 3rd picture weaves 2 top halfpics and the 4th & 5th encoded frames will have wrong 'top_field_first' metadata, whereas...
import pictures: (A+a_______________)(B+b_______________)(C+c_______________)(D+d_______________)

NOTATION: (Aa-Dd)(AaBbBcCdDd)   ...2-3-2-3 pulldown

                 +----------------------------------1st stride---------------------------------->
import halfpics: (A_______)(a_______)(B_______)(b_______)(C_______)(c_______)(D_______)(d_______)
export halfpics: (A_____)(a_____)(B_____)(b_____)(B_____)(c_____)(C_____)(d_____)(D_____)(d_____)
export pictures: (A+a___________)(B+b___________)(B+c___________)(C+d___________)(D+d___________)
(Aa-Dd)(AaBbBcCdDd) produces a single 10-halfpic stride that is correct. (Aa-Dd)(AaBbBcCdDd) is therefore the correct pulldown pattern because it produces the shortest correct stride.
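The two pulldown examples can be reproduced with a minimal Python sketch. It is one reading of the notation, not an implementation of it, and rests on these assumptions: uppercase letters are top halfpics and lowercase are bottom halfpics, "(Aa-Dd)" abbreviates "AaBbCcDd", each import stride is relabelled by position before the export pattern is applied, and a correct picture weaves a top halfpic followed by a bottom halfpic. The names apply_pattern, weave and bad_pictures are hypothetical.

```python
def apply_pattern(halfpics, import_stride, export_stride):
    """Apply an export pattern to each import stride, matching halfpics by position."""
    n = len(import_stride)
    if len(halfpics) % n:
        raise ValueError("halfpics must fill whole import strides")
    out = []
    for start in range(0, len(halfpics), n):
        # Relabel this stride: pattern letter -> actual halfpic at that position.
        lookup = dict(zip(import_stride, halfpics[start:start + n]))
        out.extend(lookup[name] for name in export_stride)
    return out


def weave(halfpics):
    """Pair consecutive halfpics into pictures such as 'A+a'."""
    return [halfpics[i] + "+" + halfpics[i + 1]
            for i in range(0, len(halfpics) - 1, 2)]


def bad_pictures(pictures):
    """Pictures that are not a top halfpic followed by a bottom halfpic."""
    return [p for p in pictures if not (p[0].isupper() and p[-1].islower())]


source = list("AaBbCcDd")
short = weave(apply_pattern(source, "AaBb", "AaBbB"))
full = weave(apply_pattern(source, "AaBbCcDd", "AaBbBcCdDd"))
print(short, bad_pictures(short))
print(full, bad_pictures(full))
```

Under these assumptions the 5-halfpic pattern flags its 3rd, 4th and 5th pictures, while the 10-halfpic pattern flags none, which matches the two "export pictures" rows above.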