[FFmpeg-devel] [PATCH 16/21] libavcodec/avcodec, libavformat/movenc: support embedding channel layout to stream side data

Wed Aug 24 17:39:53 EEST 2016

On 08/23/2016 08:49 PM, Nicolas George wrote:
>> +    int speaker_position; /** an OutputChannelPosition from ISO/IEC 23001-8 */
>
> I think most our API users and even developers have better use for 158 Swiss
> Francs than feeding the ISO. Please make the API self-sufficient; I suggest
> an enum.

I'll introduce an enumeration AVSpeakerPosition. However, the spec is 
available publicly at 
http://standards.iso.org/ittf/PubliclyAvailableStandards/c062088_ISO_IEC_23001-8_2013.zip 
. (And the table of output channel positions starts at page 24.)

> Le septidi 7 fructidor, an CCXXIV, erkki.seppala.ext at nokia.com a écrit :
>> From: Erkki Seppälä <erkki.seppala.ext at nokia.com>
>>
>> Added support for passing complex channel layout configuration as side
>> packet data (AV_PKT_DATA_AUDIO_CHANNEL_LAYOUT,
>> AV_PKT_DATA_AUDIO_CHANNEL_PREDEFINED_LAYOUT,
>> AV_PKT_DATA_AUDIO_CHANNEL_LAYOUT_OBJECT_STRUCTURED) to ISO media files
>> as specified by ISO/IEC 14496-12.
>>
>> This information isn't integrated into the existing channel layout
>> system though,
>
> I think it needs to be.

Unfortunately properly integrating this information into FFmpeg sounds 
larger than the scope of my project. I imagine this still would be very 
helpful base work should such information be actually supported by 
various parts of FFmpeg.

A special bit in the AV_CH_LAYOUT_* could be used for indicating that a 
special layout exists, but that information by itself is nowhere enough 
to work with this data with ie. in the functions available in 
channel_layout.c. And if codec or client code is able to deal with the 
advanced channel layout configurations, they also have access to this 
side data.

I also believe is is easier to study more deeper integration once a 
non-intrusive possibility to work with advanced ISOBMFF channel layouts 
exists.

Here is how the mapping on speaker position level might be done:

/** Speaker positions according to ISO/IEC 23001-8 */
enum AVSpeakerPosition {
!    AV_SPEAKER_POSITION_L      = 0,   /// Left front
!   AV_SPEAKER_POSITION_R      = 1,   /// Right front
     AV_SPEAKER_POSITION_C      = 2,   /// Centre front
     AV_SPEAKER_POSITION_LFE    = 3,   /// Low frequency enhancement
     AV_SPEAKER_POSITION_LS     = 4,   /// Left surround
     AV_SPEAKER_POSITION_RS     = 5,   /// Right surround
!   AV_SPEAKER_POSITION_LC     = 6,   /// Left front centre
!   AV_SPEAKER_POSITION_RC     = 7,   /// Right front centre
     AV_SPEAKER_POSITION_LSR    = 8,   /// Rear surround left
     AV_SPEAKER_POSITION_RSR    = 9,   /// Rear surround right
     AV_SPEAKER_POSITION_CS     = 10,  /// Rear centre
     AV_SPEAKER_POSITION_LSD    = 11,  /// Left surround direct
     AV_SPEAKER_POSITION_RSD    = 12,  /// Right surround direct
     AV_SPEAKER_POSITION_LSS    = 13,  /// Left side surround
     AV_SPEAKER_POSITION_RSS    = 14,  /// Right side surround
!    AV_SPEAKER_POSITION_LW     = 15,  /// Left wide front
!    AV_SPEAKER_POSITION_RW     = 16,  /// Right wide front
     AV_SPEAKER_POSITION_LV     = 17,  /// Left front vertical height
     AV_SPEAKER_POSITION_RV     = 18,  /// Right front vertical height
     AV_SPEAKER_POSITION_CV     = 19,  /// Centre front vertical height
     AV_SPEAKER_POSITION_LVR    = 20,  /// Left surround vertical height 
rear
     AV_SPEAKER_POSITION_RVR    = 21,  /// Right surround vertical 
height rear
     AV_SPEAKER_POSITION_CVR    = 22,  /// Centre vertical height rear
     AV_SPEAKER_POSITION_LVSS   = 23,  /// Left vertical height side 
surround
     AV_SPEAKER_POSITION_RVSS   = 24,  /// Right vertical height side 
surround
     AV_SPEAKER_POSITION_TS     = 25,  /// Top centre surround
     AV_SPEAKER_POSITION_LFE2   = 26,  /// E2 Low frequency enhancement 2
     AV_SPEAKER_POSITION_LB     = 27,  /// Left front vertical bottom
     AV_SPEAKER_POSITION_RB     = 28,  /// Right front vertical bottom
     AV_SPEAKER_POSITION_CB     = 29,  /// Centre front vertical bottom
     AV_SPEAKER_POSITION_LVS    = 30,  /// Left vertical height surround
     AV_SPEAKER_POSITION_RVS    = 31,  /// Right vertical height surround
                                       /// 32-45 Reserved
     AV_SPEAKER_POSITION_LFE3   = 36,  /// E3 Low frequency enhancement 3
     AV_SPEAKER_POSITION_LEOS   = 37,  /// Left edge of screen
     AV_SPEAKER_POSITION_REOS   = 38,  /// Right edge of screen
     AV_SPEAKER_POSITION_HWBCAL = 39,  /// half-way between centre of 
screen and left edge of screen
     AV_SPEAKER_POSITION_HWBCAR = 40,  /// half-way between centre of 
screen and right edge of screen
     AV_SPEAKER_POSITION_LBS    = 41,  /// Left back surround
     AV_SPEAKER_POSITION_RBS    = 42,  /// Right back surround
                                       /// 43–125 Reserved
     AV_SPEAKER_POSITION_EXPL   = 126, /// Explicit position (see text)
                                       /// 127 Unknown / undefined
};

I suppose if this is preferred, then one for 23001-8 channel layouts 
would be preferred as well..

It does seem confusing though that similar #defines are available in 
channel_layout.h and I am a bit uncertain of a 1:1 mapping between the 
two. I've marked in above with ! the ones that I found a match from  the 
defines:

AV_CH_FRONT_LEFT             AV_SPEAKER_POSITION_L
AV_CH_FRONT_RIGHT            AV_SPEAKER_POSITION_R
AV_CH_FRONT_CENTER           AV_SPEAKER_POSITION_LC
AV_CH_LOW_FREQUENCY          AV_SPEAKER_POSITION_LFE
AV_CH_BACK_LEFT              AV_SPEAKER_POSITION_LBS
AV_CH_BACK_RIGHT             AV_SPEAKER_POSITION_RBS
AV_CH_FRONT_LEFT_OF_CENTER   AV_SPEAKER_POSITION_LC
AV_CH_FRONT_RIGHT_OF_CENTER  AV_SPEAKER_POSITION_RC
AV_CH_BACK_CENTER
AV_CH_SIDE_LEFT              AV_SPEAKER_POSITION_LSS
AV_CH_SIDE_RIGHT             AV_SPEAKER_POSITION_RSS
AV_CH_TOP_CENTER             AV_SPEAKER_POSITION_TS
AV_CH_TOP_FRONT_LEFT
AV_CH_TOP_FRONT_CENTER
AV_CH_TOP_FRONT_RIGHT
AV_CH_TOP_BACK_LEFT
AV_CH_TOP_BACK_CENTER
AV_CH_TOP_BACK_RIGHT
AV_CH_STEREO_LEFT
AV_CH_STEREO_RIGHT
AV_CH_WIDE_LEFT              AV_SPEAKER_POSITION_LW
AV_CH_WIDE_RIGHT             AV_SPEAKER_POSITION_RW
AV_CH_SURROUND_DIRECT_LEFT   AV_SPEAKER_POSITION_LSD
AV_CH_SURROUND_DIRECT_RIGHT  AV_SPEAKER_POSITION_RSD
AV_CH_LOW_FREQUENCY_2        AV_SPEAKER_POSITION_LFE2

If this approach were taken, then the missing speakers positions 
(channels?) would need to be added, conversion maps for 23001-8 would 
need to be implemented (and probably refusing if there are non-mappable 
positions). Then layout bitmaps would be mapped into 23001-8 ones or if 
the layout cannot be found, a custom one generated.. Personally I don't 
mind how separate&simple this system still is :).

>> 		 which is much more restricted compared to what the
>> standard permits.
>
> Certainly, but the bitfield channel layout system is enough for most cases
> and well tested. Any new system must work with it, not around it.

But as you may agree, such a modification would not be small. Certainly 
not one I would try as my first feature to FFmpeg.

> Please follow the surrounding style for Doxygen comments: /** and */ on a
> separate line, * at the beginning of each line.

Oops. I'll fix those.

>> +    /** The following are used if speaker_position == 126 */
>> +    int azimuth;
>> +    int elevation;
>
> Please document the units and references.

Done.

> Also, since it is repeated and packed, using a small type would make sense
> here if possible.

I'll use the same types as in the headers: int16_t and int8_t.

>> +} AVAudioTrackChannelPosition;
>> +
>> +/** Describes the channel layout (ie. speaker position) of a single audio track */
>> +typedef struct AVAudioTrackChannelLayout {
>> +    int nb_positions;
>
>> +    /** followed by: AVAudioTrackChannelPosition positions[nb_positions]; */
>
> AVAudioTrackChannelPosition positions[0];
>
> That way, computing the address is taken care of by the compiler without
> tricky pointer arithmetic, including alignment requirements.

Actually this is the way I had it in the first place, but compiler 
warnings convinced me not to do it. If I'll use positions[1] the 
allocations will end up slightly too large unless without slightly 
convoluted code.

Is there a preferred way? It doesn't seem like foo[1] is used anywhere.

> (We already use 0-sized final arrays once in vaapi_encode.h; since VA API is
> for Linux and BSD, it does not probe the crappy proprietary compilers
> support it. In that case, using 1 instead of 0 is fine.)

In particular I don't want to do it conditionally :).

>
>> +} AVAudioTrackChannelLayout;
>> +
>
>> +/** Describes the channel layout based on predefined layout of a single track
>> +    by providing the layout and the list of channels are are omitted */
>
> Could you make that clearer?
>
>> +typedef struct AVAudioTrackChannelPredefinedLayout {
>> +    int layout;        /** ChannelConfiguration from ISO/IEC 23001-8 */
>
>> +    int nb_omitted_channels;
>> +    /** followed by: char omitted_channels[nb_omitted_channels]; - non-zero indicates the channel is omitted */
>
> Is there a reasonable upper bound to nb_omitted_channels?

Yes, ISOBMFF supports only 64 channels for this mask. But perhaps even a 
higher number is usable for some special case scenarios, not involving 
ISOBMFF?

> Also, I think the naming is inconsistent: if nb_omitted_channels = 3 and
> omitted_channels[] = { 1, 0, 1 }, there are only two omitted channels, not
> three.

I agree that it seems a bit inconsistent.. Would an integer array 
enumerating the omitted channels be better? Of course this whole problem 
goes away if the structure is fixed-size (ie. uint8_t 
omitted_channels_bitmask[64 / 8] or the uint64_t you suggested).

>>  enum AVPacketSideDataType {
>>      AV_PKT_DATA_PALETTE,
>>
>> @@ -1556,7 +1585,27 @@ enum AVPacketSideDataType {
>>       * Configured the timed metadata parameters, such as the uri and
>>       * meta data configuration. The key is of type AVTimedMetadata.
>>       */
>> -    AV_PKT_DATA_TIMED_METADATA_INFO
>> +    AV_PKT_DATA_TIMED_METADATA_INFO,
>> +
>> +    /**
>> +     * Channel layout, describing the position of spakers for the
>> +     * channels of a track, following the structure
>
>> +     * AVAudioTrackChannelLayout.
>> +     */
>> +    AV_PKT_DATA_AUDIO_CHANNEL_LAYOUT,
>
> I think it would be a good idea if the name of the side data matches the
> name of the structure as much as possible: Track -> _TRACK_. Ditto for the
> others.
>
> Would it make sense to have a single side data type and an integer field at
> the beginning announcing the type of the rest:
>
> typedef struct AVComplexChannelLayout {
>     enum AVComplexChannelLayoutType type;
>     union {
>         uint64_t bitmask;
>         AVAudioTrackChannelLayout complete;
>         AVAudioTrackChannelPredefinedLayout predefined;
>         AVAudioTrackChannelLayoutObjectStructured object;
>     } layout;
> };

It does seem nice. The actual structs/enums might look like this:

typedef struct AVAudioTrackChannelLayout {
     int nb_positions;

     /** Contains nb_positions entries*/
     AVAudioTrackChannelPosition positions[1];
} AVAudioTrackChannelLayout;

typedef struct AVAudioTrackChannelPredefinedLayout {
     int layout;        /** ChannelConfiguration from ISO/IEC 23001-8 */
     uint64_t omitted_channels;
} AVAudioTrackChannelPredefinedLayout;

typedef enum AVComplexAudioTrackChannelLayoutType {
     AV_COMPLEX_CHANNEL_LAYOUT_PREDEFINED,
     AV_COMPLEX_CHANNEL_LAYOUT_CUSTOM
} AVComplexAudioTrackChannelLayoutType;

typedef struct AVComplexAudioChannelLayout {
     enum AVComplexAudioTrackChannelLayoutType type;
     union {
         AVAudioTrackChannelPredefinedLayout predefined;
         AVAudioTrackChannelLayout complete;
     };
} AVComplexAudioChannelLayout;

enum AVPacketSideDataType {
...

     /**
      * Channel layout, describing the position of speakers for the
      * channels of a track, following the structure
      * AVComplexAudioChannelLayout.
      */
     AV_PKT_DATA_COMPLEX_AUDIO_CHANNEL_LAYOUT,

     /**
      * The channel layout is object structured with the number of 
objects in an int (may accompany AV_PKT_DATA_COMPLEX_AUDIO_CHANNEL_LAYOUT)
      */
     AV_PKT_DATA_COMPLEX_AUDIO_TRACK_CHANNEL_LAYOUT_OBJECT_STRUCTURED,
} AVPacketSideDataType;

I'm not super happy about the long names (though tempted to 
s/AVAudio/AVComplexAudio/), but what is the alternative :).

Thanks for reviewing and feedback!