[FFmpeg-devel] [PATCH] AVCHD/H.264 parser: determination of frame type, question about timestamps

Mon Jan 26 00:13:13 CET 2009

Michael Niedermayer wrote:
> On Fri, Jan 23, 2009 at 10:46:39PM +0100, Ivan Schreter wrote:
>   
>>>> +    while(buf<buf_end){
>>>> +        buf= ff_find_start_code(buf, buf_end, &state);
>>>> +        init_get_bits(&h->s.gb, buf, 8*(buf_end - buf));
>>>> +        switch(state & 0x1F){
>>>> [...]
>>>> +        case NAL_IDR_SLICE:
>>>> +        case NAL_SLICE:
>>>> +            get_ue_golomb(&h->s.gb);
>>>> +            s->pict_type= golomb_to_pict_type[get_ue_golomb(&h->s.gb) % 5];
>>>>     
>>>>         
>>> IIRC that should be get_ue_golomb_31() also its missing some check against
>>> negative values
>>>       
>> I don't believe so. First, slice_type is defined in H.264 standard in 
>> 7.3.3 (Slice header syntax) being ue(v), so not limited to 31. If I 
>>     
> this reasoning is flawed, to proof something is allowed you need to find
> an explicit statement that it is allowed or proof the non existence of a
> statement making it invalid.
> Showing that a single line of text at some point doesnt outright contradict
> it is not enough.
>   
Answer see below.
>> understand golomb.h correctly, this corresponds to get_ue_golomb(). Of 
>> course, since there are just 5 slice types, this value should be small, 
>> but some broken encoder could write something else in there. Do we want 
>> to live with it (and save just a single if per picture)? Further, it is 
>> unsigned, so it can't be negative. And modulo 5 makes it always being in 
>> the range 0..4, as in the array.
>>     
> -1 % 5
>   
You contradict yourself. On the one side you say get_ue_golomb_31() must 
be enough (yes, it is, since legal values for slice_type are 0..9), OTOH 
you want to check it for non-negative (_although_ legal values are in 
range 0..9, i.e., non-negative).

BTW, what I also find funny, get_ue_golomb() functions return int 
instead of unsigned int, although internal buffer is unsigned int. 
Similarly, set_ue_golomb() functions take int instead of unsigned int as 
argument. Maybe this should be corrected as well (and save a few checks 
and warnings in the process).

But this is peanuts compared to the really important stuff below...
> [...]
>   
>> I believe the only way is to use recovery point SEI message as 
>> documented in H.264 spec Chapter D.2.7 and count down recovery_frame_cnt 
>> frames before issuing a "key frame" in order for seeking to work 
>>     
>
> no, the keyframe flag must be set at the point of the SEI not after it
> you also should set convergence_duration
> Still even if we do just part of this its already a step forward, important
> is that we move toward the correct solution instead of adding random hacks.
>   
That's the question. If we set the key flag at SEI recovery point, how 
do you want to seek to proper place in an application using libavformat 
and not knowing anything about convergence_duration (at the moment all 
applications I checked)? As long as any application doesn't handle 
convergence_duration, it will be broken with H.264.

Anyway, even if the application did handle it, it's extremely hard to 
use interface (especially, since only a few codecs use it, so also hard 
to debug on application side).

For instance, imagine a video editing application (like kdenlive) needs 
to seek to frame X. There is a SEI recovery point at X-3 with 
convergence_duration == 5. How can the application now get to frame X? 
It asks to seek to X, which will position the stream at key frame at 
position X-3. However, in order to synchronize decoding, it has to move 
all the way up to frame X+2, which is not what the application wanted.

That's why IMHO the key frame should be SEI recovery point frame + 
convergence_duration and the decoder should take care of  initializing 
whatever internal buffers it has to be able to start decoding from this 
frame onwards. Taking the previous example and assuming another SEI 
recovery point at X-8 with convergence_duration 6, the decoder could 
seek to position X-8, decode reference frames up to X-2 and declare X-2 
as key frame. Since internal buffers are already filled, decoding X-2 
from the side of the application as key frame succeeds and it can 
advance further two frames to X to decode frame X correctly. And it 
doesn't have to take care of convergence_duration at all by itself 
(i.e., convergence_duration is ffmpeg-internal).

With new seeking API you proposed, handling of convergence duration 
(internal or by application) could be selected per flag.

BTW, I opened a new thread, since this thread is a little off-topic now. 
Please answer in the new thread, if possible.

> [...]
>> which is IMHO broken. Of course, we could communicate with it by setting 
>> pict_type to FF_I_TYPE for keyframes only (IDR frames and frames after 
>> recovery point), for other frames containing I- and P-slices to 
>> FF_P_TYPE and for B-frames to FF_B_TYPE. But I don't like it much. Any 
>> idea, how to do it correctly without the need to touch other codecs?
>>     
>
> pict_type from the parse context should likely be split in pict_type and
> keyframe
>   
Actually, we already have a flag field on AVPacket coming from the 
parser, but compute_pkt_fields() doesn't believe it and resets it based 
on pict_type from parse context instead.

>> Further, I found out that av_read_frame() sometimes reads only one field 
>> of a frame (especially in interlaced mpegts AVCHD files), which confuses 
>> the rest of ffmpeg. I suppose, this is a bug in h264_parse(), which 
>> returns a single field instead of whole frame (when a frame is coded as 
>> two fields in two pictures), but I didn't find a way yet, how to address 
>> the problem. Any idea?
>>     
> keep in mind that AFIAK h264 allows unpaired fields that is
> 1 frame, 1 field, 1 frame
> I have no samples that actually does this so we might get away with ignoring
> it but if we choose to not support it we should have some argument why
> it would be too complex ..
>   
Umm... You are referring to D.2.2 (picture timing semantics). I don't 
believe your example is possible. 1 frame, 1 field (top), 1 field 
(bottom), 1 frame is, but not with just a single field between two full 
frames (except for error situations, like bad data coming in over 
network). If I understand it correctly, frames are always a whole 
picture with both fields having same timestamp (at least it is implied 
by table D-1 by having NumClockTS=1 for frames). In your case, there 
would be missing second field for the second frame.

There is, however, a hypothetical possibility of a similar problem, 
i.e., having picture with 1 frame, picture with 1 top field, followed by 
picture coding 1 bottom field (of the second frame) and 1 top field (of 
the third frame), followed by picture with 1 bottom field (of the third 
frame). But I personally believe this will be a very very border case, 
if used by someone at all. It is very unnatural.

Of course, we still have the problem of frame doubling/tripling and 
having 3 fields per picture, eventually with one of them repeated 
(pic_struct codes 5-8).

BTW, the reply would probably again make sense in the new thread.

FYI, I started adding some additional constants to the h264 code for 
other SEI messages and for parsing SEI recovery point as the beginning, 
as this will be needed in all cases.

Thanks & regards,

Ivan