[FFmpeg-devel] [RFC] H.264/SVQ3 separation: h264data.h
Tue Dec 16 01:34:00 CET 2008
On Mon, Dec 15, 2008 at 01:33:49AM +0100, Michael Niedermayer wrote:
> On Sun, Dec 14, 2008 at 03:58:26PM -0800, Jason Garrett-Glaser wrote:
> > > no, the apparent slowdown of cathedral-beta2-400extra-crop-avc.mp4 is not ok
> > Why is this not ok, while disabling filter_mb_fast was OK? I measure
> > a 25% performance increase on a sample here by re-enabling
> > filter_mb_fast. Why is 0.5% here intolerable, but 25% there is just
> > fine?
> One has to weigh the advantage against the disadvantage
> here; the advantage of the change is largely cosmetic
You consider code maintainability cosmetic?
> disadvantage is a 0.5% speed loss; after 10 such changes our decoder is 5% slower
The separation of the two decoders can only be done once and will only
be done once. There is absolutely no danger of this becoming a
recurring occurrence.
> And those claiming randomness: well, if it's random, a different gcc version
> shouldn't show the speed loss, and in that case it could be reconsidered. Though
> again, if 4 out of 5 gcc versions show the speed loss, it still doesn't look too
> good nor random. But if 2 out of 4 are slower and 2 faster, it would be ok.
Do you claim that H.264 decoding performance is not sensitive to random
changes and even to changes unrelated to H.264?
> Also, I've developed and optimized other codecs, like the MPEG-4 variants, based
> on this concept, and they are pretty much the fastest around. Had I applied
> random changes that made the code 0.5% slower, they would not be as fast.
Well, our H.264 decoder is not the fastest around, and it still has areas
where it can be improved vastly. So skipping refactoring over 0.3% speed
in one sample - don't forget that the other one improved - is premature.
But most importantly, I think this obsession with speed at all costs is
hurting people's motivation and losing us much more than 0.3% of
performance. I know that I have already spent far too much time on this
split without real progress, and my motivation is going in one direction.
There have been tons of changes to h264.c that were not subjected to the
speed inquisition; everything PAFF-related springs to mind, for example.
The H.264/SVQ3 split is being singled out for unfathomable reasons.
Probably because I made the error of waking sleeping dogs by running
benchmarks in the first place.
Below is a slightly edited IRC log from last night. It should
speak volumes about developer motivation and places where *substantial*
speed improvements can be achieved.
00:18 <@Dark_Shikari> DonDiego: while you're at it, split h264.c into meaningful sections so that you don't have all of h264 decoding in one gigantic file
00:18 <@Dark_Shikari> residual decoding should go in cabac and cavlc c files, respectively
00:18 <@DonDiego> Dark_Shikari: why don't you split the file?
00:19 <@DonDiego> you know h.264 much better than i do..
00:19 <@Dark_Shikari> DonDiego: I don't maintain it, and I'm not prepared to go through the nightmare of trying to convince michael
00:19 <@DonDiego> what nightmare?
00:20 <@Dark_Shikari> you just posted above about your annoyance at michael not accepting the SVQ3 split
00:20 <@Dark_Shikari> which is far far far simpler than what I'm describing
00:20 <@Dark_Shikari> you know what I'm talking about
00:20 <@DonDiego> i wanted it spelled out
00:20 <@DonDiego> i don't think michael is opposed to the split you suggest
00:20 <@DonDiego> he knows that h264.c is far too large for its own good
00:21 <@Dark_Shikari> sure, but he'll probably find the split makes things like 0.1% slower
00:21 <@Dark_Shikari> and reject it
00:21 <@DonDiego> just his obsession with performance is insane
00:21 <@DonDiego> well, let me duke that out with him, if i manage to convince him, you can go one step further
00:22 <@Dark_Shikari> well the odd thing about it is that for example x264 is far far far more obsessed with performance than michael's ffh264
00:23 <@Dark_Shikari> ... yet we don't have to massively sacrifice code maintainability to do it
00:23 <@Dark_Shikari> so I'm not exactly sure why he's like this
00:23 <@Dark_Shikari> if he really cared that much he'd focus on actually making it faster..
00:23 <@superdump> more/better simd you mean?
00:24 <@Dark_Shikari> better simd, less overhead
00:24 <@Dark_Shikari> simd-wise, we've already done it for him
00:24 <@superdump> make the point
00:25 <@Dark_Shikari> ok, then I will
00:25 <@superdump> but you don't have the time/desire to port it over to ffh264?
00:25 <@Dark_Shikari> I don't have the time/desire to deal with michael when doing it
00:25 <@Dark_Shikari> the deblock code doesn't even NEED porting
00:26 <@Dark_Shikari> it can probably be pasted right in with zero changes
00:26 <@Dark_Shikari> there are just two problems with things:
00:26 <@Dark_Shikari> 1) It's GPL; this is not changing
00:26 <@Dark_Shikari> 2) It WILL CRASH with an unaligned stack, period
00:26 <@Dark_Shikari> so nobody will be able to get away with miscompiling lavc anymore
00:28 <@superdump> Dark_Shikari: do you have any kind of good estimate as to how much performance should be gained by reusing the x264 asm code?
00:29 <@Dark_Shikari> superdump: let me do the math
00:29 <@Dark_Shikari> all numbers will be using nehalem, as that's what I have to bench on with linux 64-bit
00:30 <@Dark_Shikari> actually, maybe not, as 64-bit doesn't compile x264's mmx asm for deblock
00:30 <@Dark_Shikari> as no 64-bit machine uses it
00:34 <@Dark_Shikari> ok, let us assume that x264's mmx deblock is 10% faster than lavc's inherently
00:34 <@Dark_Shikari> numbers on mmx vs sse in x264:
00:34 <@Dark_Shikari> sse2 luma V deblock :2.3x faster than mmx
00:35 <@superdump> on nehalem which is crazy fast
00:35 <@Dark_Shikari> SSE2 luma H deblock: 1.34x faster
00:35 <@Dark_Shikari> SSE2 luma H intra deblock: 1.44x faster
00:35 <@Dark_Shikari> SSE2 luma V intra deblock: 2.19x faster
00:35 <@Dark_Shikari> now let's grab an ffh264 profile
00:38 <@Dark_Shikari> 32-bit test, as the asm code has no mmx equivalent on 64-bit
00:38 <@Dark_Shikari> as the 64-bit asm uses, like, all of the registers
00:43 <@Dark_Shikari> oh btw, what's a good way to benchmark just decoding
00:43 <@Dark_Shikari> -f null?
00:43 <@mru> yes, and -an too
00:44 <@mru> if the clip has audio
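(For reference, the decode-only benchmark suggested above can be sketched as a small wrapper; the clip name is a hypothetical placeholder. `-an` drops the audio stream, `-f null -` discards decoded frames, and `-benchmark` makes ffmpeg print the CPU time it used.)

```shell
# Sketch of a decode-only benchmark, per the advice above.
# bench_decode <file>: decode video only, discard output, report timing.
bench_decode() {
    ffmpeg -benchmark -an -i "$1" -f null - 2>&1 | grep -i bench
}
# Example (hypothetical clip): bench_decode clip1080p.mp4
```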
00:44 <@Dark_Shikari> decoding 30 seconds of 1080p
00:45 <@Dark_Shikari> what the hell, filter_mb is 15.1% of runtime?
00:47 <@Dark_Shikari> 2485 15.1165 ffmpeg_g ffmpeg_g filter_mb
00:47 <@Dark_Shikari> 1576 9.5870 ffmpeg_g ffmpeg_g hl_decode_mb_simple
00:47 <@Dark_Shikari> 1446 8.7962 ffmpeg_g ffmpeg_g fill_caches
00:47 <@Dark_Shikari> 797 4.8482 ffmpeg_g ffmpeg_g put_h264_chroma_mc8_ssse3_rnd
00:48 <@Dark_Shikari> how can that be right? shouldn't filter_mb_fast be being run?
00:49 <@Dark_Shikari> Oh. I see where the problem comes from.
00:49 <@Dark_Shikari> h->pps.chroma_qp_diff
00:49 <@Dark_Shikari> filter_mb is run instead of filter_mb_fast when chroma_qp_offset is set.
00:49 <+uau> how did you profile it?
00:49 <@mru> evil sample?
00:49 <@Dark_Shikari> oprofile
00:49 <@Dark_Shikari> no no
00:49 <@Dark_Shikari> mru: x264 uses chroma qp offset by default
00:49 <+uau> how did you compile it?
00:49 <@Dark_Shikari> for over 3 months now
00:49 <@Dark_Shikari> uau: normally
00:49 <@Yuvi> I thought michael disabled filter_mb_fast a while ago
00:49 <+uau> there probably is no filter_mb_fast with default options
00:50 <@Dark_Shikari> no, uau, there is
00:50 <@Dark_Shikari> .... the issue is, lavc has an extremely nice loop filter calculation SIMD function
00:50 <@Dark_Shikari> written in MMX
00:50 <@Dark_Shikari> it is awesome
00:50 <@Dark_Shikari> ... but it isn't being used
00:50 <@Dark_Shikari> because it doesn't run if there's a chroma qp offset
00:50 <+uau> why would there be such a function?
00:50 <+uau> it's static and used in only one place
00:50 <@Dark_Shikari> this is retarded, since it's trivial to deal with a chroma qp offset
00:50 <@Dark_Shikari> uau: huh? because its much faster
00:50 <+uau> i mean the binary
00:51 <+uau> why would the binary have such a function
00:51 <@Dark_Shikari> what do you mean
00:51 <@Dark_Shikari> what binary
00:51 <@Dark_Shikari> it isn't inlined, no
00:51 <+uau> why not?
00:51 <@Dark_Shikari> gcc does not inline everything that's only used once and is static
00:52 <+uau> it is inlined here
00:53 <+uau> no filter_mb_fast in h264.o
00:53 <@Dark_Shikari> ok, you're right, it isn't here either--but it isn't being used regardless
00:53 <@Dark_Shikari> I'll send an email to the ML about it
00:54 <@Dark_Shikari> actually, wait, this can't be right
00:54 <@Dark_Shikari> chroma_qp_diff is when the chroma qp offsets differ
00:54 <@Dark_Shikari> NOT when there is an offset
00:54 <@Dark_Shikari> oh, I see. michael did disable filter_mb_fast, singlehandedly crippling decode speed.
00:54 <@Dark_Shikari> if(mb_x==0 || mb_y==mb_y_firstrow || !s->dsp.h264_loop_filter_strength || h->pps.chroma_qp_diff ||
00:54 <@Dark_Shikari> 1 ||
00:54 <@Dark_Shikari> gg michael
00:55 <@Dark_Shikari> way to write one of the best asm functions I've ever seen and then NOT USE IT
00:55 <@Dark_Shikari> that's why it isn't in your binary btw, it's never called
00:55 <@Dark_Shikari> or, more exactly, it instantly terminates
00:56 <@Dark_Shikari> fps went from 44 to 55 when re-enabling filter_mb_fast.
00:57 <@superdump> why is it disabled?
01:00 <@mru> wasn't there a hard to find bug or something?
01:01 <@Dark_Shikari> dunno, but I would prefer slightly wrong but fast decoding
01:01 <@Dark_Shikari> bitexact flags exist for a reason
01:03 <@Dark_Shikari> also, I find it unreasonable that fill_caches takes almost 10% of decoding time