[FFmpeg-devel] [RFC] H.264/SVQ3 separation: h264data.h
Tue Dec 16 01:34:00 CET 2008
On Mon, Dec 15, 2008 at 01:33:49AM +0100, Michael Niedermayer wrote:
> On Sun, Dec 14, 2008 at 03:58:26PM -0800, Jason Garrett-Glaser wrote:
> > > no, the apparent slowdown of cathedral-beta2-400extra-crop-avc.mp4 is not ok
> > Why is this not ok, while disabling filter_mb_fast was OK? I measure
> > a 25% performance increase on a sample here by re-enabling
> > filter_mb_fast. Why is 0.5% here intolerable, but 25% there is just
> > fine?
> One has to weigh the advantage against the disadvantage
> here; the advantage of the change is largely cosmetic
You consider code maintainability cosmetic?
> disadvantage is a 0.5% speed loss; after 10 such changes our decoder is 5% slower
The separation of the two decoders can only be done once and will only
be done once. There is absolutely no danger of this becoming a
recurring occurrence.
> And those claiming randomness: well, if it's random, a different gcc version
> shouldn't show the speed loss, and in that case it could be reconsidered. Though
> again, if 4 out of 5 gcc versions show the speed loss, it still doesn't look too
> good nor random. But if 2 out of 4 are slower and 2 faster, it would be ok.
Do you claim that H.264 decoding performance is not sensitive to random
changes and even to changes unrelated to H.264?
> Also, I've developed and optimized other codecs, like the MPEG-4 variants, based
> on this concept, and they are pretty much the fastest around. Had I applied
> random changes that made the code 0.5% slower, they would not be as fast.
Well, our H.264 decoder is not the fastest around, and it still has areas
where it can be improved vastly. So skipping refactoring over 0.3% speed
in one sample - don't forget that the other one improved - is premature.
But most importantly, I think this obsession with speed at all costs is
hurting people's motivation and losing us much more than 0.3% of
performance. I know that I have already spent far too much time on this
split without real progress, and my motivation is going in one direction.
There have been tons of changes to h264.c that were not subjected to the
speed inquisition; everything PAFF-related springs to mind, for example.
The H.264/SVQ3 split is being singled out for unfathomable reasons.
Probably because I made the error of waking sleeping dogs by running
benchmarks in the first place.
Below is a slightly edited IRC log from last night. It should
speak volumes about developer motivation and places where *substantial*
speed improvements can be achieved.
00:18 <@Dark_Shikari> DonDiego: while you're at it, split h264.c into meaningful sections so that you don't have all of h264 decoding in one gigantic file
00:18 <@Dark_Shikari> residual decoding should go in cabac and cavlc c files, respectively
00:18 <@DonDiego> Dark_Shikari: why don't you split the file?
00:19 <@DonDiego> you know h.264 much better than i do..
00:19 <@Dark_Shikari> DonDiego: I don't maintain it, and I'm not prepared to go through the nightmare of trying to convince michael
00:19 <@DonDiego> what nightmare?
00:20 <@Dark_Shikari> you just posted above about your annoyance at michael not accepting the SVQ3 split
00:20 <@Dark_Shikari> which is far far far simpler than what I'm describing
00:20 <@Dark_Shikari> you know what I'm talking about
00:20 <@DonDiego> i wanted it spelled out
00:20 <@DonDiego> i don't think michael is opposed to the split you suggest
00:20 <@DonDiego> he knows that h264.c is far too large for its own good
00:21 <@Dark_Shikari> sure, but he'll probably find the split makes things like 0.1% slower
00:21 <@Dark_Shikari> and reject it
00:21 <@DonDiego> just his obsession with performance is insane
00:21 <@DonDiego> well, let me duke that out with him, if i manage to convince him, you can go one step further
00:22 <@Dark_Shikari> well the odd thing about it is that for example x264 is far far far more obsessed with performance than michael's ffh264
00:23 <@Dark_Shikari> ... yet we don't have to massively sacrifice code maintainability to do it
00:23 <@Dark_Shikari> so I'm not exactly sure why he's like this
00:23 <@Dark_Shikari> if he really cared that much he'd focus on actually making it faster..
00:23 <@superdump> more/better simd you mean?
00:24 <@Dark_Shikari> better simd, less overhead
00:24 <@Dark_Shikari> simd-wise, we've already done it for him
00:24 <@superdump> make the point
00:25 <@Dark_Shikari> ok, then I will
00:25 <@superdump> but you don't have the time/desire to port it over to ffh264?
00:25 <@Dark_Shikari> I don't have the time/desire to deal with michael when doing it
00:25 <@Dark_Shikari> the deblock code doesn't even NEED porting
00:26 <@Dark_Shikari> it can probably be pasted right in with zero changes
00:26 <@Dark_Shikari> there are just two problems with things:
00:26 <@Dark_Shikari> 1) It's GPL; this is not changing
00:26 <@Dark_Shikari> 2) It WILL CRASH with an unaligned stack, period
00:26 <@Dark_Shikari> so nobody will be able to get away with miscompiling lavc anymore
00:28 <@superdump> Dark_Shikari: do you have any kind of good estimate as to how much performance should be gained by reusing the x264 asm code?
00:29 <@Dark_Shikari> superdump: let me do the math
00:29 <@Dark_Shikari> all numbers will be using nehalem, as that's what I have to bench on with linux 64-bit
00:30 <@Dark_Shikari> actually, maybe not, as 64-bit doesn't compile x264's mmx asm for deblock
00:30 <@Dark_Shikari> as no 64-bit machine uses it
00:34 <@Dark_Shikari> ok, let us assume that x264's mmx deblock is 10% faster than lavc's inherently
00:34 <@Dark_Shikari> numbers on mmx vs sse in x264:
00:34 <@Dark_Shikari> sse2 luma V deblock :2.3x faster than mmx
00:35 <@superdump> on nehalem which is crazy fast
00:35 <@Dark_Shikari> SSE2 luma H deblock: 1.34x faster
00:35 <@Dark_Shikari> SSE2 luma H intra deblock: 1.44x faster
00:35 <@Dark_Shikari> SSE2 luma V intra deblock: 2.19x faster
00:35 <@Dark_Shikari> now let's grab an ffh264 profile
00:38 <@Dark_Shikari> 32-bit test, as the asm code has no mmx equivalent on 64-bit
00:38 <@Dark_Shikari> as the 64-bit asm uses, like, all of the registers
00:43 <@Dark_Shikari> oh btw, what's a good way to benchmark just decoding
00:43 <@Dark_Shikari> -f null?
00:43 <@mru> yes, and -an too
00:44 <@mru> if the clip has audio
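(For reference, the decode-only benchmark suggested above can be sketched as a small wrapper; the clip name is a hypothetical placeholder. `-an` drops the audio stream, `-f null -` discards decoded frames, and `-benchmark` makes ffmpeg print the CPU time it used.)

```shell
# Sketch of a decode-only benchmark, per the advice above.
# bench_decode <file>: decode video only, discard output, report timing.
bench_decode() {
    ffmpeg -benchmark -an -i "$1" -f null - 2>&1 | grep -i bench
}
# Example (hypothetical clip): bench_decode clip1080p.mp4
```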
00:44 <@Dark_Shikari> decoding 30 seconds of 1080p
00:45 <@Dark_Shikari> what the hell, filter_mb is 15.1% of runtime?
00:47 <@Dark_Shikari> 2485 15.1165 ffmpeg_g ffmpeg_g filter_mb
00:47 <@Dark_Shikari> 1576 9.5870 ffmpeg_g ffmpeg_g hl_decode_mb_simple
00:47 <@Dark_Shikari> 1446 8.7962 ffmpeg_g ffmpeg_g fill_caches
00:47 <@Dark_Shikari> 797 4.8482 ffmpeg_g ffmpeg_g put_h264_chroma_mc8_ssse3_rnd
00:48 <@Dark_Shikari> how can that be right? shouldn't filter_mb_fast be being run?
00:49 <@Dark_Shikari> Oh. I see where the problem comes from.
00:49 <@Dark_Shikari> h->pps.chroma_qp_diff
00:49 <@Dark_Shikari> filter_mb is run instead of filter_mb_fast when chroma_qp_offset is set.
00:49 <+uau> how did you profile it?
00:49 <@mru> evil sample?
00:49 <@Dark_Shikari> oprofile
00:49 <@Dark_Shikari> no no
00:49 <@Dark_Shikari> mru: x264 uses chroma qp offset by default
00:49 <+uau> how did you compile it?
00:49 <@Dark_Shikari> for over 3 months now
00:49 <@Dark_Shikari> uau: normally
00:49 <@Yuvi> I thought michael disabled filter_mb_fast a while ago
00:49 <+uau> there probably is no filter_mb_fast with default options
00:50 <@Dark_Shikari> no, uau, there is
00:50 <@Dark_Shikari> .... the issue is, lavc has an extremely nice loop filter calculation SIMD function
00:50 <@Dark_Shikari> written in MMX
00:50 <@Dark_Shikari> it is awesome
00:50 <@Dark_Shikari> ... but it isn't being used
00:50 <@Dark_Shikari> because it doesn't run if there's a chroma qp offset
00:50 <+uau> why would there be such a function?
00:50 <+uau> it's static and used in only one place
00:50 <@Dark_Shikari> this is retarded, since it's trivial to deal with a chroma qp offset
00:50 <@Dark_Shikari> uau: huh? because its much faster
00:50 <+uau> i mean the binary
00:51 <+uau> why would the binary have such a function
00:51 <@Dark_Shikari> what do you mean
00:51 <@Dark_Shikari> what binary
00:51 <@Dark_Shikari> it isn't inlined, no
00:51 <+uau> why not?
00:51 <@Dark_Shikari> gcc does not inline everything that's only used once and is static
00:52 <+uau> it is inlined here
00:53 <+uau> no filter_mb_fast in h264.o
00:53 <@Dark_Shikari> ok, you're right, it isn't here either--but it isn't being used regardless
00:53 <@Dark_Shikari> I'll send an email to the ML about it
00:54 <@Dark_Shikari> actually, wait, this can't be right
00:54 <@Dark_Shikari> chroma_qp_diff is when the chroma qp offsets differ
00:54 <@Dark_Shikari> NOT when there is an offset
00:54 <@Dark_Shikari> oh, I see. michael did disable filter_mb_fast, singlehandedly crippling decode speed.
00:54 <@Dark_Shikari> if(mb_x==0 || mb_y==mb_y_firstrow || !s->dsp.h264_loop_filter_strength || h->pps.chroma_qp_diff ||
00:54 <@Dark_Shikari> 1 ||
00:54 <@Dark_Shikari> gg michael
00:55 <@Dark_Shikari> way to write one of the best asm functions I've ever seen and then NOT USE IT
00:55 <@Dark_Shikari> that's why it isn't in your binary btw, it's never called
00:55 <@Dark_Shikari> or, more exactly, it instantly terminates
00:56 <@Dark_Shikari> fps went from 44 to 55 when re-enabling filter_mb_fast.
00:57 <@superdump> why is it disabled?
01:00 <@mru> wasn't there a hard to find bug or something?
01:01 <@Dark_Shikari> dunno, but I would prefer slightly wrong but fast decoding
01:01 <@Dark_Shikari> bitexact flags exist for a reason
01:03 <@Dark_Shikari> also, I find it unreasonable that fill_caches takes almost 10% of decoding time