[FFmpeg-devel] [PATCH] avoid mb_xy recalculation in h264
Sat May 10 13:56:03 CEST 2008
On Wed, May 07, 2008 at 10:42:04PM -0400, Alexander Strange wrote:
> On May 7, 2008, at 3:15 PM, Michael Niedermayer wrote:
> > On Wed, May 07, 2008 at 02:49:20PM -0400, Alexander Strange wrote:
> >> On May 7, 2008, at 8:53 AM, Michael Niedermayer wrote:
> >>> On Wed, May 07, 2008 at 03:05:47AM -0400, Alexander Strange wrote:
> >>>> This line appears in a lot of h264.c functions:
> >>>> const int mb_xy= s->mb_x + s->mb_y*s->mb_stride;
> >>>> It only changes once per MB, so we can add it to the context and
> >>>> only
> >>>> recalculate it in decode_slice().
> >>>> I put it at the end of H264Context, but it could be moved up
> >>>> earlier if
> >>>> someone likes it there. The position doesn't seem to affect
> >>>> performance
> >>>> much; it can't be moved next to mb_x/mb_y since that would put it
> >>>> in
> >>>> MpegEncContext, and none of the other codecs use it.
> >>>> Patches:
> >>>> 1- does that, updating it for MBAFF when needed
> >>>> 2- removes some newly unused variables
> >>>> 3- does the same thing to svq3.c
> >>>> Before: avg 8.799s max 8.8s min 8.794s
> >>>> After: avg 8.645s max 8.651s min 8.641s
> >>> Knowing the cpu and compiler would also be usefull.
> >> gcc 4.0.1, core 2 x86-32 Darwin
> >> I guess it's about the same as the usual distro compiler, but this
> >> isn't
> >> really a sensitive messing-with-the-register-allocator patch.
> > How does the change compare to a splited (litterally duplicated)
> > decode_cabac_residual()
> > with the mb_xy only calculated for the "cat"s which actually need it?
> > That is one for cat=0/3 and one for the others
> > or
> > with the mb_xy calc moved in the if(cat=0/3) which use it?
> Actually, the compiled decode_cabac_residual is barely different at
> all. gcc loads h->mb_xy, immediately spills it to the same place in
> the stack, and the rest of the function is the same. I suppose
> changing mb_xy to h->mb_xy for the DC cats might be an improvement.
> Diffing the asm shows:
> * less imull
> * hl_motion is different (mb_x load moved into an if())
> * write_back_intra_pred_mode() is inlined and removed
> * slightly different stack sizes (hard to know if that means anything)
> Making write_back_intra_pred_mode() always_inline doesn't change cavlc
> time, but explains about half the speed change for cabac.
My question is still the same, how does adding mb_xy into the context
compare (speedwise/benchmark) to changing the decode_cabac_residual()
so it does not use mb_xy 90% of the time?
I think the main speed difference (besides inline changes) is in
its kinda logic if you consider that mb_xy is calculated 16 times alone
for the luma blocks but not used at all. Changing that to loading it from the
context is not correct. decode_cabac_residual() should not touch mb_xy
when it does not need it.
That is what id like to see is
How much speed difference there is between svn and the calculation of
mb_xy moved into the cats which need it?
How much speed gain we can achive by making write_back_intra_pred_mode()
inline if any.
What speed differenc remains between the best combination of above and in
addition putting mb_xy in the context?
What iam trying to do is chop the changes up into small parts and see their
individual contriution, so we do not commit something which might have a
negative speed impact by mistake.
Also if we know if X is better inlined or not inlined we can force gcc
to always do what is better.
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Dictatorship naturally arises out of democracy, and the most aggravated
form of tyranny and slavery out of the most extreme liberty. -- Plato
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Size: 189 bytes
Desc: Digital signature
More information about the ffmpeg-devel