[FFmpeg-devel] IRC shit smearing

Michael Niedermayer michaelni
Sat Jan 16 04:53:04 CET 2010


On Fri, Jan 15, 2010 at 09:58:07PM -0500, Jason Garrett-Glaser wrote:
> > We surely don't gain L1 cache hits; a 16x16 block and all the variables
> > easily fit in.
> 
> Are you so sure about this?  With hyperthreading on a Core i7, for
> example, you have a total of 16K L1 cache per thread for data.  Is
> this enough for everything, especially considering the overhead caused
> by 64-byte cache lines?  I _strongly_ doubt it. 

I would expect 16k to be plenty, but I didn't count it all.
For the actual coefficients it's just 768 bytes that we need, and with a
single IDCT at a time that would be 128 (due to the 8x8 IDCT).
All the pixels are needed anyway due to loop filtering,
and since you mention 64-byte cache lines, we could try to reorder fields in
H264Context to make the most of that.
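
(The 768-byte figure presumably corresponds to a 16x16 luma block plus two
8x8 chroma blocks at two bytes per coefficient, 512 + 256 bytes.)
As for reordering fields, a minimal sketch of the idea, using made-up field
names rather than the real H264Context layout: declaration order decides
which fields share a 64-byte cache line, so the per-macroblock state could
be packed together and the per-slice/per-stream configuration pushed to the
back.

#include <stdint.h>

/* Illustrative only; not the real H264Context layout or field names. */
typedef struct ExampleDecoderContext {
    /* hot: read or written for every macroblock */
    int      mb_x, mb_y, mb_xy;
    int      qscale;
    uint8_t  non_zero_count_cache[48];
    int8_t   intra4x4_pred_mode_cache[40];
    int16_t  mv_cache[2][40][2];

    /* cold: set up once per slice or per stream */
    int      width, height;
    int      profile, level;
    void    *priv;
} ExampleDecoderContext;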


> And even if you're
> right, every byte we take out of L1 cache usage is a byte that is left
> over for motion compensation to not cache-miss on.
> 
> > So what's left is keeping things in registers (not with gcc in this world IMHO),
> > merging branches,
> > and letting the CPU reorder instructions across decode and pixel handling.
> > That said, the loop filter needs cbp, mv and nnz if I am not mistaken, so it's
> > still far from as localized as one might think.
> >
> > Also, our other decoders are similarly split and are quite a bit faster than
> > the alternatives _when_ our decoders were optimized by someone who invested
> > a serious effort.
> > In that sense I think that multithreaded frame decoding would gain us most,
> > and after that I think there are many optimizations that would gain more
> > speed per messiness than your suggestion above.
> > But I would be very happy if you could elaborate on why such a rearchitecture
> > would be faster in your opinion. (It does not seem particularly hard to do
> > such a change, I just don't terribly like it and doubt its speed advantage.)
> 
> Benefits other than cache:
> 
> 1) Not all data in fill_caches needs to be loaded; only the stuff
> relevant to the current block.

Can you be more specific? We can and do already skip things in many places.


> 2) It's generally more efficient to merge loops together.  For example, we do:
> 
> for( i = 0; i < 16; i++ ) { decode idct block() }
> ... later ...
> for( i = 0; i < 16; i++ ) { if( nnz ) { idct } }
> 
> I would think:
> 
> for( i = 0; i < 16; i++ ) { if( decode idct block() > 0 ) { idct } }
> 
> would be more efficient.

Well, but as it is we can do 2, 4, 16 or however many IDCTs at once; with the
interleaved code that's no longer possible.
So while we gain something, we also lose something. It's not at all clear
which way would be faster. A benchmark of this would of course be very
interesting.
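
To make the trade-off concrete, here is a rough sketch of the two loop
shapes being compared, using placeholder function names
(decode_residual_block, idct_add, idct_add4) rather than the actual
libavcodec entry points:

#include <stdint.h>

/* Placeholder prototypes standing in for the real decode / IDCT calls;
 * they are not the actual libavcodec functions. */
int  decode_residual_block(void *ctx, int i, int16_t *coefs);
void idct_add (uint8_t *dst, int16_t *coefs, int stride);
void idct_add4(uint8_t *dst, int16_t *coefs, int stride, const uint8_t *nnz);

/* Split form (roughly the current shape): decode all 16 4x4 luma blocks
 * first, then run the IDCTs in a second pass, which allows a batched
 * idct_add4() to transform four blocks per call. */
static void decode_mb_split(void *ctx, uint8_t *dst, int stride,
                            const int *offset)
{
    int16_t coefs[16 * 16];
    uint8_t nnz[16];
    int i;

    for (i = 0; i < 16; i++)
        nnz[i] = decode_residual_block(ctx, i, coefs + i * 16) > 0;
    for (i = 0; i < 16; i += 4)
        idct_add4(dst + offset[i], coefs + i * 16, stride, &nnz[i]);
}

/* Interleaved form (the suggestion): transform each block right after it
 * is decoded, while its coefficients are still hot in L1, at the cost of
 * issuing only one IDCT at a time. */
static void decode_mb_interleaved(void *ctx, uint8_t *dst, int stride,
                                  const int *offset)
{
    int16_t coefs[16];
    int i;

    for (i = 0; i < 16; i++)
        if (decode_residual_block(ctx, i, coefs) > 0)
            idct_add(dst + offset[i], coefs, stride);
}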


> 
> 3) More important than anything else...
> 
> This is the way CoreAVC does it. 

Ohh no, that's not the kind of argument I like.


> Accordingly, everything is templated
> for progressive/PAFF/MBAFF and CAVLC/CABAC.

I did split CAVLC from CABAC; of course there is still stuff missing that
is checked at runtime, but it's quite trivial to change this.

If you think compiling our code 3 times for progressive frame / field / MBAFF
frame would make sense, then I'll work on that. I had this idea at least
once in the distant past but pushed it away due to it being, uhm, not
pretty.
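
For reference, a minimal sketch of what such a 3x compile could look like,
in the same spirit as the CAVLC/CABAC split; the macro, context struct and
function names below are all made up, not existing lavc code. The point is
that FRAME_MBAFF and FIELD_PICTURE become compile-time constants in each
instantiation, so the compiler drops the branches a given picture type can
never take:

#include <stdio.h>

typedef struct { int mb_field_decoding_flag; } ExampleH264Ctx; /* stand-in */

/* Expand the same macroblock-decoding body three times with different
 * compile-time settings; dead branches are removed per instantiation. */
#define DEFINE_DECODE_MB(suffix, FRAME_MBAFF, FIELD_PICTURE)            \
static void decode_mb_ ## suffix(ExampleH264Ctx *h)                     \
{                                                                       \
    if (FRAME_MBAFF && h->mb_field_decoding_flag)                       \
        puts("MBAFF field-pair path");                                  \
    else if (FIELD_PICTURE)                                             \
        puts("PAFF field-picture path");                                \
    else                                                                \
        puts("progressive path");                                       \
}

DEFINE_DECODE_MB(progressive, 0, 0)
DEFINE_DECODE_MB(field,       0, 1)
DEFINE_DECODE_MB(mbaff,       1, 0)

int main(void)
{
    ExampleH264Ctx h = { 1 };
    decode_mb_progressive(&h);  /* always takes the progressive branch */
    decode_mb_field(&h);        /* always takes the field branch       */
    decode_mb_mbaff(&h);        /* checks mb_field_decoding_flag       */
    return 0;
}

A common C way to do the same thing is to #include a shared template file
several times with different defines; the function-like macro above is just
the shortest self-contained way to show the effect.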


> 
> CoreAVC is a full 50% faster than libavcodec (as measured by Mans),

As measured by Mans on ARM, IIRC.


> with a single thread, when using the exact same assembly code on the
> exact same compiler.  And as someone who has read through the entire
> codebase, it does not have a single "major" optimization that
> libavcodec doesn't, such as SIMD deblock-strength calculation.  There
> is no magic here; if anything, ffmpeg has a large variety of
> optimizations that CoreAVC doesn't have, such as ff_emulated_edge_mc.

> This 50% cannot be made up in ffmpeg merely by tons of
> micro-optimizations.

If you think I ever proposed to do only micro-optimizations, then you
misunderstood me; just look at what kinds of optimizations and restructurings
were done on the h263 code years ago. There was a lot of deep rearchitecting.


> 
> If we want to compete, we should start by trying to do things the way
> that faster decoders do them.

I didn't do it that way for h263/mpeg4 and I still beat them all, at
least back then. Of course h263 was simpler and I spent a lot more time
on it. That said, I am of course very interested in what ideas we can
borrow from CoreAVC; I just don't like the argument "it's better because
CoreAVC does it": by that logic we would have to remove some optimizations
if we have tricks CoreAVC does not.

Also, it would help if you could make a todo / to-try list of what could be
optimized. I know you posted one mail and I still have that, as well as this
mail, but I think it would be better if this were all in one place to which
people could add (that is, roundup, or just commit a list to h264.c or an
h264todo.txt).

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No snowflake in an avalanche ever feels responsible. -- Voltaire


