[FFmpeg-devel] IRC shit smearing

Jason Garrett-Glaser darkshikari
Sat Jan 16 03:58:07 CET 2010

> We surely don't gain L1 cache hits, a 16x16 block and all the variables
> easily fit in.

Are you so sure about this?  With hyperthreading on a Core i7, for
example, you have a total of 16K L1 cache per thread for data.  Is
this enough for everything, especially considering the overhead caused
by 64-byte cache lines?  I _strongly_ doubt it.  And even if you're
right, every byte we take out of L1 cache usage is a byte that is left
over for motion compensation to not cache-miss on.
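To put a rough number on the cache-line overhead, here is a minimal sketch that counts how many 64-byte lines a block of pixels actually occupies.  The helper names and the 1920-byte stride are made up for illustration; it assumes a line-aligned frame buffer whose stride is a multiple of 64 and at least 128, so that two rows never share a line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* bytes per cache line, as discussed above */

/*
 * Count the cache lines touched by a w x h block of bytes that starts
 * `offset` bytes into a frame buffer with row pitch `stride`.
 * Assumptions (hypothetical setup, not libavcodec code): the buffer base
 * is line-aligned, stride is a multiple of CACHE_LINE and at least
 * 2 * CACHE_LINE, and w <= CACHE_LINE, so two rows never share a line.
 */
static size_t lines_touched(size_t offset, size_t stride, size_t w, size_t h)
{
    size_t total = 0;

    assert(w > 0 && w <= CACHE_LINE);
    assert(stride % CACHE_LINE == 0 && stride >= 2 * CACHE_LINE);

    for (size_t y = 0; y < h; y++) {
        size_t first = (offset + y * stride) / CACHE_LINE;
        size_t last  = (offset + y * stride + w - 1) / CACHE_LINE;
        total += last - first + 1;  /* 1 line, or 2 if the row straddles */
    }
    return total;
}

/* Bytes of L1 occupied, as opposed to bytes actually used (w * h). */
static size_t bytes_occupied(size_t offset, size_t stride, size_t w, size_t h)
{
    return lines_touched(offset, stride, w, h) * CACHE_LINE;
}
```

Under these assumptions, bytes_occupied(0, 1920, 16, 16) gives 1024 bytes for 256 bytes of pixels, and bytes_occupied(56, 1920, 16, 16) gives 2048 bytes, which is an eighth of a 16K share for a single unaligned 16x16 read.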

> So what's left is keeping things in registers (not with gcc in this world IMHO),
> merging branches,
> and letting the CPU reorder instructions across decode and pixel_handling.
> That said, the loop filter needs cbp, mv, nnz if I am not mistaken, so it's
> still far from as localized as one might think.
> Also our other decoders are similarly split and are quite a bit faster than
> the alternatives _when_ our decoders were optimized by someone who invested
> a serious effort.
> In that sense I think that multithreaded frame decoding would gain us most,
> and after that I think there are many optimizations that would gain more
> speed per messiness than your suggestion above.
> But I would be very happy if you could elaborate on why such a rearchitecture
> would be faster in your opinion. (It does not seem particularly hard to do
> such a change, I just don't terribly like it and doubt its speed advantage.)

Benefits other than cache:

1) Not all data in fill_caches needs to be loaded; only the data
relevant to the current block.
2) It's generally more efficient to merge loops together.  For example, we do:

for( i = 0; i < 16; i++ ) { decode idct block() }
... later ...
for( i = 0; i < 16; i++ ) { if( nnz ) { idct } }

I would think:

for( i = 0; i < 16; i++ ) { if( decode idct block() > 0 ) { idct } }

would be more efficient.
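Spelled out as compilable C, the two loop shapes above might look like the sketch below.  decode_block() and idct_add() are hypothetical stand-ins, not the real libavcodec functions; they only do enough work to show that both shapes produce the same output, while the merged one needs a single 16-coefficient scratch buffer instead of 16 of them plus an nnz array.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS 16  /* 4x4 blocks per macroblock (luma) */
#define COEFS  16  /* coefficients per 4x4 block */

/* Hypothetical stand-in for entropy-decoding one block: copies the
 * coefficients and returns how many of them are nonzero. */
static int decode_block(const int16_t *src, int16_t *coef)
{
    int nnz = 0;
    for (int k = 0; k < COEFS; k++) {
        coef[k] = src[k];
        nnz += coef[k] != 0;
    }
    return nnz;
}

/* Hypothetical stand-in for IDCT + add: accumulates into the output. */
static void idct_add(int32_t *dst, const int16_t *coef)
{
    for (int k = 0; k < COEFS; k++)
        dst[k] += coef[k];
}

/* Split shape: decode everything first, run the IDCT pass later. */
static void decode_mb_split(const int16_t *src, int32_t *dst)
{
    int16_t coef[BLOCKS][COEFS];  /* all coefficients stay live */
    int nnz[BLOCKS];

    assert(src && dst);
    for (int i = 0; i < BLOCKS; i++)
        nnz[i] = decode_block(src + i * COEFS, coef[i]);
    /* ... later ... */
    for (int i = 0; i < BLOCKS; i++)
        if (nnz[i])
            idct_add(dst + i * COEFS, coef[i]);
}

/* Merged shape: transform each block right after decoding it. */
static void decode_mb_merged(const int16_t *src, int32_t *dst)
{
    int16_t coef[COEFS];  /* one block of scratch space */

    assert(src && dst);
    for (int i = 0; i < BLOCKS; i++)
        if (decode_block(src + i * COEFS, coef) > 0)
            idct_add(dst + i * COEFS, coef);
}
```

Whether the merged form is actually faster in the real decoder is something only a benchmark can show; the sketch only demonstrates the smaller working set and the single branch per block.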

3) More important than anything else...

This is the way CoreAVC does it.  Accordingly, everything is templated
for progressive/PAFF/MBAFF and CAVLC/CABAC.
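C has no templates, but a similar effect can be approximated with an inline function that takes the mode as a constant argument, plus one thin wrapper per combination.  This is only a sketch of that idea, not CoreAVC's code and not an existing libavcodec API; every name in it is hypothetical, and the "work" done per macroblock is a placeholder so the specialization is visible.

```c
#include <assert.h>

enum frame_mode   { PROGRESSIVE, PAFF, MBAFF };
enum entropy_mode { CAVLC, CABAC };

/* Generic body.  When called with compile-time constants and inlined,
 * an optimizing compiler can typically fold the mode branches away
 * (inlining is a hint here, not a guarantee). */
static inline int decode_mb_generic(int mb, enum frame_mode fm,
                                    enum entropy_mode em)
{
    int work = mb;
    work += (em == CABAC) ? 100 : 200;       /* placeholder entropy path */
    if (fm == MBAFF)      work += 3;         /* placeholder MBAFF path   */
    else if (fm == PAFF)  work += 2;         /* placeholder PAFF path    */
    else                  work += 1;         /* placeholder progressive  */
    return work;
}

/* One specialized entry point per mode combination. */
#define DECODE_MB_VARIANT(name, fm, em) \
    static int name(int mb) { return decode_mb_generic(mb, fm, em); }

DECODE_MB_VARIANT(decode_mb_prog_cavlc,  PROGRESSIVE, CAVLC)
DECODE_MB_VARIANT(decode_mb_prog_cabac,  PROGRESSIVE, CABAC)
DECODE_MB_VARIANT(decode_mb_paff_cavlc,  PAFF,        CAVLC)
DECODE_MB_VARIANT(decode_mb_paff_cabac,  PAFF,        CABAC)
DECODE_MB_VARIANT(decode_mb_mbaff_cavlc, MBAFF,       CAVLC)
DECODE_MB_VARIANT(decode_mb_mbaff_cabac, MBAFF,       CABAC)

typedef int (*decode_mb_fn)(int mb);

/* Pick the variant once (e.g. per slice) instead of branching per block. */
static decode_mb_fn select_decoder(enum frame_mode fm, enum entropy_mode em)
{
    static const decode_mb_fn table[3][2] = {
        { decode_mb_prog_cavlc,  decode_mb_prog_cabac  },
        { decode_mb_paff_cavlc,  decode_mb_paff_cabac  },
        { decode_mb_mbaff_cavlc, decode_mb_mbaff_cabac },
    };
    assert(fm <= MBAFF && em <= CABAC);
    return table[fm][em];
}
```

The cost is code size: six copies of the macroblock path instead of one, traded against removing the per-block mode checks from the inner loop.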

CoreAVC is a full 50% faster than libavcodec (as measured by Mans),
with a single thread, when using the exact same assembly code on the
exact same compiler.  And as someone who has read through the entire
codebase, I can say it does not have a single "major" optimization that
libavcodec doesn't, such as SIMD deblock-strength calculation.  There
is no magic here; if anything, ffmpeg has a large variety of
optimizations that CoreAVC doesn't have, such as ff_emulated_edge_mc.
This 50% cannot be made up in ffmpeg merely by tons of

If we want to compete, we should start by trying to do things the way
that faster decoders do them.

Dark Shikari
