[FFmpeg-devel] PATCH: allow load_input_picture, load_input_picture to be architecture dependent

Michael Niedermayer michaelni
Tue Jul 24 22:38:45 CEST 2007


Hi

On Tue, Jul 24, 2007 at 02:08:25PM -0400, Marc Hoffman wrote:
> Hi
> 
> On 7/23/07, Michael Niedermayer <michaelni at gmx.at> wrote:
> >
> > Hi
> >
> > On Mon, Jul 23, 2007 at 06:06:23PM -0400, Robin Getz wrote:
> > > On Thu 19 Jul 2007 09:11, Michael Niedermayer pondered:
> > > > On Thu, Jul 19, 2007 at 07:35:55AM -0400, Marc Hoffman wrote:
> > > > > That would be me, plus the other folks using Blackfin in real
> > > > > systems, who are waiting for better system performance.
> > > >
> > > > doing the copy in the background like you originally did requires
> > > > a few more modifications than you made, that is you would have to add
> > > > checks at several points so that we don't read the buffer before the
> > > > specific part has been copied. this sounds quite hackish and I am not
> > > > happy about it
> > >
> > > architecture specific optimisations are never a happy thing.
> >
> > no, most of them are clean and well separated, but this DMA memcpy thing
> > is a mess and has no chance to reach svn unless someone first shows that
> > all alternatives are worse (benchmarks absolutely required)
> > the alternatives are: using the preserve flag and changing ffmpeg.c,
> > or doing the DMA copy but waiting until it's done
> 
> 
> I have been thinking along these lines for the input image used in the
> mpegvideo encode process. The patch would be pretty clean, but we would
> incur a frame of delay for it to work correctly.  When I get the data into an
> easily reviewable format I will provide it to you.  I don't think Blackfin
> is the only processor that would benefit from this type of system
> optimization.

well, I have my doubts that this is even a good idea for Blackfin


[...]
> >
> > > > is mpeg4 encoding speed on blackfin really that important?
> > >
> > > There are lots of people waiting for it to get better than it is.
> > > (Like me)
> > >
> > > > can't you just optimize memcpy() in a compatible, non-background way?
> > >
> > > memcpy is already as optimized as it can be:
> > >   - it is already in assembly
> > >   - it does int (32-bit) copies when possible
> > >   - the loop comes down to:
> > >     MNOP || [P0++] = R3 || R3 = [I1++];
> > >     which is a read/write in a single instruction cycle (if things are
> > >     all in cache). This, coupled with zero-overhead hardware loops, makes
> > >     things as fast as they can be.
> > >
> > > The things that slow this down are cache misses, cache flushes, and
> > > external memory page open/close - things you can't avoid. If we could be
> > > doing compute at the same time, it could make up for some of these stalls.
> >
> > it should be faster to read several things and then write several things,
> > instead of read one, write it, read the next, write it, ...
> 
> 
> This does 4 samples at a time, and the memory system brings 32 bytes at a
> whack into the L1 caches.  It doesn't matter whether we move the data byte-wise
> or in quads; the problem is not here, it's in moving the data from external to
> internal memory.

there are 2 flaws in your reasoning
(let's assume the L1 cache is write-back; I've not checked whether it actually
is, but you will likely correct me if I am wrong ...)
if you read the first 4 bytes from cache line X, then the first thing that
happens is that the CPU writes the cache line out to memory if it was
dirty. next, the whole line is read from memory into the cache; depending on
the CPU, this may or may not stall the CPU until the whole line is done.
when you now write the 4 bytes, which also start at a cache line (I am assuming
things are aligned ...), then again the cache line is first written out if it
is dirty, and then the next line is possibly read (depending on the CPU again)

so depending on the exact cache architecture, your first 4-byte read and write
cause 2-4 cache lines (32 bytes each) to be transferred between memory and
cache, while the next 7 4-byte read/writes leave the memory idle
so depending on how well the Blackfin can do the cache line reads and writes
while the CPU accesses the lines, this could be very inefficient,
and reordering the accesses so that the cache line transfers are not all
concentrated at one spot could be a good idea

the second thing is that you can read more than 32 bytes before writing them,
thus reducing the penalty for doing random 32-byte transfers (I assume here
that sequential r/w is faster than random r/w with the memory used in Blackfin
systems)

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No human being will ever know the Truth, for even if they happen to say it
by chance, they would not even know they had done so. -- Xenophanes


