[FFmpeg-devel] PATCH: allow load_input_picture to be architecture dependent

Robin Getz rgetz
Tue Jul 24 00:06:23 CEST 2007

On Thu 19 Jul 2007 09:11, Michael Niedermayer pondered:
> On Thu, Jul 19, 2007 at 07:35:55AM -0400, Marc Hoffman wrote:
> > We would be me ++ folks using Blackfin in real systems that are
> > waiting for better system performance.
> doing the copy in the background like you originally did requires
> a few more modifications than you did, that is you would have to add
> checks to several points so that we don't read the buffer before the
> specific part has been copied, this sounds quite hackish and I am not
> happy about it

Architecture-specific optimisations are never a happy thing.

I would think that with the proper declarations and fallback defines

/* architecture-specific implementations: */
extern void non_blocking_memcpy(void *dest, const void *src, size_t n);
extern void non_blocking_memcpy_done(void *dest);

/* generic fallback for everyone else: */
#define non_blocking_memcpy(dest, src, n) memcpy(dest, src, n)
#define non_blocking_memcpy_done(dest)

it could be made less "hackish" - and still provide the optimisation.

> is mpeg4 encoding speed on blackfin really that important?

There are lots of people (like me) waiting for it to get better than it is.

> cant you just optimize memcpy() in a compatible non background way?

memcpy is already about as optimized as it can be:
  - it is already written in assembly
  - it does int (32-bit) copies when possible
  - the inner loop comes down to:
      MNOP || [P0++] = R3 || R3 = [I1++];
    which is a read and a write in a single instruction cycle (if
    everything is in cache). Coupled with zero-overhead hardware loops,
    this is as fast as it can be.
The things that slow this down are cache misses, cache flushes, and external
memory page open/close - things you can't avoid. If we could do computation
at the same time, it could hide some of these stalls.

Based on our profiling, the single most executed instruction is the above
read/write in the libc memcpy - about 10% of the total CPU load
(depending on the codec). That is pretty high, and a good candidate for
the kind of optimisation Marc is talking about.

This is again multiplied by the fact that the Blackfin architecture
(as well as others) has non-cached L1 memory that runs at core-clock
speed (like cache), but has no cache tags, and is therefore cheaper
to implement - but harder to write software for :)

This non-cached L1 area is where Marc does a lot of the video
processing: copy data into non-cached L1, run the computations on it,
and store the result back to L3.

Using the existing memcpy pollutes the data cache with the memory reads,
whereas the non-blocking version (since it would use DMA) would not.

