[FFmpeg-devel] pre discussion around Blackfin dct_quantize_bfin routine

Michael Niedermayer michaelni
Wed Jun 13 01:39:42 CEST 2007


On Tue, Jun 12, 2007 at 09:24:22AM -0400, Marc Hoffman wrote:
> On 6/12/07, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> > On 6/12/07, Michael Niedermayer <michaelni at gmx.at> wrote:
> > > Hi
> > >
> > > On Tue, Jun 12, 2007 at 05:49:19AM -0400, Marc Hoffman wrote:
> > > > Do these allow me to ignore the DCT permutation?
> > >
> > > no, it would still break if the user selected the other integer idct
> >
> > Is it possible to add a configure option to be able to compile ffmpeg
> > only with IDCTs that do not need permutation (and do not allow the user
> > to select other idct)? At least it would eliminate table lookups in
> > many places (replace table lookups with a macro which expands either
> > to table lookup or the value itself). The point is that ARM devices
> > are heavily CPU limited and ARMv5TE optimized IDCT does not use
> > permutation. Blackfin powered devices may be CPU limited too (Marc can
> > probably provide more information about blackfin performance). I'll
> > try to do some benchmarks on ARM and post some results later.
> >
> On Blackfin you want to eliminate those permutations; they are costly.
> Basically, something like:
>     j=scantable[i];
>     x=data[j];
> expands into:
>     p0=[p1++];
>     3-cycle delay waiting for p0 to become valid.  Thank god it's interlocked.
>     r0=[p0];
> You don't really want to do this very often.  The execution pipeline
> looks something like this:
>     IF0 IF1 IF2 ID AC M0 M1 M2 EX WB
> AC: address calculation, where addresses are computed before they are
> fed into the memory pipe.
> Mx: memory access stages; they overlap with other things not needed
> for this discussion.
> IFx: instruction fetch
> ID: instruction decode
> WB: write back
> EX: execute (Blackfin actually has two execution stages; the other
> one overlaps with M2)
> There are 3 pipeline stages for accessing memory on these parts, and
> the feedback of the load into register p0 has to wait until the end of
> the pipeline before it can be used.

why not read 3 into 3 registers and then write them? doesn't this avoid
the delays?

> This is what I/we have to work with on these lighter-weight embedded
> processors.  We are talking about fairly simple microarchitectures in
> comparison to things like PPC and X86.  Actually, this pipeline layout
> works very well for numerical calculations that don't require
> permutations :).
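> #include <stdio.h>
> int main(void)
> {
>   int clk;
>   int target = 0;
>   int *mem[10];
>   mem[0] = &target;              /* give p0 a valid address to load from */
>   while (1) {
>     __asm__ __volatile__ (
>          "%0=cycles;\n\t"        /* start timestamp */
>          "p0=[%1];\n\t"          /* load a pointer into P0 */
>          "r0=[p0];\n\t"          /* dependent load: waits until P0 is valid */
>          "r0=cycles;\n\t"        /* end timestamp */
>          "%0=r0-%0 (ns);\n\t"    /* elapsed cycles, no saturation */
>          : "=d" (clk) : "a" (mem) : "R0", "P0", "memory");
>     printf("%d\n", clk);
>   }
>   return 0;
> }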
> results in 6.  Subtracting 1 for the final read of cycles gives 5; the two
> instructions that actually execute account for 2, leaving 3 dead cycles.  What is

benchmarking 2 instructions with a single iteration is meaningless, even on
a simple pipelined arch IMHO; you should at least do 10 read+write

also, decoding involves a mandatory permutation (the coefficient scan order),
so no matter what idct_permutation is set to, it will be the same speed; and
wisely setting the idct permutation can simplify the idct and thus speed
it up. this is a high level optimization and won't make code slower no matter
how expensive the permutation is, as no extra permutations are done

the extra cost is just on the encoder side, where it's just a single if()
if it's the no-permutation case ...


Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

it is not once nor twice but times without number that the same ideas make
their appearance in the world. -- Aristotle