[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #2

Thu Aug 23 03:01:34 CEST 2007

Hi

On Thu, Aug 23, 2007 at 02:48:41AM +0200, Balatoni Denes wrote:
> Hi!
> 
> On Thursday 23 August 2007 at 01:29, Michael Niedermayer wrote:
> > > In the row iteration it is not only permuted, but also shifted right four
> > > bits. But there is no shift instruction. So if you know a significantly
> > > faster way to shift the input right four bits, then do tell me.
> >
> > there is a shift instruction, sllx, where's the problem with using that?
> 
> Well, you can't move between floating point and integer registers. So there 
> would be some additional storing to memory, reading from memory, some masking 
> is still needed, then the shift - all in all it's the same speed or slower 
> than 4 adds. As I already said, I don't really like that, because of the 
> marginal speedup and the added complexity.

I don't see where you would need masking, the MSBs should be 0.
Additionally, there is block_last_index, which allows you to skip
90% of the coeffs (as it tells you after which point all coeffs are 0).
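
A minimal sketch of that idea in C. The helper name scale_coeffs and the
scan[] table are hypothetical, not FFmpeg API; scan[] is assumed to map a
scan position to an index in block[], and last_index stands in for
block_last_index:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale only the coefficients that can be non-zero.
 * Everything after scan position last_index is known to be 0, so it is
 * never touched. */
static void scale_coeffs(int16_t block[64], const uint8_t scan[64],
                         int last_index)
{
    assert(last_index >= 0 && last_index < 64);
    for (int i = 0; i <= last_index; i++) {
        int16_t *c = &block[scan[i]];
        /* multiply by 16 instead of << 4 so negative values are well defined */
        *c = (int16_t)(*c * 16);
    }
}
```

With last_index = 1 the loop touches two coefficients and leaves the other
62 alone, which is where the saving would come from.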

> 
> > also I am realizing now that you read and work with just 32 bits at a time
> > while the registers really are 64-bit,
> > so unless SPARC needs 2x as much time for 64-bit instructions this is very
> > inefficient
> 
> Now I am kind of puzzled. I am using 64 bit registers. Like f0+f1 is one 64bit 
> register. f32, f34, ...f62 are 64 bit registers (these can't even be accessed 
> in 32 bit parts). So I really don't understand what you are saying. The big 
> macro computes 4 rows in parallel, how could it do that, without using 64 bit 
> registers?

hmm ok, the registers are split in 2, I didn't know that (this design is
extremely bad for a RISC CPU as it makes out-of-order execution very hard)

still, the loads are 32-bit
so let's check if I finally figured out how SPARC asm works
1. you load everything by using 32-bit loads into the low and high
   halves of 64-bit registers
2. you duplicate the input and, on each side, mask half of the 16-bit values
   away
3. you use fpackfix to shift half the input left by 4 bits and pack the 2 16-bit
   values, which are separated by 16 zero bits, into a 32-bit register
4. you subtract the other half from 2048
5. you do the same fpackfix on the second half
...

ok, let's see
1. 8 instructions are useless, you can use 64-bit loads
2. all 16 instructions are unneeded
3+5. (16 instructions) are unneeded, you can quickly shift the coeffs up to
    block_last_index by using C code
4. the subtract is effectively done using 32 bits (half of the registers
   are 0, aka unused)

so again
fix the permutation
shift left by 4 bits using C code or asm, both stopping after block_last_index
do the 2048*(1<<4) subtraction, if needed, per 64 bits

at this point you should have pretty much the same data as in your case,
but you get there very significantly faster
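
To illustrate what doing the subtraction "per 64 bits" could look like, here
is a portable C emulation of a partitioned 16-bit subtract over one 64-bit
word. It is only a sketch: the name bias_minus_lanes and the lane layout are
assumptions, the direction (bias minus value) follows step 4 above, and real
VIS code would use a partitioned instruction such as fpsub16 rather than a
loop:

```c
#include <stdint.h>

/* 2048*(1<<4), truncated to a 16-bit lane */
#define BIAS ((uint16_t)(2048u << 4))

/* Emulation only: replace each of the four 16-bit lanes of x with
 * (BIAS - lane) modulo 2^16, so all 64 bits do useful work. */
static uint64_t bias_minus_lanes(uint64_t x)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t v = (uint16_t)(x >> (16 * lane));
        uint16_t d = (uint16_t)(BIAS - v);
        r |= (uint64_t)d << (16 * lane);
    }
    return r;
}
```

The point is that one such operation covers four coefficients, whereas the
version with zero padding between the values only covers two.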

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Good people do not need laws to tell them to act responsibly, while bad
people will find a way around the laws. -- Plato