[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #2

Thu Aug 23 12:01:34 CEST 2007

Hi!

On Thursday 23 August 2007 at 03:01, Michael Niedermayer wrote:
> > Well, you can't move between floating point and integer registers. So
> > there would be some additional storing to memory, reading from memory,
> > some masking is still needed, then the shift - all in all it's the same
> > speed or slower than 4 adds. As I already said, I don't really like
> > that, because of the marginal speedup and the added complexity.
>
> I don't see where you would need masking, the msbs should be 0
> additionally, there is block_last_index which allows you to skip
> 90% of the coeffs (as it tells you after what point all coeffs are 0)

Why would the msbs of the four 16 bit numbers be 0? They could contain -1, 
for example.
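
To make the -1 case concrete, here is a minimal sketch in plain C (not the VIS code from the patch). It packs two signed 16 bit coefficients into one 32 bit word and shifts the whole word left by 4. The helper names pack2, shl4_naive and shl4_masked are made up for this illustration. With -1 in the low half, the top bits of that half are all ones, and the naive shift pushes them into the high half; masking first avoids that.

```c
#include <stdint.h>

/* Two signed 16-bit coefficients side by side in one 32-bit word. */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Naive: shift the whole word; the top 4 bits of the low half
 * spill into the bottom 4 bits of the high half. */
static uint32_t shl4_naive(uint32_t w)
{
    return w << 4;
}

/* Masked: clear the top 4 bits of each half first, so nothing
 * crosses from one half into the other. */
static uint32_t shl4_masked(uint32_t w)
{
    return (w & 0x0fff0fffu) << 4;
}
```

For hi = 3 and lo = -1 the packed word is 0x0003ffff. The naive shift gives 0x003ffff0, so the high half reads 0x003f instead of 0x0030, while the masked shift gives 0x0030fff0.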

> still the loads are 32bit
> so let's check if I finally figured out how sparc asm works
> 1. you load everything by using 32bit loads into the low and high
>    halves of 64bit registers
> 2. you duplicate the input and mask on each side half the 16bit values
>    away
> 3. you use fpackfix to shift half the input left by 4bit and pack the 2
>    16bit values which are separated by 16 zero bits into a 32bit register
> 4. you subtract the other half from 2048
> 5. you do the same fpackfix on the second half
> ...
>
> ok lets see
> 1. 8 instructions are useless, you can use 64bit loads
> 2. all 16 instructions are unneeded
> 3+5 (16 instructions) are unneeded, you can quickly shift the coeffs up to
>     block_last_index by using C code
> 4. the subtract is done using 32bit effectively (half of the registers
>    are 0 aka unused)

What hurts me the most is that you don't see the beauty of the fpackfix 
mess. ;) I thought about it for a day before I came up with it.

1. with 64bit loads they would be in the wrong order
2. I don't see why it would not be a problem that I shift in all ones when 
shifting left (as would be the case if there is a -1 in the lower 16 bits of 
the 32 bit register). Also, I have to mask the high 16 bits when fpackfixing 
the low 16, because fpackfix is clever and would clamp the value to 32767, 
which is not what I want.
3+5 I don't think it would be that much faster
4. that is true, but there was no better way
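
The clamping mentioned in point 2 can be sketched with a simplified C model of one 32 bit lane of a saturating fixed-point pack, loosely following the documented behaviour of fpackfix: shift left by a scale factor, keep the upper 16 bits, clamp to the signed 16 bit range. The function name sat_pack_lane, the scale of 4 and the lane values are assumptions chosen for illustration; they are not taken from the patch.

```c
#include <stdint.h>

/* Simplified model of one lane: scale, keep bits 31..16, saturate.
 * This is an illustrative model, not the hardware definition. */
static int16_t sat_pack_lane(int32_t lane, unsigned scale)
{
    /* Widen first so the left shift cannot lose bits. */
    int64_t t = (int64_t)lane * ((int64_t)1 << scale);

    /* Floor division by 2^16 (portable replacement for an
     * arithmetic right shift of a possibly negative value). */
    int64_t q = t / 65536;
    if (t % 65536 < 0)
        q--;

    if (q > 32767)
        return 32767;    /* clamp instead of wrapping */
    if (q < -32768)
        return -32768;
    return (int16_t)q;
}
```

In this model a clean lane such as 0x00030000 with scale 4 yields 48, but a lane with stray high bits such as 0x08030000 yields 32767 rather than a wrapped value, which is why unwanted bits have to be masked away before packing.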

> so again
> fix the permutation
> shift left by 4 bits using C code or asm, both stopping after
> block_last_index
> do the 2048*(1<<4) subtraction if needed per 64bit
>
> at this point you should have pretty much the same data as in your case
> but very significantly faster
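
The "shift up to block_last_index in C" idea quoted above could look roughly like the sketch below. It is only an illustration under stated assumptions: prescale_block and its scan parameter (a table mapping scan position to position in the block) are hypothetical names, and the 2048*(1<<4) subtraction is not modelled.

```c
#include <stdint.h>

/* Scale only the coefficients up to and including last_index (in scan
 * order) by 16, i.e. a left shift by 4 bits. Everything after
 * last_index is known to be zero and is skipped. */
static void prescale_block(int16_t *block, const uint8_t *scan,
                           int last_index)
{
    for (int i = 0; i <= last_index; i++) {
        int pos = scan[i];
        /* Multiply instead of << so negative coefficients are well
         * defined; assumes the result still fits in 16 bits. */
        block[pos] = (int16_t)(block[pos] * 16);
    }
}
```

With a block where only the first few scan positions are non-zero, the loop touches just those entries and leaves the rest of the block alone.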

Okay, so we spent some hours on the problem, and what we came up with is a 
ca. 5% speedup (ca. 2% overall) and longer code (I still think what I had 
is kind of elegant). I don't think it's a very significant speedup, in the 
sense that what wasn't playable before is still not playable (e.g. 720p 
HDTV). Also, as the idct is rather inaccurate, it won't be used by default, 
so not many people would even be using it, and I think optimizing it even 
more is somewhat wasted effort. So, to tell the truth, I am not overly 
enthusiastic about the new solution.

> [...]

bye
Denes

PS: it has to be said that the row/column transpose you wrote in your next 
email is nice. It should work, and somehow it didn't occur to me that it is 
possible to do it that way.