[FFmpeg-devel] [PATCH] SPARC VIS simple_idct

Fri Aug 24 23:59:00 CEST 2007

Hi

On Fri, Aug 24, 2007 at 08:55:55PM +0200, Balatoni Denes wrote:
> Hi Michael!
> 
> So yes, I decided that I don't want all my work so far to be abandoned.
> The chance of somebody picking up the patch and resubmitting it is 
> zero, there are very few sparc desktop users, and even less developers 
> obiously - maybe it's because the desktop SPARC CPUs are crap and expensive 
> too.
> 
> I gave a deep thought to each of your sugestions:
> 
> Friday 24 August 2007 01:07-kor Michael Niedermayer ezt ?rta:
> > > > ok ill accept the fpadd16 for now if you move them after the for /fcmpd
> > > > as 
> > > > they arent needed before and after that they would often not be
> > > > executed :) 
> > > >
> > > 
> > > That would be right in the middle of the idct4row macro. So there would be 
> > > insane amounts of code duplication - well unless, I cut up idct4row by 
> > > columns. If this was the last show stopper, I would do it, though ;)
> > 
> > all you have to do is to put a ADDBLAH macro after the branches in the
> > idct4row
> > then define it appropriately before each idct4row
> 
> I might have misunderstood you, but are you saying to shift only the two 
> registers that are used in each column  at the start of every column 
> calculation? I think it would make the code slower, as these VIS instructions 
> have a 4 clock latency (at least), so the cpu would be idling for two clocks 
> after each fpadd16. Also I really don't want to butcher the code, intruducing 
> more and more nested macros, until everything is a mess. So for these reasons 
> I didn't implement this part.

ok, but then you should move the for up so its not immedeatly before
a fcmpd using its result

> 
> > > > anyway, please try to negate the high 8 bit or the coeffs and
> > > > try
> > > > fmul8sucks16
> > > > fmul8ulcks16
> > > > fpsub16
> > > >
> > > > maybe this reduces the rounding errors and its the same amount
> > > > of code ...
> > > 
> > > I see what you mean - and a good idea, which didn't occur to me - but I am 
> > > skeptical. The problem is, as I see it, is because of the two 
> > > instructions,  
> > > the error can be 1 (vs. 0.5 if it was one instruction with rounding). Now 
> > > playing with the signs doesn't change this unfortunatelly.
> > 
> > its an idea, testing takes 5minutes for someone with a sparc, its just a
> > matter of *-1 the 8bits and running dct-test
> > a theoretical analysis of the rounding behavior would take more time
> 
> here it is. It is not really better, so we should stick to the original 
> algorithm IMHO.

ok

[...]
> > > > it seems that the sparc has twice as many registers as what would be
> > > > needed 
> > > > to keep the coefficients in registers between teh rows and columns
> > > > transforms that would avoid 32 instructions ...
> > > 
> > > They are kept in registers, nothing ever touches the coefficients.
> > 
> > STOREROWS writes them, LOADTRANS reads them, maybe i should have called
> > them something else than coefficients but thats not really changing anything
> 
> Almost all registers are used, check the code. Either I don't understand what 
> you are saying, or you are suggesting that I rewrite the whole stuff.

i dont make suggestions based on how much work it is to change the previous
submission just based on it being possible and providing an advantage

there are 32 64bit registers these should be enough to do the idct without
an intermediate store-load
the whole 8x8 block needs 16registers, 7 for the constant coefficients
that leaves 9 available

if you think that this patch will be accepted due to you whining how much
time you spend on it already then you live in some strange fantasy world

either you make the improvements (or argue why its not possible or wouldnt
make sense ...) or your patch wont be applied
also keep in mind i have as well spend considerable time on this already
(reading the sparc asm manuals reviewing your code, ...) and dont complain
hey i even have learned alot about sparcs

if i accept half optimal/working patches because the author has not enough
time or interrest to implement it properly then ffmpeg would be half broken
and running at half the speed it does now
its not the "its just 0.X% overall" your changes in the last 2 days made
your idct 40% faster or so (it was around 1000 and then 1400/sec IIRC)
if i would accept patches in general which
are 40% slower then they could be then ffmpeg as a whole would b 40%
slower then it could be ...

also about the code becoming messy, i am sure this can be avoided by
properly implementing it

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Opposition brings concord. Out of discord comes the fairest harmony.
-- Heraclitus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070824/69473afb/attachment.pgp>