[FFmpeg-devel] [PATCH] SPARC VIS simple_idct

Balatoni Denes dbalatoni
Fri Aug 24 20:55:55 CEST 2007


Hi Michael!

So yes, I decided that I don't want all my work so far to be abandoned.
The chance of somebody picking up the patch and resubmitting it is 
zero, there are very few sparc desktop users, and even less developers 
obiously - maybe it's because the desktop SPARC CPUs are crap and expensive 
too.

I gave a deep thought to each of your sugestions:

Friday 24 August 2007 01:07-kor Michael Niedermayer ezt ?rta:
> > > ok ill accept the fpadd16 for now if you move them after the for /fcmpd
> > > as 
> > > they arent needed before and after that they would often not be
> > > executed :) 
> > >
> > 
> > That would be right in the middle of the idct4row macro. So there would be 
> > insane amounts of code duplication - well unless, I cut up idct4row by 
> > columns. If this was the last show stopper, I would do it, though ;)
> 
> all you have to do is to put a ADDBLAH macro after the branches in the
> idct4row
> then define it appropriately before each idct4row

I might have misunderstood you, but are you saying to shift only the two 
registers that are used in each column  at the start of every column 
calculation? I think it would make the code slower, as these VIS instructions 
have a 4 clock latency (at least), so the cpu would be idling for two clocks 
after each fpadd16. Also I really don't want to butcher the code, intruducing 
more and more nested macros, until everything is a mess. So for these reasons 
I didn't implement this part.

> > > anyway, please try to negate the high 8 bit or the coeffs and
> > > try
> > > fmul8sucks16
> > > fmul8ulcks16
> > > fpsub16
> > >
> > > maybe this reduces the rounding errors and its the same amount
> > > of code ...
> > 
> > I see what you mean - and a good idea, which didn't occur to me - but I am 
> > skeptical. The problem is, as I see it, is because of the two 
> > instructions,  
> > the error can be 1 (vs. 0.5 if it was one instruction with rounding). Now 
> > playing with the signs doesn't change this unfortunatelly.
> 
> its an idea, testing takes 5minutes for someone with a sparc, its just a
> matter of *-1 the 8bits and running dct-test
> a theoretical analysis of the rounding behavior would take more time

here it is. It is not really better, so we should stick to the original 
algorithm IMHO.

-1056  1035  -606  -324  1266  1041  -228   700
  730    94  1094   597   393   279   640   368
  441   924    90   347   462   557   272   437
 1061   311   627   496   449   385   441   447
  369   726   324   476   519   490   385   449
  555   404   471   460   522   493   514   477
  192   551   254   395   418   482   314   467
  287   498   338   378   379   468   373   391
IDCT SIMPLE-VIS: err_inf=1 err2=0.06218359 syserr=0.06330000 maxout=260 
blockSumErr=13
IDCT SIMPLE-VIS: 1431.0 kdct/s

> > > hmm what happens if you just round the coeffs down to 8bit? / throw the
> > > half the multiplies out, the code as it is is inaccurate anyway due to
> > > sparc misdesign maybe the loss from 8bit coeffs isnt that big ...
> > 
> > The current situtation is not that bad (of course the idct is switched of
> > by  
> > default, but for just viewing a movie it should be mostly adequate). 8bit 
> > coeffs imho would make the situation very bad though.
> 
> again its a matter of <5min to test this (hey just add +128>>8 to the
> coeffs) 
> you needed more time to write this mail

results don't look encouraging - it's worse in some respect than mlib, and 
still slower

  944  -817  1855  1574  -711   732  -433  -343
-1018  -514  1656    10  1197   561   978   553
 1747  1428  -299   690  -102   617   341   284
 1518     1   814   256   523   392   639   585
 -492  1336    -3   687   173   566   476   487
  577   544   646   494   463   770   459   587
 -584  1083   273   746   204   486   449   499
  -84   813   301   518   429   666   399   538
IDCT SIMPLE-VIS: err_inf=4 err2=1.00296406 syserr=0.09275000 maxout=262 
blockSumErr=80
IDCT SIMPLE-VIS: 1248.8 kdct/s

Btw checking these two things was not 5 minutes, but two hours - the first 
problems appeared, when I saw that dct-test didn't recognize TRANSPOSE_PERM.

> > > it seems that the sparc has twice as many registers as what would be
> > > needed 
> > > to keep the coefficients in registers between teh rows and columns
> > > transforms that would avoid 32 instructions ...
> > 
> > They are kept in registers, nothing ever touches the coefficients.
> 
> STOREROWS writes them, LOADTRANS reads them, maybe i should have called
> them something else than coefficients but thats not really changing anything

Almost all registers are used, check the code. Either I don't understand what 
you are saying, or you are suggesting that I rewrite the whole stuff.

> > > the idct should not store the output in memory but leave it in registers
> > > the ff_simple_idct_put/add then should call the idct (or have it
> > > inlined) 
> > > and the clamping code should just work with the registers
> > > this avoids another 32 instructions
> > 
> > This could be done too. Although there would be some great code
> > duplication  
> > (unless one uses macros),

Because only half of the values are in registers, the other half has to be 
loaded. The net effect is that the XXX_pixels_clamped have to be mostly 
rewritten - that is renaming all the registers in some convoluted ways, or 
doing the same to IDCT4ROW and the other macros - all for about 1% speed up 
(cca. 0.3% overall best). I am sorry, but I am not going to do that, because 
I think it's just stupid - loads of work for no real benefit (on the 
contrary, the result is some quite convoluted messy code, with some weird 
register order).

bye
Denes




More information about the ffmpeg-devel mailing list