[Ffmpeg-devel] Snow slicing support

Tue Apr 11 09:57:04 CEST 2006

hi,

Oded Shimon wrote:
> On Thu, Apr 06, 2006 at 05:19:57PM +0200, Michael Niedermayer wrote:
>>On Mon, Apr 03, 2006 at 09:47:58PM +0300, Oded Shimon wrote:
>>>Just thought this patch might be of general interest to anyone. What I 
>>>find interesting about it, is that it's not the sliced output that helps 
>>>at all, but the rearranging of how the data is handled, of unpacking 
>>>coeffs separately from decoding image. It is actually a surprisingly huge 
>>>difference on my cpu, almost 20% faster in some cases. This code trades 
>>>off code switches against data switches, and even in my high res video 
>>>(944x544), code switches proved to be far more expensive...
>>>
>>>I don't really expect this patch to go in CVS, but I am interested in any 
>>>comments if anyone has any...
>>
>>this needs testing with different resolutions, bitrates and cpus
>>(320x240 720x576 p4 athlon ...)
>>
>>is this speed difference also there with other gcc versions?
>>and, most interesting, is it there too at lower -O levels?
>>
>>if it's consistently faster (or at least not slower) then this should be
>>applied
> 
> 
> Do you have any suggestions for how to test this efficiently? cache 
> performance is hard to benchmark, especially in high level code. :/
> 
> using mplayer -benchmark several times gave me wild results:
> 
> without patch:
> BENCHMARKs: VC: 108.872s VO:  17.123s A:   1.205s Sys:  32.865s =  160.065s
> BENCHMARKs: VC: 102.149s VO:  15.351s A:   1.198s Sys:  33.220s =  151.918s
> BENCHMARKs: VC:  99.299s VO:  15.920s A:   1.517s Sys:  34.233s =  150.970s
> BENCHMARKs: VC: 101.674s VO:  16.263s A:   1.284s Sys:  32.215s =  151.436s
> 
> with patch:
> BENCHMARKs: VC:  97.398s VO:  15.675s A:   1.299s Sys:  36.363s =  150.734s
> BENCHMARKs: VC:  95.429s VO:  15.321s A:   1.174s Sys:  38.613s =  150.536s
> BENCHMARKs: VC:  96.610s VO:  15.528s A:   1.181s Sys:  37.275s =  150.594s
> BENCHMARKs: VC:  95.816s VO:  15.297s A:   1.197s Sys:  38.248s =  150.558s
> 
> (these are old benchmarks, and on that single file)
> 
> In this case the difference was still obvious, but the results are very 
> inaccurate. is there a better way for this? maybe START_TIMER around the 
> whole decode() function?

That's what I'd do. Since measuring the whole decode() function is likely
to be inaccurate due to interruptions, I suggest you run it a number of
times (let's say, 100 times) and take both the mean value and the
shortest value.
This should take care of the measurement fuzziness.
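
A rough sketch of what I mean, not tied to the actual snow code: the
names bench_repeat, bench_result, now_ns and fake_decode are made up for
illustration, and fake_decode is only a stand-in for the real decode()
call. It assumes a POSIX system with clock_gettime(); inside libavcodec
one would more likely use the existing timer macros instead.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Summary of repeated timings, in nanoseconds. */
typedef struct {
    uint64_t min_ns;   /* shortest run: the one least disturbed by interruptions */
    double   mean_ns;  /* average over all runs */
} bench_result;

/* Read a monotonic clock in nanoseconds (POSIX clock_gettime). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Call fn(arg) `runs` times; record the mean and the shortest duration. */
static bench_result bench_repeat(void (*fn)(void *), void *arg, int runs)
{
    bench_result r = { UINT64_MAX, 0.0 };
    uint64_t total = 0;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now_ns();
        fn(arg);
        uint64_t dt = now_ns() - t0;

        total += dt;
        if (dt < r.min_ns)
            r.min_ns = dt;
    }
    r.mean_ns = (double)total / runs;
    return r;
}

/* Hypothetical stand-in for the decode() call being measured. */
static void fake_decode(void *arg)
{
    volatile unsigned *sink = arg;  /* volatile keeps the loop from being optimized away */

    for (unsigned i = 0; i < 100000; i++)
        *sink += i;
}
```

Calling bench_repeat(fake_decode, &sink, 100) with an unsigned sink gives
both numbers. The minimum is the more stable one to compare between the
patched and unpatched builds, and a mean far above the minimum tells you
how much the runs were disturbed.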

Guillaume