[FFmpeg-devel] [PATCH] Indeo5 decoder
Maxim
max_pole
Mon May 18 12:39:37 CEST 2009
Hello,
> [...]
>>>>>>> let me clarify my question, what is gained by merging a multiply and shift
>>>>>>> into the table?
>>>>>>> is it faster? if so then by how much?
>>>>>>>
>>>>>>>
>>>>>>>
>>>> I did some research on that! Here are answers to your questions:
>>>>
>>>> Question: Is it faster? if so then how much?
>>>>
>>>> Yes, it's faster. I measured calc "time" using START/STOP_TIMER macros. I
>>>> did two tests on two different videos: one containing mostly light
>>>> colors (DPS190indeo.avi) and another containing mostly dark colors
>>>> (haegemonia.avi). The reason for this choice was that the light colors
>>>> require higher scalefactors to be used and therefore a multiply by a
>>>> higher number.
>>>> First test measured dezicycles consumed by the inverse quantization
>>>> using TABLE lookup/MUL. It was done on my x86 laptop equipped with an
>>>> Intel Core Duo processor at 2 GHz. Here are the raw numbers:
>>>>
>>>>
>>> could you show me the used code?
>>> I am interested to see how you did the MUL
>>>
>>>
>> in the "decode_block":
>>
>> START_TIMER;
>>
>> q = (base_tab[pos] * scale_tab[quant]) >> 8;
>>
>
>
>> q = (q) ? q : 1;
>>
>> if (q != 1 && val) {
>>
>
> can val even be 0 here?
>
Normally no, because all zeros will be eliminated by the RLE coding...
I removed the check but I'll test it on scalable videos later...
> and what is this odd q stuff doing
>
"q" is the scale factor by which the transform coefficients are multiplied
to obtain the dequantized values. Slant transform coefficients are
quantized non-uniformly, i.e. the DC coeff has a smaller scale factor
than the AC ones...
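To make the two variants under discussion concrete, here is a minimal sketch of the scale factor computation. It is not the code from the patch: the names `scale_mul` and `build_scale_table`, the table sizes `NUM_POS` and `NUM_QUANT`, and the element types are assumptions chosen for illustration only. Only the expression `(base_tab[pos] * scale_tab[quant]) >> 8` and the "q must be at least 1" rule come from this thread.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_POS   64  /* assumption: coefficients per block */
#define NUM_QUANT 24  /* hypothetical number of quantizer levels */

/* MUL variant: one multiply and one shift per coefficient. */
static inline int scale_mul(const uint16_t *base_tab, const uint8_t *scale_tab,
                            int pos, int quant)
{
    assert(pos >= 0 && pos < NUM_POS);
    assert(quant >= 0 && quant < NUM_QUANT);

    int q = (base_tab[pos] * scale_tab[quant]) >> 8;
    if (!q)
        q = 1;  /* a scale factor of 0 would wipe out the coefficient */
    return q;
}

/* TABLE variant: precompute every (quant, pos) combination once, so the
 * decoder only does a lookup per coefficient. The price is a table of
 * NUM_QUANT * NUM_POS entries that competes for cache space. */
static void build_scale_table(uint16_t table[NUM_QUANT][NUM_POS],
                              const uint16_t *base_tab,
                              const uint8_t *scale_tab)
{
    for (int quant = 0; quant < NUM_QUANT; quant++)
        for (int pos = 0; pos < NUM_POS; pos++)
            table[quant][pos] = (uint16_t)scale_mul(base_tab, scale_tab,
                                                    pos, quant);
}
```

Both variants yield the same q for every (quant, pos) pair, so the choice between them is purely a speed and memory trade-off.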
> if(!q) q=1
> or
> q += !q
> or
> if(q) {
>
> seem better to me
>
I chose "q += !q" to ensure that q >= 1. Thanks!
> also the speed check must be over the whole decode frame function
> because even if decode_block is faster the tables can flush other
> things out of the cache and cause an overall speed loss
>
Here are the new speed measurements around "decode_frame" (x86, Intel
Core Duo, 2 GHz):
DPS190indeo.avi - TABLE
-------------------------------------------------------
130191900 dezicycles in decode_planes, 1 runs, 0 skips
220224150 dezicycles in decode_planes, 2 runs, 0 skips
161996025 dezicycles in decode_planes, 4 runs, 0 skips
131639512 dezicycles in decode_planes, 8 runs, 0 skips
133566046 dezicycles in decode_planes, 16 runs, 0 skips
143302050 dezicycles in decode_planes, 32 runs, 0 skips
150954344 dezicycles in decode_planes, 64 runs, 0 skips
150563808 dezicycles in decode_planes, 128 runs, 0 skips
147217229 dezicycles in decode_planes, 256 runs, 0 skips
DPS190indeo.avi - MUL
-------------------------------------------------------
122467050 dezicycles in decode_planes, 1 runs, 0 skips
147113400 dezicycles in decode_planes, 2 runs, 0 skips
136291612 dezicycles in decode_planes, 4 runs, 0 skips
158392706 dezicycles in decode_planes, 8 runs, 0 skips
148443628 dezicycles in decode_planes, 16 runs, 0 skips
147498121 dezicycles in decode_planes, 32 runs, 0 skips
146668821 dezicycles in decode_planes, 64 runs, 0 skips
143953437 dezicycles in decode_planes, 128 runs, 0 skips
144747379 dezicycles in decode_planes, 256 runs, 0 skips
haegemonia.avi - TABLE
-------------------------------------------------------
1 = 272769600 dezicycles in decode_planes, 1 runs, 0 skips
2 = 325494900 dezicycles in decode_planes, 2 runs, 0 skips
3 = 349585312 dezicycles in decode_planes, 4 runs, 0 skips
4 = 295016437 dezicycles in decode_planes, 8 runs, 0 skips
5 = 290776912 dezicycles in decode_planes, 16 runs, 0 skips
6 = 308271351 dezicycles in decode_planes, 32 runs, 0 skips
7 = 379863679 dezicycles in decode_planes, 64 runs, 0 skips
8 = 404734541 dezicycles in decode_planes, 128 runs, 0 skips
9 = 416773314 dezicycles in decode_planes, 256 runs, 0 skips
10= 416313615 dezicycles in decode_planes, 512 runs, 0 skips
11= 415796653 dezicycles in decode_planes, 1024 runs, 0 skips
haegemonia.avi - MUL
---------------------------------------------------------
1 = 274425600 dezicycles in decode_planes, 1 runs, 0 skips
2 = 360142275 dezicycles in decode_planes, 2 runs, 0 skips
3 = 459152625 dezicycles in decode_planes, 4 runs, 0 skips
4 = 341867775 dezicycles in decode_planes, 8 runs, 0 skips
5 = 314274796 dezicycles in decode_planes, 16 runs, 0 skips
6 = 340190240 dezicycles in decode_planes, 32 runs, 0 skips
7 = 385613568 dezicycles in decode_planes, 64 runs, 0 skips
8 = 407847684 dezicycles in decode_planes, 128 runs, 0 skips
9 = 417292871 dezicycles in decode_planes, 256 runs, 0 skips
10= 415117064 dezicycles in decode_planes, 512 runs, 0 skips
11= 413800347 dezicycles in decode_planes, 1024 runs, 0 skips
rayman_bunnies.avi - TABLE
-----------------------------------------------------------
208000950 dezicycles in decode_planes, 1 runs, 0 skips
202094325 dezicycles in decode_planes, 2 runs, 0 skips
215491275 dezicycles in decode_planes, 4 runs, 0 skips
208255931 dezicycles in decode_planes, 8 runs, 0 skips
228715546 dezicycles in decode_planes, 16 runs, 0 skips
243207993 dezicycles in decode_planes, 32 runs, 0 skips
241559500 dezicycles in decode_planes, 64 runs, 0 skips
244833645 dezicycles in decode_planes, 128 runs, 0 skips
322727504 dezicycles in decode_planes, 256 runs, 0 skips
321311195 dezicycles in decode_planes, 512 runs, 0 skips
rayman_bunnies.avi - MUL
-----------------------------------------------------------
213306900 dezicycles in decode_planes, 1 runs, 0 skips
212299500 dezicycles in decode_planes, 2 runs, 0 skips
240144337 dezicycles in decode_planes, 4 runs, 0 skips
221974350 dezicycles in decode_planes, 8 runs, 0 skips
238906218 dezicycles in decode_planes, 16 runs, 0 skips
249726839 dezicycles in decode_planes, 32 runs, 0 skips
245828191 dezicycles in decode_planes, 64 runs, 0 skips
248050925 dezicycles in decode_planes, 128 runs, 0 skips
326313400 dezicycles in decode_planes, 256 runs, 0 skips
324575479 dezicycles in decode_planes, 512 runs, 0 skips
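A note on reading the logs above: the run counts (1, 2, 4, 8, ...) suggest that each row is a running average printed whenever the number of runs reaches a power of two, expressed in tenths of a cycle. The sketch below models that behaviour under this assumption. `RunTimer`, `run_timer_add` and `run_timer_dezicycles` are hypothetical names, not the FFmpeg macros, and the handling of skipped runs is left out.

```c
#include <stdint.h>

/* Simplified stand-in for an accumulating cycle timer. */
typedef struct RunTimer {
    uint64_t sum;    /* total cycles over all runs */
    unsigned count;  /* number of runs recorded so far */
} RunTimer;

/* Record one measurement. Returns 1 when the run count has just
 * reached a power of two, i.e. when a log row would be printed. */
static int run_timer_add(RunTimer *t, uint64_t cycles)
{
    t->sum += cycles;
    t->count++;
    return (t->count & (t->count - 1)) == 0;
}

/* Average cost per run in tenths of a cycle ("dezicycles"). */
static uint64_t run_timer_dezicycles(const RunTimer *t)
{
    return t->count ? t->sum * 10 / t->count : 0;
}
```

If this reading is right, the rows with the largest run counts are the most representative ones, since early rows average over only a few frames.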
For the first two videos the code using "MUL" is faster than the
"TABLE" one, while the last video takes longer to decode with "MUL". The
reason may be a lot of multiplies by high numbers...
All these tests were run several times. The numbers vary, but the
differences stay the same...
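To put the differences in perspective, the following sketch computes the relative cost of MUL versus TABLE from the last row of each log above (the row with the most runs). The helper name `mul_vs_table_percent` and the choice of the last row as the representative value are assumptions made for this illustration.

```c
#include <stdio.h>

/* Relative cost of MUL compared to TABLE, in percent.
 * Negative means MUL is faster, positive means MUL is slower. */
static double mul_vs_table_percent(double table_dezicycles,
                                   double mul_dezicycles)
{
    return 100.0 * (mul_dezicycles - table_dezicycles) / table_dezicycles;
}

/* Print one line per test video, using the final row of each log. */
static void print_summary(void)
{
    static const struct {
        const char *name;
        double table, mul;
    } runs[] = {
        { "DPS190indeo.avi",    147217229.0, 144747379.0 },
        { "haegemonia.avi",     415796653.0, 413800347.0 },
        { "rayman_bunnies.avi", 321311195.0, 324575479.0 },
    };

    for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%-20s %+.2f%%\n", runs[i].name,
               mul_vs_table_percent(runs[i].table, runs[i].mul));
}
```

With these inputs MUL comes out about 1.7% and 0.5% faster on the first two videos and about 1% slower on the third, so the gap between the two variants is small in either direction.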
Should I throw the table generation/decoding code away?
I have no idea how it performs on other platforms/processors...
Regards
Maxim