[Ffmpeg-devel] [PATCH] simple_idct_armv5te optimization
Tue Oct 3 21:59:57 CEST 2006
On 9/30/06, Måns Rullgård <mru at inprovide.com> wrote:
> I'll have a look at it when I get time. Unfortunately, that will
> probably not be within the next few days.
OK, that's not a problem, I can wait. Anyway, I have released a new
build of MPlayer for Nokia 770 with this patch:
I have received only positive feedback and no bug reports so far, but
maybe it is too early :)
As for the patch, its real benefit when decoding video is hard to
notice, because in real video files idct rows are almost empty (in 80%
of cases in my tests on relatively low bitrate files), so the
optimized parts of the code with MAC instructions rarely get into action.
It looks like the 80/20 rule in action and an excellent example of a
'useless' optimization :) But if the column processing code is
optimized in a very similar way, a real improvement should be visible
not only in dct-test, but in real video files too.

As for code size, it looks like the mpeg4 decoder fits in 32K or even
16K of instruction cache fine anyway (tested in valgrind using the
cachegrind tool). Surely code density on x86 and ARM may be different,
but it is still a good approximation. Just for comparison, I tested
h264 decoding and it really does run out of instruction cache. Anyway,
inlining can easily be kept reasonable, so that the code does not grow
much and function call overhead is not so noticeable (it is 6 cycles
for a call/ret pair).

One more issue is data cache misses: they can probably affect
performance (or maybe not, depending on the context in which
simple_idct gets called). ARM926EJ-S does not support the PLD
instruction, but the technical reference manual from the ARM site
mentions that 'Allocate on read-miss is supported. The caches perform
critical-word first cache refilling'. So the first word requested on a
cache miss becomes available early, before the rest of the cache line
gets loaded.
This can be used in the same row processing in idct: we can read the
first word of the row and start doing some calculations, for example
we can assume that all the other row values are zero and calculate the
result early. After the rest of the cache line gets loaded and we get
the remaining row values, we can check whether all the other row
values are really zero and either save the precalculated result or
discard it :) I tried to run such a test and it really improves
performance on cache misses. I also have other ideas for optimization,
so it seems to me that this code can still be improved quite a lot.
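
To make the speculate-then-check idea concrete, here is a minimal
sketch in C. It is not the code from the patch. The function names,
the W1..W7, ROW_SHIFT and DC_SHIFT values (modeled on what I believe
the C simple_idct uses) and the fallback row transform are all
assumptions for illustration. C also cannot guarantee that the early
calculation overlaps with the cache line fill, so a real version would
be hand-scheduled assembly. The sketch only shows the control flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fixed-point constants, for illustration only. */
#define W1 22725
#define W2 21407
#define W3 19266
#define W4 16383
#define W5 12873
#define W6 8867
#define W7 4520
#define ROW_SHIFT 11
#define DC_SHIFT 3

/* Hypothetical full row transform (the slow path with many MACs).
 * Relies on arithmetic right shift of negative ints, as common
 * compilers provide. */
static void full_row_idct(int16_t row[8])
{
    int a0 = W4 * row[0] + (1 << (ROW_SHIFT - 1));
    int a1 = a0 + W6 * row[2] - W4 * row[4] - W2 * row[6];
    int a2 = a0 - W6 * row[2] - W4 * row[4] + W2 * row[6];
    int a3 = a0 - W2 * row[2] + W4 * row[4] - W6 * row[6];
    int b0 = W1 * row[1] + W3 * row[3] + W5 * row[5] + W7 * row[7];
    int b1 = W3 * row[1] - W7 * row[3] - W1 * row[5] - W5 * row[7];
    int b2 = W5 * row[1] - W1 * row[3] + W7 * row[5] + W3 * row[7];
    int b3 = W7 * row[1] - W5 * row[3] + W3 * row[5] - W1 * row[7];
    a0 += W2 * row[2] + W4 * row[4] + W6 * row[6];

    row[0] = (int16_t)((a0 + b0) >> ROW_SHIFT);
    row[7] = (int16_t)((a0 - b0) >> ROW_SHIFT);
    row[1] = (int16_t)((a1 + b1) >> ROW_SHIFT);
    row[6] = (int16_t)((a1 - b1) >> ROW_SHIFT);
    row[2] = (int16_t)((a2 + b2) >> ROW_SHIFT);
    row[5] = (int16_t)((a2 - b2) >> ROW_SHIFT);
    row[3] = (int16_t)((a3 + b3) >> ROW_SHIFT);
    row[4] = (int16_t)((a3 - b3) >> ROW_SHIFT);
}

/* Returns 1 if the speculative result was kept, 0 if it was discarded. */
static int idct_row_speculative(int16_t row[8])
{
    assert(row != NULL);

    /* Step 1: use only the first value, which is the one that arrives
     * first on a cache miss, and assume the rest of the row is zero. */
    int16_t dc = (int16_t)(row[0] * (1 << DC_SHIFT));

    /* Step 2: by now the rest of the cache line should be available,
     * so check whether the assumption was right. */
    int rest = row[1] | row[2] | row[3] | row[4] |
               row[5] | row[6] | row[7];
    if (rest == 0) {
        for (int i = 0; i < 8; i++)
            row[i] = dc;          /* keep the precalculated result */
        return 1;
    }

    full_row_idct(row);           /* discard it and do the full work */
    return 0;
}
```

Calling idct_row_speculative() on each of the 8 rows of a block would
take the cheap path for rows where only the first value is nonzero,
and fall back to full_row_idct() for the others.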
Well, the only question that remains is whether I should do further
optimizations of this simple_idct_armv5te.S code, or whether you would
like to do it yourself once you get some free time. I would not want
to step on your toes and can switch to optimizing something else :)