[FFmpeg-devel] idct/fdct optimizations on arm

Pavel Pavlov
Tue Mar 9 05:45:51 CET 2010


> > I read doc/optimization.txt and I have a question.
> > I'm using the h263 video encoder and swscale to stretch images, and
> > I'm kind of low on CPU power (Windows Mobile, ARM11 CPU). I have
> > some custom code to use hardware acceleration for idct/fdct. I see
> > there are optimized versions for ARM in libavcodec/arm/, so I just
> > wanted to ask 1) if I can get noticeable improvements in the video
> > encoder with more optimized idct code and 2) if it's easy to do (to
> > hook up custom functions for that) :)
> 
> In encoding more time is usually spent in motion search than in
> (i)dct, but optimising the transforms is still worthwhile.
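
On the hooking part: from a quick look at dsputil, the encoder seems to reach its transforms through function pointers in DSPContext, so a local patch could swap in hardware versions once the defaults are set. A minimal sketch, assuming the current dsputil layout (my_hw_fdct and my_hw_idct are hypothetical wrappers around the hardware calls):

    #include "dsputil.h"  /* private libavcodec header, so this lives in-tree */

    /* Hypothetical wrappers around the hardware transform; like the
     * existing implementations, each processes one 8x8 block of DCTELEM
     * (int16_t) coefficients in place. */
    void my_hw_fdct(DCTELEM *block);
    void my_hw_idct(DCTELEM *block);

    /* Call from (or patch into) dsputil_init(), before the
     * idct_permutation table is built from idct_permutation_type. */
    static void hook_hw_transforms(DSPContext *c)
    {
        c->fdct = my_hw_fdct;                       /* used by the encoder */
        c->idct = my_hw_idct;
        c->idct_permutation_type = FF_NO_IDCT_PERM; /* natural coefficient order */
        /* idct_put/idct_add would need matching wrappers too, since block
         * reconstruction goes through those entry points. */
    }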

Motion search: what kind of functions are these? I see motion estimation mentioned in doc/optimization.txt: pix_abs16x16%%, pix_abs8x8%%?
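
As far as I can tell, those are block comparison metrics: for every candidate motion vector, the search computes a sum of absolute differences (SAD) between the current block and the shifted reference block, so they get called enormously often. A plain-C sketch of what pix_abs16x16 boils down to (simplified; if I read dsputil right, the real prototype also takes a context pointer and a block height):

    #include <stdint.h>
    #include <stdlib.h>

    /* Sum of absolute differences between a 16x16 block of the current
     * frame and a candidate block in the reference frame. */
    static int sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        int x, y, sum = 0;
        for (y = 0; y < 16; y++) {
            for (x = 0; x < 16; x++)
                sum += abs(cur[x] - ref[x]);
            cur += stride;
            ref += stride;
        }
        return sum;
    }

Something like this runs for every candidate position, which would explain why these little loops dominate encode time and are among the first things the ARM asm optimizes.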

> 
> What hardware do you have with accelerated dct?

I think most mobiles have some sort of HW for that. Maybe the API isn't publicly available, but how else does my prehistoric SE phone do better video than my little app that encodes video with ffmpeg's h263 encoder and its asm-optimized code?
A quite popular chip: http://ati.amd.com/kr/products/imageon100/imageon100.pdf It claims accelerated support for MPEG-4/JPEG decoding (a successor of this chip sits in my HTC phone). Obviously it doesn't take a file and blit it to the screen; most likely it provides some sort of functions that are common in media processing. I don't know whether a public API is available for it, though.

Once I worked on a SonyEricsson P1i and needed access to the front camera on that phone. That API isn't public, but the work was for SE, so obviously I was able to get the API from them. Still, it was as if that API didn't exist anywhere in the universe: they had to contact multiple departments in multiple countries, and about a month later (if not more than a month!) I received an email from some guy in Japan with instructions and the needed code.


> 
> > Also, it seems that my code suffers the most from swscale (I timed
> > execution times and, surprisingly, stretching an image takes more
> > time than h263 encoding) and there is no optimized version for ARM
> > in swscale.
> 
> I'm not at all surprised by that.  Scaling is very CPU intensive.

Yes, it is. I'm not that good at ARM asm, but I don't see where there could be big improvements when coding for CPUs prior to Cortex (Cortex seems to have some sort of SIMD support, judging from the avcodec code :). For ARMv5E, maybe some DSP instructions could do multiple multiply-adds (plus saturation). I haven't even looked at the swscale code to see what's going on; it just seems to me that scaling has to do this kind of math on the pixels.

Actually, the slowest and biggest hit for me was colorspace conversion from RGB565 to YUV. It's incredibly slow and CPU/memory intensive (it cannot be done in place, and for every pixel there is quite complex math involved), and conversion tables aren't very helpful either.

Usually these HW chips support YUV in some form and are capable of stretching hardware surfaces. Those capabilities should be exposed through the DirectDraw interfaces in Windows Mobile if proper drivers are available (e.g., a DDraw surface can be loaded with YUV data, and the Blt function does arbitrary stretching).
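
To make the per-pixel cost concrete, here is roughly what one RGB565 to YCbCr conversion involves, using the common integer-scaled BT.601 coefficients (illustrative code, not swscale's actual implementation):

    #include <stdint.h>

    /* One pixel: a bitfield unpack plus three multiply-accumulate rows. */
    static void rgb565_to_ycbcr(uint16_t p, uint8_t *y, uint8_t *cb, uint8_t *cr)
    {
        /* Unpack 5:6:5 and expand to 8 bits by bit replication. */
        int r = (p >> 11) & 0x1f;  r = (r << 3) | (r >> 2);
        int g = (p >>  5) & 0x3f;  g = (g << 2) | (g >> 4);
        int b =  p        & 0x1f;  b = (b << 3) | (b >> 2);

        *y  = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16);
        *cb = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128);
        *cr = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128);
    }

Those three coefficient rows are exactly the pattern the ARMv5TE multiply-accumulate instructions (smlabb and friends) are meant for, so hand-written asm there could plausibly help even before Cortex/NEON.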


