[FFmpeg-devel] transcoding on nvidia tesla

christophelorenz christophelorenz
Sun Feb 10 23:12:23 CET 2008

Ian Caulfield wrote:

>On Feb 1, 2008 12:05 AM, Måns Rullgård <mans at mansr.com> wrote:
>>Is there any reason to believe that each of these threads has the
>>power of a full CPU at its disposal?
>Nowhere near - they're more like a funky SIMD - the threads are
>grouped into 16's, which follow the same execution path - if threads
>diverge, they have to be serialised. However, with careful
>programming, very good memory bandwidth can be achieved. Memory
>bandwidth on/off chip can be an issue though. I don't see 1000x
>speedups for video coding - 10x seems more likely, at the cost of a
>lot of development time.

Having done some GPU development, I can tell you that some things work
very well on a GPU and some work very badly...

Easy ones, with a -huge- performance increase:
Rescaling with various algorithms, color space conversions, basic
deblocking, denoising ...

Trickier, probably faster by a factor of 10, but only after quite some
optimisation and development time:
(i)Motion compensation, (i)DCT, wavelets ...

Useless, same speed or 10x slower (because conditional branching
cannot be avoided):
Byte stream parsing, sorting...

Total waste of time, 100x slower on the GPU (the GPU probably has to
emulate all the required bit functions, and the data imposes serial
operation, so no parallelisation is possible):
Bit stream parsing....
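
To illustrate the two extremes of the list above, here is a minimal C
sketch. Neither function comes from FFmpeg or any real codec: the names
rgb_to_luma and decode_unary are hypothetical, the luma weights are the
usual integer approximation of BT.601, and the "unary" code is a toy
stand-in for a real variable-length code. The point is only the shape of
the data dependencies.

```c
#include <stddef.h>
#include <stdint.h>

/* Easy case: every output pixel depends only on its own input pixel,
 * so each loop iteration could be handed to a separate GPU thread.
 * Weights 77/150/29 sum to 256 (integer BT.601 approximation, an
 * assumption for this sketch). */
void rgb_to_luma(const uint8_t *rgb, uint8_t *luma, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        const uint8_t *p = rgb + 3 * i;
        luma[i] = (uint8_t)((77 * p[0] + 150 * p[1] + 29 * p[2] + 128) >> 8);
    }
}

/* Hard case: a toy variable-length code where each symbol is a run of
 * 0 bits closed by a 1 bit, and the symbol value is the run length.
 * The start of symbol k is only known once symbols 0..k-1 have been
 * decoded, so the loop is inherently serial.
 * Returns the number of complete symbols written to out. */
size_t decode_unary(const uint8_t *buf, size_t n_bits,
                    unsigned *out, size_t max_syms)
{
    size_t pos = 0, count = 0;
    unsigned run = 0;

    while (pos < n_bits && count < max_syms) {
        int bit = (buf[pos >> 3] >> (7 - (pos & 7))) & 1; /* MSB first */
        pos++;
        if (bit) {
            out[count++] = run;
            run = 0;
        } else {
            run++;
        }
    }
    return count;
}
```

In rgb_to_luma the iterations can run in any order; in decode_unary the
value of pos at each step depends on every bit read before it, which is
what leaves the GPU threads with nothing to do in parallel.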

Operations that need branching are very hard to make fast.
In most cases it is faster to process both branches and make a
conditional assignment afterwards (where possible).
It is not that GPUs are that bad at branching (threads are grouped by
2x8, so a divergence "only" serialises 8 threads), but rather that CPUs
have become exceedingly good at it.
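
The "process both branches, then assign" idea can be sketched in plain C
as follows. This is only an illustration under my own assumptions: the
toy edge filter and the names filter_branch and filter_select are
hypothetical, not taken from any codec, and a real GPU compiler may well
do this transformation (predication) by itself.

```c
#include <stdint.h>
#include <stdlib.h>

/* Branching form: smooth the pixel only if the step across the edge
 * is small, otherwise keep it. Neighbouring threads that disagree on
 * the condition take different paths. */
uint8_t filter_branch(uint8_t a, uint8_t b, int thresh)
{
    if (abs(a - b) < thresh)
        return (uint8_t)((a + b + 1) >> 1);
    return a;
}

/* Select form: compute both candidate results unconditionally, then
 * pick one with a bit mask. Every thread runs the same instructions. */
uint8_t filter_select(uint8_t a, uint8_t b, int thresh)
{
    unsigned smoothed = (unsigned)(a + b + 1) >> 1;
    unsigned kept     = a;
    /* mask is all ones when the condition holds, zero otherwise */
    unsigned mask     = 0u - (unsigned)(abs(a - b) < thresh);

    return (uint8_t)((smoothed & mask) | (kept & ~mask));
}
```

Both functions return the same value for every input; the second just
pays for the unused result in exchange for a single execution path.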

CUDA has much better memory transfer performance than DirectX /
OpenGL; the examples show 3 Gbytes/sec (up and down), but this depends
heavily on the motherboard used.
Still, it remains a memory copy. If you need to do it often, it will
ruin performance.
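
To put a rough number on the copy cost, here is a small C sketch. The
frame format (8-bit YUV 4:2:0 at 1920x1080) and the reading of
"3 Gbytes/sec" as 3e9 bytes per second are my assumptions for the
example, and the function names are hypothetical.

```c
#include <stddef.h>

/* Bytes in one 8-bit YUV 4:2:0 frame: a full-resolution luma plane
 * plus two chroma planes at half resolution in each direction. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    return width * height + 2 * ((width + 1) / 2) * ((height + 1) / 2);
}

/* Milliseconds needed to copy n_bytes at a sustained transfer rate
 * of bytes_per_sec (latency and setup overhead are ignored). */
double transfer_ms(size_t n_bytes, double bytes_per_sec)
{
    return 1000.0 * (double)n_bytes / bytes_per_sec;
}
```

Under those assumptions a 1080p frame is 3110400 bytes, so
transfer_ms(yuv420_frame_bytes(1920, 1080), 3e9) comes out at roughly
1 ms in each direction. One upload and one download per frame is
affordable; bouncing the frame between CPU and GPU at every processing
stage multiplies that cost quickly.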

Memory bandwidth on the card is also huge (the bus is 384 bits wide, if
I remember correctly), so any massive memory operation will see an
excellent speedup.
Probably an advantage for HD processing.

You also don't need a Tesla to code with CUDA. Any GeForce 8800 will
probably do (it just needs different load dispatching between threads).

Someone tried to port the wavelet part of JPEG 2000 encoding/decoding
to the GPU (but not using CUDA).
Even with the latest card, there was no significant performance gain
from using the GPU. I don't know the exact reasons, but I do know that
bitstream parsing and arithmetic coding make up an important part of
the process and cannot be ported to the GPU easily.

I believe the real future of video codecs lies more in proprietary
chips dedicated to video processing (and, unfortunately, in ones that
are neither standardized nor public).

