[FFmpeg-devel] what is h264_idct_add8()?

Mon Sep 6 18:33:13 CEST 2010

Hi again,

On Mon, Sep 6, 2010 at 11:18 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> bash-3.2$ grep h264_idct_add8 ../libavcodec/*.c
> ../libavcodec/h264dsp.c:    c->h264_idct_add8      = ff_h264_idct_add8_c;
> ../libavcodec/h264idct.c:void ff_h264_idct_add8_c(uint8_t **dest,
> const int *block_offset, DCTELEM *block, int stride, const uint8_t
> nnzc[6*8]){
> bash-3.2$
>
> adding an abort() at the top of its implementations in h264dsp_mmx.c
> has no effect. What is the intention of this function? Can we remove
> it?

Mans referred me to the commit msg of r16207:
"Use the new idct functions (except chroma as it was slower in benchmarks"
I'm assuming this is the chroma code which wasn't used because it was
slower (?).

Having said that, could the lack of a speed gain be due to gcc
compiling it poorly? gcc unrolls the loop (which is fine, although
suboptimal), but the loop body,
for (i=16;i<24;i++) if (cond1 || cond2) { conditional code },
is compiled as this:

if (!cond1) goto end;
back:
[.. conditional code goes here ..]
next:
[.. the above is repeated for each loop iteration because gcc unrolls
the loop ..]
    RET
end:
if (cond2) goto back;
goto next;
[.. this too is repeated for each loop iteration ..]

Writing the asm out manually (and using a loop instead of
unrolling), with a single combined test (set up cond1, then
or cond1, block[16*i]), leads to a 50% (!!) speed increase for
add16intra, and might have a similar effect here (plus smaller
code = better cache?). If luma-only led to a 0.5% speed increase,
this might have an actually noticeable effect as well.
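
To illustrate the combined-test idea in C rather than asm, here is a
minimal sketch. It is not the libavcodec code: the function names are
made up, the IDCT call is replaced by a counter, nnzc is indexed
directly by i (no lookup table), and coefficients are assumed to be
int16_t. It only shows the two ways of writing the condition.

```c
#include <stdint.h>

/* Hypothetical constants: chroma blocks 16..23, 16 coefficients each. */
enum { FIRST_CHROMA = 16, END_CHROMA = 24, COEFFS_PER_BLOCK = 16 };

/* Short-circuit form: cond1 || cond2. A compiler may turn this into
 * two separate conditional jumps per iteration. The counter stands in
 * for the real "conditional code". */
static int count_two_tests(const uint8_t *nnzc, const int16_t *block)
{
    int n = 0;
    for (int i = FIRST_CHROMA; i < END_CHROMA; i++)
        if (nnzc[i] || block[COEFFS_PER_BLOCK * i])
            n++;
    return n;
}

/* Combined form: OR the two values together, then test the result
 * once. This mirrors "or cond1, block[16*i]" followed by one jump. */
static int count_one_test(const uint8_t *nnzc, const int16_t *block)
{
    int n = 0;
    for (int i = FIRST_CHROMA; i < END_CHROMA; i++) {
        int any = nnzc[i] | block[COEFFS_PER_BLOCK * i];
        if (any)
            n++;
    }
    return n;
}
```

Both functions select the same blocks, since the bitwise OR of the two
values is nonzero exactly when at least one of them is. Whether a given
compiler produces better code for the second form is something to
check in the generated asm, not something this sketch guarantees.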

Michael, do you still have the patch that enables using idct_add8()
for chroma (probably in h264.c), so I can test the performance of a
yasmified idct_add8 against the current code that doesn't use
idct_add8()?

(I hope to finish yasmification of h264_idct today or tomorrow.)

Ronald