[FFmpeg-devel] [RFC] Loop unrolling in C code for 'vector_fmul_*' functions

Mon Apr 21 00:01:43 CEST 2008

On Thursday 10 January 2008, Michael Niedermayer wrote:
> On Tue, Jan 08, 2008 at 02:20:07AM +0200, Siarhei Siamashka wrote:
> [...]
>
> > But at least for ARM, looks like the compiler is quite stupid and can't
> > schedule instructions properly as seen from the benchmark results (just
> > unrolling loop is not enough and some extra tweaks are needed
> > in 'vector_fmul_c_other_unrolled'). VFP coprocessor has a high result
> > latency (8 cycles), though throughput is quite good (1 cycle) and some
> > other nice features which can improve performance exist (documentation
> > for VFP can be found at http://www.arm.com). The compiler (gcc) does not
> > even try to reorder instructions and pipeline is just stalled most of the
> > time. I would not be surprised if the compiler screwed up and generated
> > something suboptimal on more complicated floating point stuff as well
> > (fft and imdct).
>
> Please submit reports to the gcc devels for every case of suboptimal code
> generated by gcc you stumble across!
> It's much better if gcc would be improved instead of everyone having to hand
> schedule c code.

Getting back to this issue.

It is good that I did not submit a report to the gcc devels, otherwise I would
have made an idiot of myself by submitting an invalid report :)

The problem is that

void vector_fmul_c_unrolled(float *dst, const float *src, int len)
{
    int i;
    for(i = 0; i < len; i += 8) {
        dst[i + 0] *= src[i + 0];
        dst[i + 1] *= src[i + 1];
        dst[i + 2] *= src[i + 2];
        dst[i + 3] *= src[i + 3];
        dst[i + 4] *= src[i + 4];
        dst[i + 5] *= src[i + 5];
        dst[i + 6] *= src[i + 6];
        dst[i + 7] *= src[i + 7];
    }
}

and

void vector_fmul_c_other_unrolled(float *dst, const float *src, int len)
{
    int i;
    register float tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    for(i = 0; i < len; i += 8) {
        tmp0 = src[i + 0];
        tmp1 = src[i + 1];
        tmp2 = src[i + 2];
        tmp3 = src[i + 3];
        tmp4 = src[i + 4];
        tmp5 = src[i + 5];
        tmp6 = src[i + 6];
        tmp7 = src[i + 7];
        dst[i + 0] *= tmp0;
        dst[i + 1] *= tmp1;
        dst[i + 2] *= tmp2;
        dst[i + 3] *= tmp3;
        dst[i + 4] *= tmp4;
        dst[i + 5] *= tmp5;
        dst[i + 6] *= tmp6;
        dst[i + 7] *= tmp7;
    }
}

are not actually equivalent.

The compiler has to take into account the case where the 'dst' and 'src'
buffers overlap. A store to 'dst[i + 0]' may change the value that is later
read from 'src[i + 1]', so the loads cannot be moved ahead of the stores, and
the code from the 'vector_fmul_c_unrolled' function cannot be scheduled the
way it is in 'vector_fmul_c_other_unrolled'.
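
To make the difference concrete, here is a minimal sketch that runs both
variants on deliberately overlapping buffers ('dst' = 'buf + 1', 'src' = 'buf').
The functions are shortened to 4 elements per iteration to keep the example
small, and the names 'fmul_unrolled', 'fmul_other_unrolled' and
'overlap_result' are made up for this illustration only:

```c
#include <assert.h>

/* Straightforward variant: each product is stored before the next load. */
static void fmul_unrolled(float *dst, const float *src, int len)
{
    int i;
    assert(len % 4 == 0);
    for (i = 0; i < len; i += 4) {
        dst[i + 0] *= src[i + 0];
        dst[i + 1] *= src[i + 1];
        dst[i + 2] *= src[i + 2];
        dst[i + 3] *= src[i + 3];
    }
}

/* Hand-scheduled variant: all 'src' loads happen before any store. */
static void fmul_other_unrolled(float *dst, const float *src, int len)
{
    int i;
    float tmp0, tmp1, tmp2, tmp3;
    assert(len % 4 == 0);
    for (i = 0; i < len; i += 4) {
        tmp0 = src[i + 0];
        tmp1 = src[i + 1];
        tmp2 = src[i + 2];
        tmp3 = src[i + 3];
        dst[i + 0] *= tmp0;
        dst[i + 1] *= tmp1;
        dst[i + 2] *= tmp2;
        dst[i + 3] *= tmp3;
    }
}

/* Hypothetical helper: runs one variant with overlapping buffers
 * and returns the last element that was written. */
static float overlap_result(int hand_scheduled)
{
    float buf[5] = { 2.0f, 2.0f, 2.0f, 2.0f, 2.0f };
    if (hand_scheduled)
        fmul_other_unrolled(buf + 1, buf, 4);
    else
        fmul_unrolled(buf + 1, buf, 4);
    return buf[4];
}
```

In the first variant every store feeds the next load, so the products
accumulate along the buffer. The second variant reads all the old values
first, so the two calls return different numbers. Because this overlapping
call is valid C, the compiler is not allowed to turn the first function into
the second one on its own.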

The fact that 'dst' and 'src' buffers don't overlap is one more useful
constraint which can be exploited when doing optimizations.

Those who are interested in this issue can look at the '-fargument-alias',
'-fargument-noalias' and '-fargument-noalias-global' gcc options.

Too bad that I did not find any gcc function attribute that could be used to
tell the compiler that the pointer arguments of one particular function do not
alias, without applying this setting to the whole project and risking breaking
something.
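
For what it is worth, the C99 'restrict' qualifier (not a function attribute,
but a per-pointer qualifier) is one way to state the no-overlap promise for a
single function. Below is a minimal sketch, assuming a compiler in C99 mode
or later (gcc also accepts the '__restrict__' spelling in other modes). The
names 'vector_fmul_restrict' and 'restrict_demo' are made up for this
example, the loop is shortened to 4 elements per iteration, and whether this
actually gives a better instruction schedule on ARM/VFP has not been measured
here:

```c
#include <assert.h>

/* 'restrict' promises that 'dst' and 'src' do not overlap during the call.
 * Calling this with overlapping buffers is undefined behaviour. */
static void vector_fmul_restrict(float *restrict dst,
                                 const float *restrict src, int len)
{
    int i;
    assert(len % 4 == 0);
    for (i = 0; i < len; i += 4) {
        dst[i + 0] *= src[i + 0];
        dst[i + 1] *= src[i + 1];
        dst[i + 2] *= src[i + 2];
        dst[i + 3] *= src[i + 3];
    }
}

/* Hypothetical helper: multiplies two separate buffers and
 * returns the sum of the results. */
static float restrict_demo(void)
{
    float dst[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float src[4] = { 2.0f, 2.0f, 2.0f, 2.0f };
    vector_fmul_restrict(dst, src, 4);
    return dst[0] + dst[1] + dst[2] + dst[3];
}
```

The promise is local to this one function, so nothing changes for the rest of
the project. The price is that every caller has to guarantee that the buffers
really are distinct.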

Anyway, at least in this case gcc was not at fault :)

-- 
Best regards,
Siarhei Siamashka