[FFmpeg-devel] [RFC] Loop unrolling in C code for 'vector_fmul_*' functions

Michael Niedermayer michaelni
Mon Apr 21 00:51:06 CEST 2008


On Mon, Apr 21, 2008 at 01:01:43AM +0300, Siarhei Siamashka wrote:
> On Thursday 10 January 2008, Michael Niedermayer wrote:
> > On Tue, Jan 08, 2008 at 02:20:07AM +0200, Siarhei Siamashka wrote:
> > [...]
> >
> > > But at least for ARM, looks like the compiler is quite stupid and can't
> > > schedule instructions properly as seen from the benchmark results (just
> > > unrolling loop is not enough and some extra tweaks are needed
> > > in 'vector_fmul_c_other_unrolled'). VFP coprocessor has a high result
> > > latency (8 cycles), though throughput is quite good (1 cycle) and some
> > > other nice features which can improve performance exist (documentation
> > > for VFP can be found at http://www.arm.com). The compiler (gcc) does not
> > > even try to reorder instructions and pipeline is just stalled most of the
> > > time. I would not be surprised if the compiler screwed up and generated
> > > something suboptimal on more complicated floating point stuff as well
> > > (fft and imdct).
> >
> > Please submit reports to the gcc devels for every case of suboptimal code
> > generated by gcc you stumble across!
> > It's much better if gcc gets improved instead of everyone having to
> > hand-schedule C code.
> 
> Getting back to this issue.
> 
> It is good that I did not submit a report to the gcc devels, otherwise I would
> have made an idiot of myself by submitting an invalid report :)
> 
> The problem is that
> 
> void vector_fmul_c_unrolled(float *dst, const float *src, int len)
> {
>     int i;
>     for(i = 0; i < len; i += 8) {
>         dst[i + 0] *= src[i + 0];
>         dst[i + 1] *= src[i + 1];
>         dst[i + 2] *= src[i + 2];
>         dst[i + 3] *= src[i + 3];
>         dst[i + 4] *= src[i + 4];
>         dst[i + 5] *= src[i + 5];
>         dst[i + 6] *= src[i + 6];
>         dst[i + 7] *= src[i + 7];
>     }
> }
> 
> and
> 
> void vector_fmul_c_other_unrolled(float *dst, const float *src, int len)
> {
>     int i;
>     register float tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
>     for(i = 0; i < len; i += 8) {
>         tmp0 = src[i + 0];
>         tmp1 = src[i + 1];
>         tmp2 = src[i + 2];
>         tmp3 = src[i + 3];
>         tmp4 = src[i + 4];
>         tmp5 = src[i + 5];
>         tmp6 = src[i + 6];
>         tmp7 = src[i + 7];
>         dst[i + 0] *= tmp0;
>         dst[i + 1] *= tmp1;
>         dst[i + 2] *= tmp2;
>         dst[i + 3] *= tmp3;
>         dst[i + 4] *= tmp4;
>         dst[i + 5] *= tmp5;
>         dst[i + 6] *= tmp6;
>         dst[i + 7] *= tmp7;
>     }
> }
> 
> are not actually identical.
> 
> The compiler needs to take into account the case when 'dst' and 
> 'src' buffers overlap and it is impossible to optimize the code 

As others have said, add restrict.
And if that doesn't help, submit a bug report to gcc.
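
To make the aliasing point concrete, here is a minimal sketch, assuming a
C99 compiler (restrict is a C99 keyword). The function names and the
4-element unroll are made up for this example and are not FFmpeg code; as in
the quoted functions, len is assumed to be a multiple of the unroll factor.
The first two functions mirror the two quoted variants and give different
results when dst and src overlap, which is why the compiler may not turn one
into the other. The third promises no overlap, so the compiler is free to
move the loads ahead of the stores.

```c
/* Same shape as vector_fmul_c_unrolled in the quoted mail, cut down to
 * 4 elements: each element is loaded, multiplied and stored in turn. */
void fmul_interleaved(float *dst, const float *src, int len)
{
    int i;
    for (i = 0; i < len; i += 4) {
        dst[i + 0] *= src[i + 0];
        dst[i + 1] *= src[i + 1];
        dst[i + 2] *= src[i + 2];
        dst[i + 3] *= src[i + 3];
    }
}

/* Same shape as vector_fmul_c_other_unrolled: all src loads happen
 * before any store to dst. */
void fmul_loads_first(float *dst, const float *src, int len)
{
    int i;
    float tmp0, tmp1, tmp2, tmp3;
    for (i = 0; i < len; i += 4) {
        tmp0 = src[i + 0];
        tmp1 = src[i + 1];
        tmp2 = src[i + 2];
        tmp3 = src[i + 3];
        dst[i + 0] *= tmp0;
        dst[i + 1] *= tmp1;
        dst[i + 2] *= tmp2;
        dst[i + 3] *= tmp3;
    }
}

/* Hypothetical restrict variant: the qualifiers tell the compiler that
 * dst and src never overlap, so it may reorder loads and stores.
 * Calling it with overlapping buffers is undefined behaviour. */
void fmul_restrict(float *restrict dst, const float *restrict src, int len)
{
    int i;
    for (i = 0; i < len; i += 4) {
        dst[i + 0] *= src[i + 0];
        dst[i + 1] *= src[i + 1];
        dst[i + 2] *= src[i + 2];
        dst[i + 3] *= src[i + 3];
    }
}

/* Returns 1 if the two non-restrict variants disagree for dst = src + 1. */
int overlap_results_differ(void)
{
    float a[5] = { 2, 2, 2, 2, 2 };
    float b[5] = { 2, 2, 2, 2, 2 };
    int i;

    fmul_interleaved(a + 1, a, 4); /* a becomes 2, 4, 8, 16, 32 */
    fmul_loads_first(b + 1, b, 4); /* b becomes 2, 4, 4, 4, 4 */
    for (i = 0; i < 5; i++)
        if (a[i] != b[i])
            return 1;
    return 0;
}
```

Whether gcc actually schedules the restrict variant better on ARM/VFP still
has to be checked in the generated assembly; the qualifier only removes the
aliasing obstacle, it does not guarantee good scheduling.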

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

When you are offended at any man's fault, turn to yourself and study your
own failings. Then you will forget your anger. -- Epictetus