# [Ffmpeg-devel] Native H.264 encoder (was: I'm giving up)

Panagiotis Issaris takis.issaris
Mon Dec 11 01:20:05 CET 2006

```
Hi Michael,

On Sat, Dec 09, 2006 at 02:47:02AM +0100, Michael Niedermayer wrote:
>[...]
> > +    c = pieces[2][0]-pieces[2][3];
> > +    b = pieces[2][1]+pieces[2][2];
> > +    d = pieces[2][1]-pieces[2][2];
> > +    block[0][2] = a+b;
> > +    block[2][2] = a-b;
> > +    block[1][2] = (c<<1)+d;
> > +    block[3][2] = c-(d<<1);
> > +
> > +    a = pieces[3][0]+pieces[3][3];
> > +    c = pieces[3][0]-pieces[3][3];
> > +    b = pieces[3][1]+pieces[3][2];
> > +    d = pieces[3][1]-pieces[3][2];
> > +    block[0][3] = a+b;
> > +    block[2][3] = a-b;
> > +    block[1][3] = (c<<1)+d;
> > +    block[3][3] = c-(d<<1);
> > +}
>
> i assume that a for loop would slow this down significantly? if so a macro would
> make that much smaller without speed loss ...

I've tested this like this:
START_TIMER
    DCTELEM pieces[4][4];
    DCTELEM a, b, c, d;
    int i;

    /* first pass: transform each column i of block into pieces */
    for (i=0; i<4; i++)
    {
        a = block[0][i]+block[3][i];
        c = block[0][i]-block[3][i];
        b = block[1][i]+block[2][i];
        d = block[1][i]-block[2][i];
        pieces[0][i] = a+b;
        pieces[2][i] = a-b;
        pieces[1][i] = (c<<1)+d;
        pieces[3][i] = c-(d<<1);
    }

    /* second pass: transform each row i of pieces, stored transposed in block */
    for (i=0; i<4; i++)
    {
        a = pieces[i][0]+pieces[i][3];
        c = pieces[i][0]-pieces[i][3];
        b = pieces[i][1]+pieces[i][2];
        d = pieces[i][1]-pieces[i][2];
        block[0][i] = a+b;
        block[2][i] = a-b;
        block[1][i] = (c<<1)+d;
        block[3][i] = c-(d<<1);
    }
STOP_TIMER("DCTFOR")

Resulting in:
...
924 dezicycles in DCTFOR, 8387443 runs, 1165 skips
frame= 1989 q=-1.0 Lsize=   11046kB time=66.4 bitrate=1363.5kbits/s

When using the DCT without loops:
...
914 dezicycles in DCT, 8387499 runs, 1109 skips
frame= 1989 q=-1.0 Lsize=   11046kB time=66.4 bitrate=1363.5kbits/s

But the results varied over a range bigger than the difference shown above. I got
runs of 924, 944 and more decicycles for the DCT without the loops as well. The
same goes for the DCT with the for loops: decicycles spent in the DCT varied from
910 to 980. So, to me, it appears that adding the loop doesn't hurt much. The
tests above took place on an Athlon64 X2 3800+. I will run the same tests
tomorrow on a P4 and see if it makes a considerable difference on that machine.

With friendly regards,
Takis

```
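The macro route Michael suggests could look roughly like the sketch below, which puts the 4-point butterfly in one macro and uses it both in a loop and fully unrolled, so the two variants can be checked against each other. This is only an illustration, not the code from the patch: the names `BUTTERFLY4`, `COL`, `ROW`, `dct4x4_loop`, `dct4x4_unrolled` and `dct4x4_variants_agree` are made up here, `DCTELEM` is assumed to be a 16-bit integer, and `2 * x` replaces `x << 1` to avoid left-shifting negative values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int16_t DCTELEM; /* assumption: 16-bit coefficients */

/* One 4-point butterfly: reads s0..s3, writes d0..d3 (hypothetical macro). */
#define BUTTERFLY4(s0, s1, s2, s3, d0, d1, d2, d3) do { \
    int a = (s0) + (s3);                                \
    int c = (s0) - (s3);                                \
    int b = (s1) + (s2);                                \
    int d = (s1) - (s2);                                \
    (d0) = a + b;                                       \
    (d2) = a - b;                                       \
    (d1) = 2 * c + d;                                   \
    (d3) = c - 2 * d;                                   \
} while (0)

/* Column pass: column i of block -> column i of pieces. */
#define COL(i) BUTTERFLY4(block[0][i], block[1][i], block[2][i], block[3][i], \
                          pieces[0][i], pieces[1][i], pieces[2][i], pieces[3][i])
/* Row pass: row i of pieces -> column i of block (result is transposed). */
#define ROW(i) BUTTERFLY4(pieces[i][0], pieces[i][1], pieces[i][2], pieces[i][3], \
                          block[0][i], block[1][i], block[2][i], block[3][i])

/* Loop variant, same data flow as the timed DCTFOR code. */
void dct4x4_loop(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];
    int i;

    for (i = 0; i < 4; i++)
        COL(i);
    for (i = 0; i < 4; i++)
        ROW(i);
}

/* Unrolled variant: the macro keeps the source short without a loop. */
void dct4x4_unrolled(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];

    COL(0); COL(1); COL(2); COL(3);
    ROW(0); ROW(1); ROW(2); ROW(3);
}

/* Returns 1 if both variants produce identical output on a sample block. */
int dct4x4_variants_agree(void)
{
    DCTELEM x[4][4], y[4][4];
    int r, k;

    for (r = 0; r < 4; r++)
        for (k = 0; k < 4; k++)
            x[r][k] = (DCTELEM)(r * 37 - k * 11 + (r ^ k) * 5 - 20);
    memcpy(y, x, sizeof(x));

    dct4x4_loop(x);
    dct4x4_unrolled(y);
    return memcmp(x, y, sizeof(x)) == 0;
}
```

Calling `dct4x4_variants_agree()` from a small test program is enough to confirm that the two forms compute the same coefficients. Whether the compiler unrolls the loop variant on its own, and so whether the two differ in speed at all, still has to be measured on the target machine.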