[FFmpeg-devel] [PATCH] Channels correlation

Thu Oct 29 22:30:15 CET 2009

L'octidi 8 brumaire, an CCXVIII, Reimar D?ffinger a ?crit?:
> Not really, it is mostly that almost no compilers are able to handle x86's
> stack-based floating-point unit even remotely sensibly.

I am working on x86_64, the ABI uses the SSE2 unit, which is register based
unless I am mistaken.

> However you should at least start with extracting the constant sample1[i] into
> a separate variable.

That is one of the variations I tried, and it resulted in slower code.

>			If you don't believe me, look at the assembler code and
> you'll see that gcc is _not_ able to figure that out on its own (actually it
> simply can't, since you didn't use restrict on any of the pointer variables,
> but even with that it won't).

If I have to submit a new version of the patch, I will add the restrict
qualifier wherever I can; this is an habit I do not have yet but that I
should adopt.

Although, in this particular case, it may require some code movement, since
as is, the arguments to feed_correl can not have the restrict qualifier,
since samples[12] can be the same as state->buf.

> I also guess that unrolling the innermost loop should give quite an advantage,
> particular if you make special versions for common values of nch1.

Unrolling loops and specializing common values gives speed but make the code
much more complex. This makes sense in speed-critical code, like the H.264
decoder, but I do not think this is the case here.

> Once again confirming my rule: If code speed changes strangely with source
> code changes, the compiler-generated assembler will make you weep.

I just compared the two versions: the current one and the one with an extra
variable for samples1[i]: the current version reads:

	movsd   (%rbp,%rdi,8), %xmm0
	mulsd   (%rbp,%rax,8), %xmm0

with the extra variable, it becomes:

	movapd  %xmm1, %xmm0

and the same mulsd, and %xmm1 is loaded outside the loop with

	movsd   (%r10), %xmm1
	addq    $8, %r10

The rest of the code is mostly identical; for some reason, the roles of %r13
and %r14 are exchanged, but that is all.

Nonetheless, the version with %xmm1 instead of the memory access is slower.
I do not understand why, and I am not sure I want to understand.

Do you insist on these optimizations? The current code is not overly
inefficient, and it has the advantage to be reasonably readable. I feel it
would be more reasonable to leave them aside until someone actually needs
this code to be faster.

The integer overflow still needs to be fixed, though. I would like your
advice as to how to react to it: print a diagnosis and abort the encoding?

Regards,

-- 
  Nicolas George
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20091029/17657a27/attachment.pgp>