Optimizing SSE2 code and beyond...

srimks · 09-04-2009

Given a flow of SSE2 instructions on a Linux x86_64 Intel 5345 processor, as below -

---------------(a)--------------
"movaps %xmm5, %xmm12 \n\t"
"mulsd %xmm15, %xmm12 \n\t"
"addsd %xmm2, %xmm12 \n\t"

"movaps %xmm9, %xmm0 \n\t"
"mulsd %xmm14, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"

"movaps %xmm11, %xmm0 \n\t"
"mulsd %xmm13, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"

"cvtsd2ss %xmm12, %xmm12 \n\t"
"movss %xmm12, (%r10,%rdi) \n\t"
----------------------------------

for a section of code such as -
-------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;
-------------

The above pattern handles the first "d * k", then the second "d * k", and finally the third "d * k", in that order.

Similarly for -

------------------------(b)-------------------

crd[apple] = (double)crdhello + d * k + d * k + d * k;

whose corresponding inline asm pattern is -

--------------
"movsd 40(%rsp), %xmm0 \n\t"
"mulsd %xmm15, %xmm0 \n\t"
"addsd %xmm4, %xmm0 \n\t"

"movaps %xmm6, %xmm12 \n\t"
"mulsd %xmm14, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"

"movaps %xmm7, %xmm12 \n\t"
"mulsd %xmm13, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"

"cvtsd2ss %xmm0, %xmm0 \n\t"
"movss %xmm0, 4(%r10,%rdi) \n\t"
---------------------------

and for the last pattern which is -

-------------(c)------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;
------

the Inline asm is -

------------------------------
"mulsd %xmm8, %xmm15 \n\t"
"addsd %xmm3, %xmm15 \n\t"

"mulsd %xmm10, %xmm14 \n\t"
"addsd %xmm14, %xmm15 \n\t"

"mulsd %xmm1, %xmm13 \n\t"
"addsd %xmm13, %xmm15 \n\t"

"cvtsd2ss %xmm15, %xmm13 \n\t"
"movss %xmm13, 8(%r10,%rdi) \n\t"
---
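For reference, here is a minimal C sketch of what each of the three patterns appears to compute. The function and parameter names are hypothetical (the original identifiers are not shown above), and the sketch assumes, as the register usage suggests, that the three "d * k" products use different operands and that crd is an array of float.

```c
/* Hypothetical names; the original identifiers are not shown in the post.
 * One call corresponds to one of the patterns (a), (b) or (c). */
static float transform_component(double base,
                                 double d0, double k0,
                                 double d1, double k1,
                                 double d2, double k2)
{
    double acc = base + d0 * k0;  /* mulsd, then addsd with the base term */
    acc += d1 * k1;               /* mulsd + addsd */
    acc += d2 * k2;               /* mulsd + addsd */
    return (float)acc;            /* cvtsd2ss; the caller's store is the movss */
}
```

Under these assumptions, the three results would be stored to consecutive float slots, which matches the 0, 4 and 8 byte offsets from (%r10,%rdi) seen in the movss instructions.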

I see that in (b) no aligned move is used for the first operand, since "movsd 40(%rsp), %xmm0" loads it from the stack instead. Moreover, in (c) none of the aligned or unaligned move instructions (movaps/movapd/movdqa or movups/movupd/movdqu) appear at all. Probably the reason is that only three parameters (X, Y, Z) exist here.

Suggestions needed:
(i) Is the use of "movsd 40(%rsp), %xmm0" correct from an optimization point of view, or should it be replaced with an aligned SSE move instruction?

(ii) Could the above patterns for (a), (b) & (c) be optimized further (sped up) with other SSE instructions, or replaced by SSE3 or SSSE3 instructions? If yes, which SSE3/SSSE3 instructions could be used to replace the above SSE2 instructions?

(iii) The algorithm here has 3 parameters, and the asm is shown only for these 3. Do I need to generate a dummy asm sequence for a 4th parameter (say W) with empty contents, to maintain DP FP alignment and allow effective vectorization?
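To make (iii) concrete, below is a rough sketch of what the padded variant might look like, written with SSE2 intrinsics rather than inline asm. The struct padded_xform, its column layout and the function name are assumptions made purely for illustration and are not taken from the original code. The W slot is padding set to zero so that the four components split into two packed pairs, (X, Y) and (Z, W).

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Hypothetical layout: every column is padded to 4 doubles (X, Y, Z, W = 0). */
typedef struct {
    double col[3][4];   /* col[j][i] is the "d" that multiplies k[j] for component i */
    double base[4];     /* additive term per component; base[3] is padding */
} padded_xform;

static void transform_packed(const padded_xform *t, const double k[3], float out[4])
{
    /* Two accumulators, each holding two doubles: (X, Y) and (Z, W). */
    __m128d xy = _mm_loadu_pd(&t->base[0]);
    __m128d zw = _mm_loadu_pd(&t->base[2]);

    for (int j = 0; j < 3; ++j) {
        __m128d kj = _mm_set1_pd(k[j]);  /* broadcast k[j] to both lanes */
        xy = _mm_add_pd(xy, _mm_mul_pd(_mm_loadu_pd(&t->col[j][0]), kj));  /* mulpd + addpd */
        zw = _mm_add_pd(zw, _mm_mul_pd(_mm_loadu_pd(&t->col[j][2]), kj));
    }

    /* cvtpd2ps converts two doubles into the two low floats of a register. */
    __m128 lo = _mm_cvtpd_ps(xy);
    __m128 hi = _mm_cvtpd_ps(zw);

    /* Combine (X, Y) and (Z, W) and store all four floats at once. */
    _mm_storeu_ps(out, _mm_movelh_ps(lo, hi));
}
```

The sketch uses unaligned loads and stores so that it does not depend on how the data is allocated; if the struct and output were guaranteed to be 16-byte aligned, _mm_load_pd and _mm_store_ps could be used instead. Note that the output now occupies four floats per point, so the caller's crd layout would have to allow for the extra W slot.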

~BR

srimks · 09-04-2009

In continuation: I had to generate asm for the algorithm with X, Y, Z parameters because the original C/C++ code has been written in such a way that it fails to address MCA (multi-core architecture) design needs. This means that if I add a 4th parameter with local scope within the file, then the optimization can be done by taking care of alignment and DP FP vectorization in groups of 2 or 4.

So I am looking for suggestions on queries (i), (ii) and (iii) above.