Intel® ISA Extensions

Optimizing SSE2 code and beyond...

New Contributor II
Given the following flow of SSE2 instructions on a Linux x86_64 Intel 5345 processor, labelled (a) -

"movaps %xmm5, %xmm12 \n\t"
"mulsd %xmm15, %xmm12 \n\t"
"addsd %xmm2, %xmm12 \n\t"

"movaps %xmm9, %xmm0 \n\t"
"mulsd %xmm14, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"

"movaps %xmm11, %xmm0 \n\t"
"mulsd %xmm13, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"

"cvtsd2ss %xmm12, %xmm12 \n\t"
"movss %xmm12, (%r10,%rdi) \n\t"

for the section of code -
crd[apple] = (double)crdhello + d * k + d * k + d * k;

Each three-instruction group above corresponds, in order, to the first "d * k", the second "d * k" and finally the third "d * k".
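For reference, a minimal C sketch of the scalar computation that the groups in (a) appear to implement. The names `combine_scalar`, `base`, `d0..d2` and `k0..k2` are hypothetical (the actual array indices are not shown in the expression above); the sketch only assumes a float input widened to double, three double products accumulated left to right, and a narrowing conversion back to float.

```c
/* Hypothetical scalar reference for one output component.
 * Assumed mapping to the asm above: each d*k product is one mulsd,
 * each accumulation is one addsd, and the final cast to float is
 * the cvtsd2ss that precedes the movss store. */
float combine_scalar(float base,
                     double d0, double k0,
                     double d1, double k1,
                     double d2, double k2)
{
    double acc = (double)base + d0 * k0;  /* mulsd + addsd */
    acc += d1 * k1;                       /* mulsd + addsd */
    acc += d2 * k2;                       /* mulsd + addsd */
    return (float)acc;                    /* cvtsd2ss      */
}
```

Compiling something like this with optimization enabled and comparing the generated code against the hand-written blocks is one way to check whether the inline asm gains anything.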

Similarly, for (b) -

crd[apple] = (double)crdhello + d * k + d * k + d * k;

the corresponding inline asm pattern is -

"movsd 40(%rsp), %xmm0 \n\t"
"mulsd %xmm15, %xmm0 \n\t"
"addsd %xmm4, %xmm0 \n\t"

"movaps %xmm6, %xmm12 \n\t"
"mulsd %xmm14, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"

"movaps %xmm7, %xmm12 \n\t"
"mulsd %xmm13, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"

"cvtsd2ss %xmm0, %xmm0 \n\t"
"movss %xmm0, 4(%r10,%rdi) \n\t"

and for the last pattern (c), which is -

crd[apple] = (double)crdhello + d * k + d * k + d * k;

the inline asm is -

"mulsd %xmm8, %xmm15 \n\t"
"addsd %xmm3, %xmm15 \n\t"

"mulsd %xmm10, %xmm14 \n\t"
"addsd %xmm14, %xmm15 \n\t"

"mulsd %xmm1, %xmm13 \n\t"
"addsd %xmm13, %xmm15 \n\t"

"cvtsd2ss %xmm15, %xmm13 \n\t"
"movss %xmm13, 8(%r10,%rdi) \n\t"

I see that in (b) no aligned move has been used, since "movsd 40(%rsp), %xmm0" is issued instead. Moreover, in (c) none of the SSE2 aligned or unaligned move instructions (movaps/movapd/movdqa or movups/movupd/movdqu) appear at all. Probably the reason is that only three parameters (X, Y, Z) exist here.
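As a point of reference for the observation above, a small sketch using SSE2 intrinsics from `<emmintrin.h>`. A movaps between two registers is only a register copy, so memory alignment matters only for the loads and stores. `_mm_load_sd` is the scalar 8-byte load (typically emitted as movsd) and has no 16-byte alignment requirement, whereas `_mm_load_pd` (typically movapd) requires a 16-byte-aligned address and `_mm_loadu_pd` (movupd) does not. The helper names below are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar load: reads one double into the low lane and zeroes the
 * high lane. No 16-byte alignment is needed. */
double load_low_scalar(const double *p)
{
    return _mm_cvtsd_f64(_mm_load_sd(p));
}

/* Packed aligned load: p must be 16-byte aligned. Returns p[0] + p[1]. */
double sum_pair_aligned(const double *p)
{
    __m128d v  = _mm_load_pd(p);              /* lanes = {p[0], p[1]} */
    __m128d hi = _mm_unpackhi_pd(v, v);       /* lanes = {p[1], p[1]} */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));  /* low lane = p[0] + p[1] */
}

/* Packed unaligned load: valid for any address. Returns p[0] + p[1]. */
double sum_pair_unaligned(const double *p)
{
    __m128d v  = _mm_loadu_pd(p);
    __m128d hi = _mm_unpackhi_pd(v, v);
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}
```

Under this reading, a lone "movsd 40(%rsp), %xmm0" is a scalar reload from the stack and not a missed alignment step.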

(i) Is the use of "movsd 40(%rsp), %xmm0" correct from an optimization point of view, or should it be replaced with an aligned SSE move instruction?

(ii) Could the above patterns for (a), (b) & (c) be optimized further (sped up) with other SSE instructions, or replaced by SSE3 or SSSE3 instructions? If yes, which SSE3/SSSE3 instructions could be used to replace the above SSE2 instructions?

(iii) The algorithm here has 3 parameters, and the asm is written only for these 3 parameters. Do I need to generate a dummy asm sequence for a 4th parameter (say W) with empty contents, in order to maintain DP FP alignment and effective vectorization?
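To make question (iii) concrete, here is a rough sketch with SSE2 intrinsics. It assumes, from reading the asm, that the same three multipliers (%xmm15, %xmm14, %xmm13) are shared by (a), (b) and (c), so that the three outputs have the form base + c0*s0 + c1*s1 + c2*s2 and can be computed two lanes at a time once a dummy W lane pads X, Y, Z to four. The type `Vec4d`, the function `combine_xyzw` and the data layout are hypothetical, not the original code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical padded column: X, Y, Z plus a dummy W lane (kept at 0.0),
 * 16-byte aligned so that aligned packed loads are valid. */
typedef struct {
    _Alignas(16) double v[4];
} Vec4d;

/* out[i] = (float)(base[i] + c0[i]*s0 + c1[i]*s1 + c2[i]*s2), i = 0..3.
 * base is assumed to be already widened to double.
 * out must point to 4 floats with 16-byte alignment. */
void combine_xyzw(float *out, const Vec4d *base,
                  const Vec4d *c0, const Vec4d *c1, const Vec4d *c2,
                  double s0, double s1, double s2)
{
    const __m128d vs0 = _mm_set1_pd(s0);
    const __m128d vs1 = _mm_set1_pd(s1);
    const __m128d vs2 = _mm_set1_pd(s2);

    /* Lanes X, Y: same left-to-right accumulation as the scalar asm. */
    __m128d xy = _mm_add_pd(_mm_load_pd(&base->v[0]),
                            _mm_mul_pd(_mm_load_pd(&c0->v[0]), vs0));
    xy = _mm_add_pd(xy, _mm_mul_pd(_mm_load_pd(&c1->v[0]), vs1));
    xy = _mm_add_pd(xy, _mm_mul_pd(_mm_load_pd(&c2->v[0]), vs2));

    /* Lanes Z, W (W is padding). */
    __m128d zw = _mm_add_pd(_mm_load_pd(&base->v[2]),
                            _mm_mul_pd(_mm_load_pd(&c0->v[2]), vs0));
    zw = _mm_add_pd(zw, _mm_mul_pd(_mm_load_pd(&c1->v[2]), vs1));
    zw = _mm_add_pd(zw, _mm_mul_pd(_mm_load_pd(&c2->v[2]), vs2));

    /* cvtpd2ps puts two floats in the low half of each register;
     * movelh combines the two low halves into {X, Y, Z, W}. */
    __m128 lo = _mm_cvtpd_ps(xy);
    __m128 hi = _mm_cvtpd_ps(zw);
    _mm_store_ps(out, _mm_movelh_ps(lo, hi));
}
```

Whether this is faster than the scalar sequences depends on the real data layout and on the cost of carrying the padding lane, neither of which is measured here.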

1 Reply
New Contributor II
In continuation: I had to generate asm for the algorithm's X, Y, Z parameters because the original C/C++ code has been written in such a way that it fails to address MCA (multi-core architecture) design needs. This means that if I have a 4th parameter with local scope within the file, then optimization can be done by taking care of alignment and 2- or 4-wide DP FP vectorization.

So I am looking for suggestions on queries (i), (ii) and (iii) above.