Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.
Announcements
Welcome to the Intel Community. If you get an answer you like, please mark it as an Accepted Solution to help others. Thank you!

## Optimizing SSE2 code and beyond...

New Contributor II
183 Views
Given an flow of SSE2 instructions on Linux x86_64 Intel 5345 processor as below -

---------------(a)--------------
"movaps %xmm5, %xmm12 \n\t"
"mulsd %xmm15, %xmm12 \n\t"

"movaps %xmm9, %xmm0 \n\t"
"mulsd %xmm14, %xmm0 \n\t"

"movaps %xmm11, %xmm0 \n\t"
"mulsd %xmm13, %xmm0 \n\t"

"cvtsd2ss %xmm12, %xmm12 \n\t"
"movss %xmm12, (%r10,%rdi) \n\t"
----------------------------------

for section of code as -
-------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;
-------------

The above pattern is for "d * k" followed by "d * k" and finally by "d * k" respectively.

Similarly for -

------------------------(b)-------------------

crd[apple] = (double)crdhello + d * k + d * k + d * k;

whose respective pattern of Inline asm is -

--------------
"movsd 40(%rsp), %xmm0 \n\t"
"mulsd %xmm15, %xmm0 \n\t"

"movaps %xmm6, %xmm12 \n\t"
"mulsd %xmm14, %xmm12 \n\t"

"movaps %xmm7, %xmm12 \n\t"
"mulsd %xmm13, %xmm12 \n\t"

"cvtsd2ss %xmm0, %xmm0 \n\t"
"movss %xmm0, 4(%r10,%rdi) \n\t"
---------------------------

and for the last pattern which is -

-------------(c)------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;

------

the Inline asm is -

------------------------------
"mulsd %xmm8, %xmm15 \n\t"

"mulsd %xmm10, %xmm14\n\t"

"mulsd %xmm1, %xmm13 \n\t"

"cvtsd2ss %xmm15, %xmm13 \n\t"
"movss %xmm13, 8(%r10,%rdi) \n\t"
---

I see that in (b) alignment haven't been done as "movsd 40(%rsp), %xmm0" has been called. Moreover in (c) none of the SSE2 alignment instructions like movaps/movapd/movdqa or movups/movupd/movdqu are being called. Probably since only three parameters(X, Y, Z)exist here, could be the reason.

Suggestionsneeded:
(i)
Can call of "movsd 40(%rsp), %xmm0" is correct from optimization point of view or it should be replaced with alignment SSE instructions call?

(ii) Could above patterns for (a), (b)& (c)be more optimized (speed-up) with some other SSE instructions OR replaced by SSE3 or SSSE3 instructions. If YES, can a pattern of SSE3/SSSE3 which instructions be used to replace above SSE2 instructions?

(iii) Since here the algorithm has 3 parameters and asm beingrepresented only for these 3 parameters. Do I need to generate a dummy asm representation of instructions for 4th. parameter (say W) which has void contents to maintain the DP FP alignment and effective vectorization?

~BR