
Optimizing SSE2 code and beyond...

srimks
New Contributor II
Given a flow of SSE2 instructions on a Linux x86_64 Intel Xeon 5345 processor, as below -


---------------(a)--------------
"movaps %xmm5, %xmm12 \n\t"
"mulsd %xmm15, %xmm12 \n\t"
"addsd %xmm2, %xmm12 \n\t"

"movaps %xmm9, %xmm0 \n\t"
"mulsd %xmm14, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"

"movaps %xmm11, %xmm0 \n\t"
"mulsd %xmm13, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"

"cvtsd2ss %xmm12, %xmm12 \n\t"
"movss %xmm12, (%r10,%rdi) \n\t"
----------------------------------

for a section of code such as -
-------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;
-------------

The above pattern handles the first "d * k" product, then the second, and finally the third, each as a mulsd/addsd pair accumulated into %xmm12.
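
For reference, the same computation can be sketched with SSE2 intrinsics instead of inline asm. This is only a minimal illustration: the names base, d1..d3, k1..k3 are hypothetical stand-ins for whatever values the compiler kept in %xmm2, %xmm5, %xmm15, etc.

----------------------------------
#include <emmintrin.h> /* SSE2 intrinsics */

static inline float pattern_a(double base, double d1, double k1,
                              double d2, double k2,
                              double d3, double k3)
{
    __m128d acc = _mm_set_sd(base);                                    /* %xmm12 seed  */
    acc = _mm_add_sd(acc, _mm_mul_sd(_mm_set_sd(d1), _mm_set_sd(k1))); /* pair 1       */
    acc = _mm_add_sd(acc, _mm_mul_sd(_mm_set_sd(d2), _mm_set_sd(k2))); /* pair 2       */
    acc = _mm_add_sd(acc, _mm_mul_sd(_mm_set_sd(d3), _mm_set_sd(k3))); /* pair 3       */
    return (float)_mm_cvtsd_f64(acc);                                  /* cvtsd2ss+movss */
}
----------------------------------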

Similarly for -

------------------------(b)-------------------

crd[apple] = (double)crdhello + d * k + d * k + d * k;

whose corresponding inline asm pattern is -

--------------
"movsd 40(%rsp), %xmm0 \n\t"
"mulsd %xmm15, %xmm0 \n\t"
"addsd%xmm4, %xmm0 \n\t"

"movaps %xmm6, %xmm12 \n\t"
"mulsd %xmm14, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"

"movaps %xmm7, %xmm12 \n\t"
"mulsd %xmm13, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"

"cvtsd2ss %xmm0, %xmm0 \n\t"
"movss %xmm0, 4(%r10,%rdi) \n\t"
---------------------------

and for the last pattern, which is -

-------------(c)------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;

------

the inline asm is -

------------------------------
"mulsd %xmm8, %xmm15 \n\t"
"addsd %xmm3, %xmm15 \n\t"

"mulsd %xmm10, %xmm14\n\t"
"addsd %xmm14, %xmm15 \n\t"

"mulsd %xmm1, %xmm13 \n\t"
"addsd %xmm13, %xmm15 \n\t"

"cvtsd2ss %xmm15, %xmm13 \n\t"
"movss %xmm13, 8(%r10,%rdi) \n\t"
---

I see that in (b) alignment hasn't been maintained, since "movsd 40(%rsp), %xmm0" is used. Moreover, in (c) none of the SSE2 move instructions, aligned (movaps/movapd/movdqa) or unaligned (movups/movupd/movdqu), are used. The reason is probably that only three parameters (X, Y, Z) exist here.
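
For what it's worth, the aligned-versus-scalar distinction can be sketched with intrinsics (names here are hypothetical; only the load forms matter):

----------------------------------
#include <emmintrin.h>

static double spill[2] __attribute__((aligned(16))); /* 16-byte aligned scratch */

static __m128d load_forms(const double *p) /* p plays the role of 40(%rsp) */
{
    __m128d a = _mm_load_sd(p);     /* movsd: scalar load, no alignment required */
    __m128d b = _mm_load_pd(spill); /* movapd: packed load, needs 16B alignment  */
    return _mm_add_pd(a, b);
}
----------------------------------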

Suggestions needed:
(i) Is the call "movsd 40(%rsp), %xmm0" correct from an optimization point of view, or should it be replaced with an aligned SSE load instruction?

(ii) Could the above patterns (a), (b) & (c) be further optimized (sped up) with other SSE instructions, or replaced by SSE3 or SSSE3 instructions? If YES, which SSE3/SSSE3 instructions could replace the above SSE2 instructions? (A sketch of one candidate replacement follows this list.)

(iii) The algorithm here has 3 parameters, and the asm represents only those 3. Do I need to generate a dummy asm representation for a 4th parameter (say W) with void contents, to maintain DP FP alignment and effective vectorization?
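
To make query (ii) concrete, here is a minimal sketch of the kind of SSE3 replacement I mean: two of the three scalar mulsd/addsd pairs folded into one packed mulpd followed by haddpd. The operand names are hypothetical, and whether this actually runs faster on the Xeon 5345 would need measuring, since haddpd is not especially cheap on that microarchitecture.

----------------------------------
#include <pmmintrin.h> /* SSE3: _mm_hadd_pd; compile with -msse3 */

static inline float pattern_sse3(double base, double d1, double k1,
                                 double d2, double k2,
                                 double d3, double k3)
{
    __m128d p = _mm_mul_pd(_mm_set_pd(d2, d1),  /* one mulpd gives {d1*k1, d2*k2}     */
                           _mm_set_pd(k2, k1));
    p = _mm_hadd_pd(p, p);                      /* haddpd: both lanes = d1*k1 + d2*k2 */
    return (float)(base + _mm_cvtsd_f64(p) + d3 * k3);
}
----------------------------------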


~BR
srimks
New Contributor II
In continuation: I had to generate asm for the algorithm's X, Y, Z parameters because the original C/C++ code was written in a way that fails to address MCA (multi-core architecture) design needs. In other words, if I introduce a 4th parameter with local scope within the file, then optimization can be done by taking care of alignment and 2- or 4-wide DP FP vectorization.
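
To make the 4th-parameter idea concrete, here is a minimal sketch (the layout and names are hypothetical) of padding X, Y, Z with a dummy W so each element is 16-byte aligned and packed 2-wide DP operations apply cleanly:

----------------------------------
#include <emmintrin.h>

typedef struct {
    double x, y, z;
    double w; /* dummy pad: keeps elements 32 bytes wide, 16-byte aligned */
} __attribute__((aligned(16))) vec4d;

static void mul_elems(vec4d *v, const vec4d *k, int n)
{
    for (int i = 0; i < n; ++i) {
        /* two aligned movapd/mulpd pairs per element: {x,y} then {z,w} */
        _mm_store_pd(&v[i].x, _mm_mul_pd(_mm_load_pd(&v[i].x), _mm_load_pd(&k[i].x)));
        _mm_store_pd(&v[i].z, _mm_mul_pd(_mm_load_pd(&v[i].z), _mm_load_pd(&k[i].z)));
    }
}
----------------------------------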

So I am looking for suggestions on the above queries (i), (ii) and (iii).