Given a flow of SSE2 instructions on a Linux x86_64 Intel 5345 processor, as below -
---------------(a)--------------
"movaps %xmm5, %xmm12 \n\t"
"mulsd %xmm15, %xmm12 \n\t"
"addsd %xmm2, %xmm12 \n\t"
"movaps %xmm9, %xmm0 \n\t"
"mulsd %xmm14, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"
"movaps %xmm11, %xmm0 \n\t"
"mulsd %xmm13, %xmm0 \n\t"
"addsd %xmm0, %xmm12 \n\t"
"cvtsd2ss %xmm12, %xmm12 \n\t"
"movss %xmm12, (%r10,%rdi) \n\t"
----------------------------------
for a section of code such as -
-------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;
-------------
The above pattern is for "d * k" followed by "d * k" and finally by "d * k" respectively.
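As a reading aid, the movaps/mulsd/addsd/cvtsd2ss/movss sequence in (a) can be sketched as one scalar C function. This is only an illustration of what the asm appears to compute. The function name `fma3_to_float` and all parameter names are hypothetical and do not come from the original source.

```c
/* Hypothetical scalar equivalent of pattern (a):
 *   result = (float)((double)base + a0*b0 + a1*b1 + a2*b2)
 * All names here are illustrative assumptions, not the original identifiers. */
static float fma3_to_float(float base,
                           double a0, double b0,
                           double a1, double b1,
                           double a2, double b2)
{
    double acc = (double)base + a0 * b0; /* first mulsd, then addsd with the base term */
    acc += a1 * b1;                      /* second mulsd + addsd */
    acc += a2 * b2;                      /* third mulsd + addsd */
    return (float)acc;                   /* cvtsd2ss, then movss stores the float */
}
```

Each product is formed in double precision and the sum is narrowed to single precision only once, at the end, which matches the single cvtsd2ss before the store.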
Similarly for -
------------------------(b)-------------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;
whose respective pattern of Inline asm is -
--------------
"movsd 40(%rsp), %xmm0 \n\t"
"mulsd %xmm15, %xmm0 \n\t"
"addsd %xmm4, %xmm0 \n\t"
"movaps %xmm6, %xmm12 \n\t"
"mulsd %xmm14, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"
"movaps %xmm7, %xmm12 \n\t"
"mulsd %xmm13, %xmm12 \n\t"
"addsd %xmm12, %xmm0 \n\t"
"cvtsd2ss %xmm0, %xmm0 \n\t"
"movss %xmm0, 4(%r10,%rdi) \n\t"
---------------------------
and for the last pattern which is -
-------------(c)------------
crd[apple] = (double)crdhello + d * k + d * k + d * k;
------
the Inline asm is -
------------------------------
"mulsd %xmm8, %xmm15 \n\t"
"addsd %xmm3, %xmm15 \n\t"
"mulsd %xmm10, %xmm14\n\t"
"addsd %xmm14, %xmm15 \n\t"
"mulsd %xmm1, %xmm13 \n\t"
"addsd %xmm13, %xmm15 \n\t"
"cvtsd2ss %xmm15, %xmm13 \n\t"
"movss %xmm13, 8(%r10,%rdi) \n\t"
---
I see that in (b) no aligned move has been used, since "movsd 40(%rsp), %xmm0" is called instead. Moreover, in (c) none of the SSE2 aligned or unaligned move instructions (movaps/movapd/movdqa or movups/movupd/movdqu) are called at all. The reason is probably that only three parameters (X, Y, Z) exist here.
Suggestions needed:
(i) Is the call of "movsd 40(%rsp), %xmm0" correct from an optimization point of view, or should it be replaced with an aligned SSE move instruction?
(ii) Could the above patterns for (a), (b) & (c) be optimized further (sped up) with some other SSE instructions, or replaced by SSE3 or SSSE3 instructions? If yes, which SSE3/SSSE3 instructions could be used to replace the above SSE2 instructions?
(iii) The algorithm here has 3 parameters, and the asm is represented only for these 3 parameters. Do I need to generate a dummy asm representation of instructions for a 4th parameter (say W) with void contents, to maintain DP FP alignment and effective vectorization?
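To make (iii) concrete, here is a minimal sketch in C with SSE2 intrinsics of what padding with a dummy W lane could look like. It assumes that the three statements share the same three multipliers (xmm15, xmm14 and xmm13 appear in all of (a), (b) and (c)). It also assumes that the coefficients can be regrouped per multiplier and that the output may be padded to four floats. The type `xform4`, the function `transform_packed` and the data layout are hypothetical, not taken from the original code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE header) */

/* Hypothetical padded layout: col[i] holds the coefficients that multiply v[i]
 * for X, Y, Z, plus a zero in the W slot; base holds the additive terms. */
typedef struct {
    double col[3][4];
    double base[4];
} xform4;

/* Computes out[j] = (float)(base[j] + col[0][j]*v[0] + col[1][j]*v[1] + col[2][j]*v[2])
 * for j = 0..3, two double lanes at a time. out must have room for 4 floats. */
static void transform_packed(const xform4 *m, const double v[3], float out[4])
{
    __m128d lo = _mm_loadu_pd(&m->base[0]);   /* X, Y accumulators */
    __m128d hi = _mm_loadu_pd(&m->base[2]);   /* Z and the dummy W */

    for (int i = 0; i < 3; ++i) {
        __m128d s = _mm_set1_pd(v[i]);        /* broadcast one multiplier */
        lo = _mm_add_pd(lo, _mm_mul_pd(_mm_loadu_pd(&m->col[i][0]), s)); /* mulpd + addpd */
        hi = _mm_add_pd(hi, _mm_mul_pd(_mm_loadu_pd(&m->col[i][2]), s));
    }

    __m128 flo = _mm_cvtpd_ps(lo);            /* cvtpd2ps: two floats in the low lanes */
    __m128 fhi = _mm_cvtpd_ps(hi);
    _mm_storeu_ps(out, _mm_movelh_ps(flo, fhi)); /* X, Y, Z, W as four floats */
}
```

The unaligned loads and store are used only so the sketch makes no alignment assumption. With 16-byte aligned data, `_mm_load_pd` and `_mm_store_ps` could be substituted. Whether this is actually faster than the scalar sequence on a given processor would need to be measured.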
~BR
1 Reply
In continuation: I had to generate asm for the algorithm's X, Y, Z parameters, since the original C/C++ code has been written in such a way that it fails to address MCA (multi-core architecture) design needs. This means that if I had a 4th parameter with local scope within the file, then optimization could be done by taking care of alignment and DP FP 2- or 4-wide vectorization.
So looking for some suggestions for above (i), (ii) and (iii) queries.