I wrote a class to perform basic arithmetic on 3-D vectors, and the SSE instructions seem to improve performance.
The class is simple (some parts of the code are omitted):
[cpp]struct Vector3D
{
    float x, y, z, t;
} __attribute__((aligned (16)));[/cpp]
I added an extra member "t" so that I can treat a Vector3D structure as a single XMMWORD.
One of its member functions is simple, too:
[cpp]Vector3D& Vector3D::operator += (Vector3D const & ano)
{
    x += ano.x;
    y += ano.y;
    z += ano.z;
    t += ano.t;
    return *this;
}[/cpp]
To make it bit-copyable, neither a copy constructor nor an assignment operator is declared.
When I compiled it with ICC, I was disappointed to find a large block of "movss" and "addss" instructions instead of the expected "addps".
So what's the problem?
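The layout assumption in the question (four floats filling exactly one 16-byte XMM word) can be checked at compile time. The following is a minimal sketch, not code from the thread: it assumes a C++11 compiler and uses `alignas(16)` in place of `__attribute__((aligned(16)))`; the helper names `is_aligned16` and `check_layout` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the struct in the question; alignas(16) is the standard
// C++11 spelling of __attribute__((aligned(16))) (an assumption here).
struct alignas(16) Vector3D {
    float x, y, z, t;  // t is padding so the struct fills one XMM word
};

// Four packed floats must occupy exactly one 16-byte SSE register.
static_assert(sizeof(Vector3D) == 16, "Vector3D must be exactly 16 bytes");
static_assert(alignof(Vector3D) == 16, "Vector3D must be 16-byte aligned");

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: runtime sanity check for one particular object.
inline void check_layout(const Vector3D& v) {
    assert(is_aligned16(&v));
}
```

If either `static_assert` fails, aligned packed loads such as `movaps` are not safe on the object, regardless of what the vectorizer decides.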
Quoting - hpsmouse
When I compiled it with ICC, I was disappointed to find a large block of "movss" and "addss" instructions instead of the expected "addps".
So what's the problem?
Hi!
I had a look at the code and it looks like ICC should generate SSE-enabled code for the snippets you have posted. At least my ICC on my box does it:
00000000004009f0 <_ZN8Vector3DpLERKS_>:
movaps (%rdi),%xmm0
addps (%rsi),%xmm0
movaps %xmm0,(%rdi)
mov %rdi,%rax
retq
mov %rsi,%rsi
Since you only provided the definition of "operator+=", I suspect that you've declared the function as virtual. In that case, the floats in your struct are no longer 16-byte aligned (the object then starts with a hidden vtable pointer needed for virtual dispatch), so the compiler cannot generate packed SSE code for it. If there's a "virtual" on the declaration of "operator+=", please remove it and the compiler will do what you expect.
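The effect of a virtual function on the layout can be observed directly. This is a hedged sketch, assuming a typical ABI (such as the Itanium C++ ABI used on Linux) that places the hidden vtable pointer at the start of the object; `PlainVec`, `VirtualVec`, `offset_of_x` and `check_layouts` are hypothetical names, and `alignas(16)` stands in for the GCC-style attribute.

```cpp
#include <cassert>
#include <cstddef>

// Non-virtual version: the four floats start at offset 0.
struct alignas(16) PlainVec {
    float x, y, z, t;
    PlainVec& operator+=(const PlainVec& o) {
        x += o.x; y += o.y; z += o.z; t += o.t;
        return *this;
    }
};

// Virtual version: the compiler adds a hidden pointer to the object.
struct alignas(16) VirtualVec {
    float x, y, z, t;
    virtual VirtualVec& operator+=(const VirtualVec& o) {
        x += o.x; y += o.y; z += o.z; t += o.t;
        return *this;
    }
};

// Byte offset of member x from the start of the object.
template <typename V>
std::size_t offset_of_x(const V& v) {
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&v.x) -
        reinterpret_cast<const char*>(&v));
}

inline void check_layouts() {
    PlainVec p{};
    VirtualVec v{};
    assert(offset_of_x(p) == 0);  // floats begin on the 16-byte boundary
    assert(offset_of_x(v) != 0);  // floats pushed back by the hidden pointer
}
```

On a 64-bit target the floats of `VirtualVec` would typically start at offset 8, so an aligned 16-byte load of `x..t` is no longer possible even though the object itself is 16-byte aligned.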
Hope that helps.
Cheers,
-michael
Quoting - Michael Klemm, Intel
As you only provided the definition of "operator+=", I suspect that you've declared the function as a virtual function. In this case, the floats in your struct are no longer aligned (there's a set of pointers at the beginning of the struct needed for C++ features). Hence, the compiler cannot generate SSE code for your code. If there's a virtual at the declaration of "operator+=", please remove it and the compiler will do what you expect.
Hope that helps.
Cheers,
-michael
Hello,
Thanks for your help.
But the fact is that I designed it as a POD type: it has no virtual functions and is bit-copyable.
Just now I ran a test that could not be simpler:
[shell]~/projects/atest$ cat test.hpp
struct Vector3D
{
    float x, y, z, t;
    Vector3D& operator += (Vector3D const& ano);
} __attribute__((aligned(16)));
~/projects/atest$ cat test.cpp
#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& ano)
{
    x += ano.x;
    y += ano.y;
    z += ano.z;
    t += ano.t;
    return *this;
}
~/projects/atest$ icpc -Wall -c test.cpp[/shell]
And this is the result:
[plain]Disassembly of section .text:
0000000000000000 <_ZN8Vector3DpLERKS_>:
0: f3 0f 10 07 movss xmm0,DWORD PTR [rdi]
4: f3 0f 58 06 addss xmm0,DWORD PTR [rsi]
8: f3 0f 10 4f 04 movss xmm1,DWORD PTR [rdi+0x4]
d: f3 0f 10 57 08 movss xmm2,DWORD PTR [rdi+0x8]
12: f3 0f 10 5f 0c movss xmm3,DWORD PTR [rdi+0xc]
17: f3 0f 11 07 movss DWORD PTR [rdi],xmm0
1b: f3 0f 58 4e 04 addss xmm1,DWORD PTR [rsi+0x4]
20: f3 0f 11 4f 04 movss DWORD PTR [rdi+0x4],xmm1
25: f3 0f 58 56 08 addss xmm2,DWORD PTR [rsi+0x8]
2a: 48 89 f8 mov rax,rdi
2d: f3 0f 11 57 08 movss DWORD PTR [rdi+0x8],xmm2
32: f3 0f 58 5e 0c addss xmm3,DWORD PTR [rsi+0xc]
37: f3 0f 11 5f 0c movss DWORD PTR [rdi+0xc],xmm3
3c: c3 ret
3d: 48 89 f6 mov rsi,rsi[/plain]
By the way, my system is Ubuntu 9.04 (64-bit) on a Core2 P8400, and the ICC version is 11.1.059. I installed both the IA-32 and the Intel 64 versions and used the latter for the test.
Sorry, the bad connection made me double-click the post button, resulting in two replies...
Hi!
OK, here we are... If I use the code as you supplied it in your last post, my version of ICC also generates the assembly code you've seen. If, however, I add a simple main function
[cpp]int main(int argc, char** argv) {
    Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
    Vector3D v2 = {2.0, 2.0, 2.0, -2.0};
    v2 += v1;
}[/cpp]
I get the vectorized code. It looks like the compiler only optimizes the code if the operator is actually called somewhere. I guess you have two options to work around this: (a) add a static dummy function that adds two vectors, or (b) use an intrinsic function to explicitly emit an SSE addps.
Cheers,
-michael
Quoting - Michael Klemm, Intel
Hi!
OK, here we are... If I use the code as you have supplied it in your last post, my version of ICC also generates the assembly code you've seen. If I, however, add a simple main method
[cpp]int main(int argc, char** argv) {
    Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
    Vector3D v2 = {2.0, 2.0, 2.0, -2.0};
    v2 += v1;
}[/cpp]
I get the vectorized code. Looks like the compiler only optimizes the code if the operator is actually called somewhere in the code. I guess that you'll have two options to work around this: (a) place a static dummy function that adds two vectors, (b) use an intrinsic function to explicitly enable an SSE addps in the code.
Cheers,
-michael
Actually, I think what's going on here is that the compiler is inlining the "+=" and can therefore see that v1 and v2 do not overlap. With the "-vec-report3" option in the original case, it complains about numerous assumed dependences. Adding the "restrict" keyword (and the requisite "-restrict" command-line option) fixes the dependence problems, but then it complains that vectorization may not be efficient. Adding "#pragma vector always" fixes that, and it converts the adds to a single addps:
$ cat test2.cpp
#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& restrict ano)
{
#pragma vector always
x += ano.x;
y += ano.y;
z += ano.z;
t += ano.t;
return *this;
}
$ icc -S -restrict test2.cpp -vec-report3
test2.cpp(6): (col. 3) remark: BLOCK WAS VECTORIZED.
$ fgrep add test2.s
addps %xmm0, %xmm1 #6.3
$
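Why the assumed dependences matter can be shown with plain float pointers. The sketch below is an illustration under stated assumptions, not code from the thread: `add4_scalar`, `add4_vector` and `add4_restrict` are hypothetical names, and `__restrict` is a common compiler extension in C++ (accepted by GCC, Clang, ICC and MSVC) rather than standard C++. When the destination overlaps the source, the four scalar adds and a single packed add give different results, so the compiler may only merge them if it can prove, or is told, that the operands do not overlap.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Scalar version: each store completes before the next load, so with
// overlapping dst/src later elements see already-updated values.
void add4_scalar(float* dst, const float* src) {
    assert(dst != nullptr && src != nullptr);
    dst[0] += src[0];
    dst[1] += src[1];
    dst[2] += src[2];
    dst[3] += src[3];
}

// Packed version: all inputs are loaded before anything is stored.
// Unaligned loads/stores are used so overlapping pointers can be tested.
void add4_vector(float* dst, const float* src) {
    assert(dst != nullptr && src != nullptr);
    __m128 a = _mm_loadu_ps(dst);
    __m128 b = _mm_loadu_ps(src);
    _mm_storeu_ps(dst, _mm_add_ps(a, b));
}

// __restrict promises the compiler that dst and src never overlap,
// which removes the assumed dependences. Calling this with overlapping
// pointers would be undefined behavior.
void add4_restrict(float* __restrict dst, const float* __restrict src) {
    for (int i = 0; i < 4; ++i) {
        dst[i] += src[i];
    }
}
```

With a buffer `{1, 1, 1, 1, 1}`, `dst = buf + 1` and `src = buf`, the scalar version leaves `{1, 2, 3, 4, 5}` while the packed version leaves `{1, 2, 2, 2, 2}`.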
Dale
Well... the problem seems to be getting complicated...
I tried renaming "main()" to "fun()", and the compiler no longer vectorizes the code.
To force it to do the computation, I moved v1 and v2 to global scope. Although the code in operator += is now vectorized, the copy inlined into main() still is not...
[cpp]#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& ano)
{
    x += ano.x;
    y += ano.y;
    z += ano.z;
    t += ano.t;
    return *this;
}
Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
Vector3D v2 = {2.0, 2.0, 2.0, -2.0};
int main(int argc, char** argv)
{
    v2 += v1;
    return 0;
}[/cpp]
[plain]Disassembly of section .text:
0000000000000000 <main>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 e4 80 and rsp,0xffffffffffffff80
8: 48 81 ec 80 00 00 00 sub rsp,0x80
f: bf 03 00 00 00 mov edi,0x3
14: e8 00 00 00 00 call 19
15: R_X86_64_PC32 __intel_new_proc_init-0x4
19: f3 0f 10 05 00 00 00 movss xmm0,DWORD PTR [rip+0x0] # 21
20: 00
1d: R_X86_64_PC32 v2-0x4
21: f3 0f 58 05 00 00 00 addss xmm0,DWORD PTR [rip+0x0] # 29
28: 00
25: R_X86_64_PC32 v1-0x4
29: f3 0f 10 0d 00 00 00 movss xmm1,DWORD PTR [rip+0x0] # 31
30: 00
2d: R_X86_64_PC32 v2
31: f3 0f 58 0d 00 00 00 addss xmm1,DWORD PTR [rip+0x0] # 39
38: 00
35: R_X86_64_PC32 v1
39: f3 0f 10 15 00 00 00 movss xmm2,DWORD PTR [rip+0x0] # 41
40: 00
3d: R_X86_64_PC32 v2+0x4
41: f3 0f 58 15 00 00 00 addss xmm2,DWORD PTR [rip+0x0] # 49
48: 00
45: R_X86_64_PC32 v1+0x4
49: f3 0f 10 1d 00 00 00 movss xmm3,DWORD PTR [rip+0x0] # 51
50: 00
4d: R_X86_64_PC32 v2+0x8
51: f3 0f 58 1d 00 00 00 addss xmm3,DWORD PTR [rip+0x0] # 59
58: 00
55: R_X86_64_PC32 v1+0x8
59: f3 0f 11 05 00 00 00 movss DWORD PTR [rip+0x0],xmm0 # 61
60: 00
5d: R_X86_64_PC32 v2-0x4
61: f3 0f 11 0d 00 00 00 movss DWORD PTR [rip+0x0],xmm1 # 69
68: 00
65: R_X86_64_PC32 v2
69: f3 0f 11 15 00 00 00 movss DWORD PTR [rip+0x0],xmm2 # 71
70: 00
6d: R_X86_64_PC32 v2+0x4
71: f3 0f 11 1d 00 00 00 movss DWORD PTR [rip+0x0],xmm3 # 79
78: 00
75: R_X86_64_PC32 v2+0x8
79: 33 c0 xor eax,eax
7b: 0f ae 1c 24 stmxcsr DWORD PTR [rsp]
7f: 81 0c 24 40 80 00 00 or DWORD PTR [rsp],0x8040
86: 0f ae 14 24 ldmxcsr DWORD PTR [rsp]
8a: 48 89 ec mov rsp,rbp
8d: 5d pop rbp
8e: c3 ret
8f: 90 nop
0000000000000090 <_ZN8Vector3DpLERKS_>:
90: 0f 28 07 movaps xmm0,XMMWORD PTR [rdi]
93: 0f 58 06 addps xmm0,XMMWORD PTR [rsi]
96: 0f 29 07 movaps XMMWORD PTR [rdi],xmm0
99: 48 89 f8 mov rax,rdi
9c: c3 ret
9d: 48 89 f6 mov rsi,rsi[/plain]
It seems ICC does not dare to vectorize the code unless I force it with "#pragma vector always".
Quoting - hpsmouse
Well... the problem seems to be getting complicated...
It seems ICC does not dare to vectorize the code unless I force it with "#pragma vector always".
Hi,
Personal note: Compilers are amongst the most complicated parts of software out there. Engineering a good compiler needs a lot of black magic, pardon, experience :-).
But back to your problem: Dale seems to be right. If you mind adding the pragma to your code, you can still use SSE intrinsics and write C code that maps directly to the assembly code you want:
[cpp]#include <xmmintrin.h>  // __m128, _mm_add_ps

Vector3D& Vector3D::operator += (Vector3D const& ano) {
    // Reinterpret both operands as packed 4-float SSE values
    // (requires the 16-byte alignment declared on Vector3D).
    __m128 *this128 = (__m128 *) this;
    const __m128 *ano128 = (const __m128 *) &ano;
    *this128 = _mm_add_ps(*this128, *ano128);
    return *this;
}[/cpp]
The compiler translates this into the following assembly code fragment:
[plain]0000000000400b60 <_ZN8Vector3DpLERKS_>:
400b60: 0f 28 07 movaps (%rdi),%xmm0
400b63: 0f 58 06 addps (%rsi),%xmm0
400b66: 0f 29 07 movaps %xmm0,(%rdi)
400b69: 48 89 f8 mov %rdi,%rax
400b6c: c3 retq
400b6d: 48 89 f6 mov %rsi,%rsi[/plain]
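A variant of the intrinsic approach, as a hedged and self-contained sketch: instead of casting the object pointers to `__m128*`, it goes through `_mm_load_ps` and `_mm_store_ps`, which makes the 16-byte alignment requirement explicit. The `alignas(16)` spelling and the alignment assert are assumptions added for illustration and are not part of the code posted in this thread; the sketch also relies on `x`, `y`, `z`, `t` being laid out contiguously.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // _mm_load_ps, _mm_add_ps, _mm_store_ps

struct alignas(16) Vector3D {
    float x, y, z, t;
    Vector3D& operator+=(const Vector3D& ano);
};

Vector3D& Vector3D::operator+=(const Vector3D& ano) {
    // _mm_load_ps and _mm_store_ps require 16-byte aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(this) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&ano) % 16 == 0);

    __m128 a = _mm_load_ps(&x);          // loads x, y, z, t of *this
    __m128 b = _mm_load_ps(&ano.x);      // loads x, y, z, t of ano
    _mm_store_ps(&x, _mm_add_ps(a, b));  // one packed add, one store
    return *this;
}
```

Used like the `main` example earlier in the thread, `v2 += v1` with `v1 = {1, 1, 1, -1}` and `v2 = {2, 2, 2, -2}` leaves `v2` holding `{3, 3, 3, -3}`.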
Cheers,
-michael
Quoting - Michael Klemm (Intel)
Hi,
Personal note: Compilers are amongst the most complicated parts of software out there. Engineering a good compiler needs a lot of black magic, pardon, experience :-).
But back to your problem: Dale seems to be right. If you mind adding the pragma to your code, you can still use SSE intrinsics and write C code that maps directly to the assembly code you want:
Cheers,
-michael
Yes, so this is likely to be the final solution, and I still have to write specialized code for different platforms.
That's OK. It can't be a big problem.
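One possible shape for such platform-specific code, sketched under assumptions: the SSE path is selected with the predefined macros `__SSE__` (GCC, Clang, ICC) or `_M_X64` (MSVC), and a plain scalar path is used everywhere else. `VEC_HAVE_SSE` and `add_assign` are hypothetical names, and `alignas(16)` stands in for the GCC-style attribute.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define VEC_HAVE_SSE 1  // hypothetical feature macro
#else
#define VEC_HAVE_SSE 0
#endif

struct alignas(16) Vector3D {
    float x, y, z, t;
};

// Adds b to a in place; one packed add where SSE is available,
// four scalar adds otherwise.
inline Vector3D& add_assign(Vector3D& a, const Vector3D& b) {
    // Both paths assume the 16-byte alignment declared on the struct.
    assert(reinterpret_cast<std::uintptr_t>(&a) % 16 == 0);
#if VEC_HAVE_SSE
    _mm_store_ps(&a.x, _mm_add_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x)));
#else
    a.x += b.x;
    a.y += b.y;
    a.z += b.z;
    a.t += b.t;
#endif
    return a;
}
```

Keeping the selection inside one function means callers are identical on every platform; only the body of `add_assign` differs.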
And thanks for your help again~
Would it be possible for the compiler team to add a directive to work around this kind of problem when the dependencies can't be resolved explicitly? The example presented here was small, but in practice, this could have been hundreds of lines of code that would have been hard to convert to SSE intrinsics directly.
Quoting - jeff_keasler
Would it be possible for the compiler team to add a directive to work around this kind of problem when the dependencies can't be resolved explicitly? The example presented here was small, but in practice, this could have been hundreds of lines of code that would have been hard to convert to SSE intrinsics directly.
Well, there may be one already. Can you be more specific about exactly what you want the pragma to do?
Dale
Quoting - Dale Schouten (Intel)
Well, there may be one already. Can you be more specific about exactly what you want the pragma to do?
Dale
Well, what I'd really like to understand is why this doesn't work (sorry for the overkill declaration of Vector3D_aligned):
struct Vector3D
{
float x, y, z, t;
Vector3D& operator += (Vector3D const& ano);
} __attribute__((aligned (16)));
typedef Vector3D __attribute__ ((aligned (16))) Vector3D_aligned ;
Vector3D& Vector3D::operator += (Vector3D const& ano)
{
Vector3D_aligned * restrict surrogateThis = this ;
const Vector3D_aligned * restrict surrogateAno = &ano ;
__assume_aligned(surrogateThis, 16) ;
__assume_aligned(surrogateAno, 16) ;
surrogateThis->x += surrogateAno->x;
surrogateThis->y += surrogateAno->y;
surrogateThis->z += surrogateAno->z;
surrogateThis->t += surrogateAno->t;
return *this;
}
Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
Vector3D v2 = {2.0, 2.0, 2.0, -2.0};
int main(int argc, char** argv) {
v2 += v1;
return 0;
}
When I compile the above with -S -O3 -restrict, I get a generated subroutine that never gets used:
# -- Begin _ZN8Vector3DpLERKS_
# mark_begin;
.align 16,0x90
.globl _ZN8Vector3DpLERKS_
_ZN8Vector3DpLERKS_:
# parameter 1: %rdi
# parameter 2: %rsi
..B2.1: # Preds ..B2.0
..___tag_value__ZN8Vector3DpLERKS_.10: #11.1
movaps (%rdi), %xmm0 #16.3
addps (%rsi), %xmm0 #16.3
movaps %xmm0, (%rdi) #16.3
movq %rdi, %rax #20.10
ret #20.10
.align 16,0x90
..___tag_value__ZN8Vector3DpLERKS_.11: #
# LOE
# mark_end;
.type _ZN8Vector3DpLERKS_,@function
.size _ZN8Vector3DpLERKS_,.-_ZN8Vector3DpLERKS_
.data
# -- End _ZN8Vector3DpLERKS_
Thanks,
-Jeff
Quoting - jeff_keasler
Well, what I'd really like is to understand is why this doesn't work (sorry for the overkill declaration of Vector3D_aligned):
Well, as near as I can tell it does work:
$ icpc -c -restrict -vec-report3 test3.cpp
test3.cpp(14): (col. 5) remark: BLOCK WAS VECTORIZED.
$
Where test3.cpp is the code above and in my case line 14 is the beginning of the adds in operator '+='. The reason for the unused function is that it gets inlined into main, but because it's visible outside of this object it needs to exist, in case you linked in another module that calls it.
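The linkage point can be illustrated with a small sketch (hypothetical names `add_external`, `add_internal` and `demo`; `alignas(16)` stands in for the GCC-style attribute). A function with external linkage must keep an out-of-line copy because another translation unit might call it, whereas a function with internal linkage can be dropped by the compiler once every call to it has been inlined.

```cpp
#include <cassert>

struct alignas(16) Vector3D {
    float x, y, z, t;
};

// External linkage: other translation units may call this, so an
// out-of-line copy is emitted even if every call here is inlined.
Vector3D& add_external(Vector3D& a, const Vector3D& b) {
    a.x += b.x; a.y += b.y; a.z += b.z; a.t += b.t;
    return a;
}

namespace {
// Internal linkage: visible only in this file, so the compiler is free
// to omit the standalone copy once all calls have been inlined.
Vector3D& add_internal(Vector3D& a, const Vector3D& b) {
    a.x += b.x; a.y += b.y; a.z += b.z; a.t += b.t;
    return a;
}
}  // namespace

// Calls both helpers; the results are identical, only the symbols
// left in the object file may differ.
float demo() {
    Vector3D v1 = {1.0f, 1.0f, 1.0f, -1.0f};
    Vector3D v2 = {2.0f, 2.0f, 2.0f, -2.0f};
    add_external(v2, v1);  // v2 is now {3, 3, 3, -3}
    add_internal(v2, v1);  // v2 is now {4, 4, 4, -4}
    assert(v2.x == 4.0f && v2.t == -4.0f);
    return v2.x;
}
```

Whether the standalone copy of `add_internal` actually disappears depends on the compiler and optimization level; inspecting the symbol table of the object file (for example with `nm`) is one way to check.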
Does that answer your question?
Dale
Quoting - Dale Schouten (Intel)
Well, as near as I can tell it does work:
$ icpc -c -restrict -vec-report3 test3.cpp
test3.cpp(14): (col. 5) remark: BLOCK WAS VECTORIZED.
$
Where test3.cpp is the code above and in my case line 14 is the beginning of the adds in operator '+='. The reason for the unused function is that it gets inlined into main, but because it's visible outside of this object it needs to exist, in case you linked in another module that calls it.
Does that answer your question?
Dale
I'm using icpc version 11.1.064
Change the -c to -S in your compile line to see the assembly language output. In the main routine, it is vectorizing to produce the code in response #2 (inefficient), yet creating the assembly subroutine as described in response #12 that never gets called. In other words, it generates efficient assembly for the C code in response #12, but never calls it or inlines it into the main routine.
I think this is a bug in the compiler. If it were fixed, it would address the original problem hpsmouse was seeing.
-Jeff
Quoting - jeff_keasler
I'm using icpc version 11.1.064
Change the -c to -S in your compile line to see the assembly language output. In the main routine, it is vectorizing to produce the code in response #2 (inefficient), yet creating the assembly subroutine as described in response #12 that never gets called. In other words, it generates efficient assembly for the C code in response #12, but never calls it or inlines it into the main routine.
I think this is a bug in the compiler. If it were fixed, it would address the original problem hpsmouse was seeing.
-Jeff
Ahh, I see. I had glanced at the asm, but not examined it closely enough. It looks like I'm seeing the same thing. Let me get back to you on that.
Dale
BTW, I did end up filing an issue on this; you can refer to it as cq149191 for future reference. I'll try to post back here when I get any information on it.
Thanks!
Dale
Thank you very much for following this topic for such a long time!
I now believe this is more likely a bug, but there is still something strange. As I said in #6, if I rename the function "main" to "fun", or to any other name, even the vectorization inside operator += is not performed...
BTW, what does that "cq149191" mean? I'm new here.