hpsmouse
Beginner
155 Views

Is it so hard to make ICC generate SSE packed arithmetic instructions?

I wrote a class to perform basic arithmetic on 3-D vectors, and SSE instructions seem likely to improve its performance.
The class is simple (some of the code is omitted):
[cpp]struct Vector3D
{
float x, y, z, t;
} __attribute__((aligned (16)));[/cpp]
I added an extra member "t" so that a Vector3D object occupies exactly one XMMWORD (16 bytes).
One of its member functions is simple, too:
[cpp]  Vector3D& Vector3D:: operator += (Vector3D const & ano)
{
x += ano.x;
y += ano.y;
z += ano.z;
t += ano.t;
return *this;
}[/cpp]
To keep it bit-copyable, neither a copy constructor nor an assignment operator is declared.
When I compiled it with ICC, I was disappointed to find a large block of "movss" and "addss" instructions instead of the expected "addps".
So what's the problem?
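The layout assumption behind the padding member "t" can be checked at compile time. The following is only a sketch: it uses the standard C++11 `alignas(16)` instead of the GCC-style `__attribute__((aligned (16)))` from the post, and `is_xmm_aligned` is a hypothetical helper name introduced for illustration.

```cpp
#include <cassert>   // assert, for run-time checks by callers
#include <cstddef>   // offsetof
#include <cstdint>   // std::uintptr_t

// Same four floats as in the post. alignas(16) is an assumption of this
// sketch; the original code uses __attribute__((aligned (16))).
struct alignas(16) Vector3D
{
    float x, y, z, t;
};

// The struct has to fill exactly one 16-byte XMM register and start on a
// 16-byte boundary, otherwise aligned SSE loads and stores are not usable.
static_assert(sizeof(Vector3D) == 16, "Vector3D must be exactly 16 bytes");
static_assert(alignof(Vector3D) == 16, "Vector3D must be 16-byte aligned");
static_assert(offsetof(Vector3D, x) == 0, "x must be the first member");

// Hypothetical helper: true if the object sits on a 16-byte boundary.
inline bool is_xmm_aligned(const Vector3D* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

If any of the `static_assert` lines fails, the struct no longer maps onto a single XMM register, and a packed `addps` on it is not an option for the compiler.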
19 Replies
Michael_K_Intel2
Employee
155 Views
Quoting - hpsmouse
When I compiled it with ICC, I was disappointed to find a large block of "movss" and "addss" instructions instead of the expected "addps".
So what's the problem?

Hi!

I had a look at the code and it looks like ICC should generate SSE-enabled code for the snippets you have posted. At least my ICC on my box does it:

00000000004009f0 <_ZN8Vector3DpLERKS_>:
movaps (%rdi),%xmm0
addps (%rsi),%xmm0
movaps %xmm0,(%rdi)
mov %rdi,%rax
retq
mov %rsi,%rsi

As you only provided the definition of "operator+=", I suspect that you've declared the function as a virtual function. In this case, the floats in your struct are no longer aligned (there's a hidden pointer at the beginning of the struct needed for C++ virtual dispatch). Hence, the compiler cannot generate packed SSE code for your function. If there's a "virtual" on the declaration of "operator+=", please remove it and the compiler will do what you expect.
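The effect of a virtual function on the layout can be made visible with a small sketch. It assumes a typical ABI that places the vtable pointer at the start of the object (the Itanium C++ ABI used by ICC and GCC on Linux does this); `PlainVec`, `VirtualVec` and `offset_of_x` are hypothetical names used only here.

```cpp
#include <cassert>   // assert, for run-time checks by callers
#include <cstddef>   // std::size_t

struct alignas(16) PlainVec
{
    float x, y, z, t;
    PlainVec& operator+=(PlainVec const& ano);   // non-virtual: no hidden members
};

struct alignas(16) VirtualVec
{
    float x, y, z, t;
    virtual ~VirtualVec() {}   // any virtual function adds a hidden vtable pointer
};

// Byte offset of member x from the start of the object, measured at run time
// (offsetof is only conditionally supported for non-standard-layout types).
template <typename V>
std::size_t offset_of_x(const V& v)
{
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&v.x) - reinterpret_cast<const char*>(&v));
}
```

On a 64-bit Linux build one would expect `offset_of_x` to return 0 for `PlainVec` and 8 for `VirtualVec`, so the four floats of the virtual variant no longer start on a 16-byte boundary even though the object itself does.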

Hope that helps.

Cheers,
-michael






hpsmouse
Beginner
155 Views
Quoting - Michael_K_Intel2
As you only provided the definition of "operator+=", I suspect that you've declared the function as a virtual function. In this case, the floats in your struct are no longer aligned (there's a hidden pointer at the beginning of the struct needed for C++ virtual dispatch). Hence, the compiler cannot generate packed SSE code for your function. If there's a "virtual" on the declaration of "operator+=", please remove it and the compiler will do what you expect.

Hope that helps.

Cheers,
-michael

Hello,
Thanks for your help.
But in fact I designed it as a POD type: it has no virtual functions and is bit-copyable.
I just ran a test that could hardly be simpler:
[shell]~/projects/atest$ cat test.hpp
struct Vector3D
{
float x, y, z, t;
Vector3D& operator += (Vector3D const& ano);
} __attribute__((aligned(16)));

~/projects/atest$ cat test.cpp
#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& ano)
{
x += ano.x;
y += ano.y;
z += ano.z;
t += ano.t;
return *this;
}

~/projects/atest$ icpc -Wall -c test.cpp[/shell]
And this is the result:
[plain]Disassembly of section .text:

0000000000000000 <_ZN8Vector3DpLERKS_>:
0: f3 0f 10 07 movss xmm0,DWORD PTR [rdi]
4: f3 0f 58 06 addss xmm0,DWORD PTR [rsi]
8: f3 0f 10 4f 04 movss xmm1,DWORD PTR [rdi+0x4]
d: f3 0f 10 57 08 movss xmm2,DWORD PTR [rdi+0x8]
12: f3 0f 10 5f 0c movss xmm3,DWORD PTR [rdi+0xc]
17: f3 0f 11 07 movss DWORD PTR [rdi],xmm0
1b: f3 0f 58 4e 04 addss xmm1,DWORD PTR [rsi+0x4]
20: f3 0f 11 4f 04 movss DWORD PTR [rdi+0x4],xmm1
25: f3 0f 58 56 08 addss xmm2,DWORD PTR [rsi+0x8]
2a: 48 89 f8 mov rax,rdi
2d: f3 0f 11 57 08 movss DWORD PTR [rdi+0x8],xmm2
32: f3 0f 58 5e 0c addss xmm3,DWORD PTR [rsi+0xc]
37: f3 0f 11 5f 0c movss DWORD PTR [rdi+0xc],xmm3
3c: c3 ret
3d: 48 89 f6 mov rsi,rsi[/plain]
By the way, my system is 64-bit Ubuntu 9.04 on a Core 2 P8400, and the ICC version is 11.1.059. I installed both the IA-32 and the Intel 64 versions and used the latter for the test.
hpsmouse
Beginner
155 Views
Sorry, the bad connection made me double-click the post button, resulting in two replies...
Michael_K_Intel2
Employee
155 Views
Hi!

OK, here we are... If I use the code as you have supplied it in your last post, my version of ICC also generates the assembly code you've seen. If I, however, add a simple main method
[cpp]int main(int argc, char** argv) {
    Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
    Vector3D v2 = {2.0, 2.0, 2.0, -2.0};
    v2 += v1;
}
[/cpp]
I get the vectorized code. It looks like the compiler only optimizes the code if the operator is actually called somewhere. I guess you have two options to work around this: (a) add a static dummy function that adds two vectors, or (b) use an intrinsic function to explicitly request an SSE addps in the code.

Cheers,
-michael

Dale_S_Intel
Employee
155 Views
Quoting - Michael_K_Intel2

Hi!

OK, here we are... If I use the code as you have supplied it in your last post, my version of ICC also generates the assembly code you've seen. If I, however, add a simple main method
[cpp]int main(int argc, char** argv) {
Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
Vector3D v2 = {2.0, 2.0, 2.0, -2.0};
v2 += v1;
}
[/cpp]
I get the vectorized code. It looks like the compiler only optimizes the code if the operator is actually called somewhere. I guess you have two options to work around this: (a) add a static dummy function that adds two vectors, or (b) use an intrinsic function to explicitly request an SSE addps in the code.

Cheers,
-michael


Actually, I think what's going on here is that the compiler inlines the "+=" into main and can therefore see that v1 and v2 do not overlap. With the "-vec-report3" option, the original (non-inlined) case reports numerous assumed dependences. Adding the "restrict" keyword (and the requisite "-restrict" command-line option) fixes the dependence problems, but the compiler then reports that vectorization may not be efficient. Adding "#pragma vector always" fixes that, and the adds are converted to a single addps:

$ cat test2.cpp

#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& restrict ano)
{
#pragma vector always
x += ano.x;
y += ano.y;
z += ano.z;
t += ano.t;
return *this;
}

$ icc -S -restrict test2.cpp -vec-report3
test2.cpp(6): (col. 3) remark: BLOCK WAS VECTORIZED.
$ fgrep add test2.s
addps %xmm0, %xmm1 #6.3
$
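Why assumed dependences block vectorization can be illustrated with plain float pointers. This is a sketch, not ICC's actual analysis: `add4_sequential` and `add4_load_first` are hypothetical names, and the second function merely emulates the load-everything-then-store behaviour of a single addps.

```cpp
#include <cassert>   // assert
#include <cstddef>   // std::size_t

// Element-by-element add, as the C++ source specifies it:
// each store to a[i] is visible to later loads from b.
inline void add4_sequential(float* a, const float* b)
{
    assert(a != nullptr && b != nullptr);
    for (std::size_t i = 0; i < 4; ++i)
        a[i] += b[i];
}

// Emulates one packed add: all four operands are loaded first,
// then added, then all four results are stored.
inline void add4_load_first(float* a, const float* b)
{
    assert(a != nullptr && b != nullptr);
    float tmp[4];
    for (std::size_t i = 0; i < 4; ++i)
        tmp[i] = b[i];
    for (std::size_t i = 0; i < 4; ++i)
        a[i] += tmp[i];
}
```

With disjoint arrays both functions give the same result. If `a` points one element past `b` in the same buffer, they differ, because the sequential version reads values it has just written. A compiler that cannot rule out such overlap has to keep the scalar order, which is what `restrict` lets the programmer rule out.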


Dale

hpsmouse
Beginner
155 Views
Well... the problem seems to be getting complicated...
I tried renaming "main()" to "fun()", and the compiler no longer vectorizes the code.
To force it to do the computation, I made v1 and v2 global variables. Although the code in operator += is now vectorized, the copy inlined into main() still is not...
[cpp]#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& ano)
{
  x += ano.x;
  y += ano.y;
  z += ano.z;
  t += ano.t;
  return *this;
}

Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
Vector3D v2 = {2.0, 2.0, 2.0, -2.0};

int main(int argc, char** argv) {
  v2 += v1;
  return 0;
}[/cpp]
[plain]Disassembly of section .text:

0000000000000000 <main>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 e4 80 and rsp,0xffffffffffffff80
8: 48 81 ec 80 00 00 00 sub rsp,0x80
f: bf 03 00 00 00 mov edi,0x3
14: e8 00 00 00 00 call 19
15: R_X86_64_PC32 __intel_new_proc_init-0x4
19: f3 0f 10 05 00 00 00 movss xmm0,DWORD PTR [rip+0x0] # 21
20: 00
1d: R_X86_64_PC32 v2-0x4
21: f3 0f 58 05 00 00 00 addss xmm0,DWORD PTR [rip+0x0] # 29
28: 00
25: R_X86_64_PC32 v1-0x4
29: f3 0f 10 0d 00 00 00 movss xmm1,DWORD PTR [rip+0x0] # 31
30: 00
2d: R_X86_64_PC32 v2
31: f3 0f 58 0d 00 00 00 addss xmm1,DWORD PTR [rip+0x0] # 39
38: 00
35: R_X86_64_PC32 v1
39: f3 0f 10 15 00 00 00 movss xmm2,DWORD PTR [rip+0x0] # 41
40: 00
3d: R_X86_64_PC32 v2+0x4
41: f3 0f 58 15 00 00 00 addss xmm2,DWORD PTR [rip+0x0] # 49
48: 00
45: R_X86_64_PC32 v1+0x4
49: f3 0f 10 1d 00 00 00 movss xmm3,DWORD PTR [rip+0x0] # 51
50: 00
4d: R_X86_64_PC32 v2+0x8
51: f3 0f 58 1d 00 00 00 addss xmm3,DWORD PTR [rip+0x0] # 59
58: 00
55: R_X86_64_PC32 v1+0x8
59: f3 0f 11 05 00 00 00 movss DWORD PTR [rip+0x0],xmm0 # 61
60: 00
5d: R_X86_64_PC32 v2-0x4
61: f3 0f 11 0d 00 00 00 movss DWORD PTR [rip+0x0],xmm1 # 69
68: 00
65: R_X86_64_PC32 v2
69: f3 0f 11 15 00 00 00 movss DWORD PTR [rip+0x0],xmm2 # 71
70: 00
6d: R_X86_64_PC32 v2+0x4
71: f3 0f 11 1d 00 00 00 movss DWORD PTR [rip+0x0],xmm3 # 79
78: 00
75: R_X86_64_PC32 v2+0x8
79: 33 c0 xor eax,eax
7b: 0f ae 1c 24 stmxcsr DWORD PTR [rsp]
7f: 81 0c 24 40 80 00 00 or DWORD PTR [rsp],0x8040
86: 0f ae 14 24 ldmxcsr DWORD PTR [rsp]
8a: 48 89 ec mov rsp,rbp
8d: 5d pop rbp
8e: c3 ret
8f: 90 nop

0000000000000090 <_ZN8Vector3DpLERKS_>:
90: 0f 28 07 movaps xmm0,XMMWORD PTR [rdi]
93: 0f 58 06 addps xmm0,XMMWORD PTR [rsi]
96: 0f 29 07 movaps XMMWORD PTR [rdi],xmm0
99: 48 89 f8 mov rax,rdi
9c: c3 ret
9d: 48 89 f6 mov rsi,rsi[/plain]
It seems ICC does not dare to vectorize the code unless I force it with "#pragma vector always".
Michael_K_Intel2
Employee
156 Views
Quoting - hpsmouse
Well... the problem seems to become complicated...

It seems ICC does not dare to vectorize the code unless I force it with "#pragma vector always".

Hi,

Personal note: Compilers are amongst the most complicated parts of software out there. Engineering a good compiler needs a lot of black magic, pardon, experience :-).

But back to your problem: Dale seems to be right. If you would rather not add the pragma to your code, you can still use SSE intrinsics and write C code that maps directly to the assembly you want:

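[cpp]#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps

Vector3D& Vector3D::operator += (Vector3D const& ano)
{
    // Vector3D is 16 bytes and 16-byte aligned, so it can be viewed as one __m128.
    __m128 *this128 = (__m128 *) this;
    const __m128 *ano128 = (const __m128 *) &ano;
    *this128 = _mm_add_ps(*this128, *ano128);  // one packed add for x, y, z, t
    return *this;
}
[/cpp]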
The compiler translates this into the following assembly code fragment:

[cpp]0000000000400b60 <_ZN8Vector3DpLERKS_>:
  400b60:       0f 28 07                movaps (%rdi),%xmm0
  400b63:       0f 58 06                addps  (%rsi),%xmm0
  400b66:       0f 29 07                movaps %xmm0,(%rdi)
  400b69:       48 89 f8                mov    %rdi,%rax
  400b6c:       c3                      retq
  400b6d:       48 89 f6                mov    %rsi,%rsi
[/cpp]
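A variant of the same idea that avoids the pointer casts is sketched below. It is not the code from the post: it assumes an x86 target with SSE enabled, defines the operator inside the class for brevity, and uses `alignas(16)` in place of the GCC-style attribute. `_mm_load_ps` and `_mm_store_ps` both require 16-byte-aligned addresses.

```cpp
#include <cassert>       // assert
#include <cstdint>       // std::uintptr_t
#include <xmmintrin.h>   // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

struct alignas(16) Vector3D
{
    float x, y, z, t;

    Vector3D& operator+=(Vector3D const& ano)
    {
        // Aligned loads and stores fault on misaligned addresses, so check first.
        assert(reinterpret_cast<std::uintptr_t>(this) % 16 == 0);
        assert(reinterpret_cast<std::uintptr_t>(&ano) % 16 == 0);

        __m128 lhs = _mm_load_ps(&x);       // the four contiguous floats x, y, z, t
        __m128 rhs = _mm_load_ps(&ano.x);
        _mm_store_ps(&x, _mm_add_ps(lhs, rhs));
        return *this;
    }
};
```

With the values from the thread, `v2 += v1` for v1 = {1, 1, 1, -1} and v2 = {2, 2, 2, -2} leaves v2 as {3, 3, 3, -3}.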

Cheers,
-michael





hpsmouse
Beginner
155 Views

Quoting - Michael_K_Intel2

Hi,

Personal note: Compilers are amongst the most complicated parts of software out there. Engineering a good compiler needs a lot of black magic, pardon, experience :-).

But back to your problem: Dale seems to be right. If you would rather not add the pragma to your code, you can still use SSE intrinsics and write C code that maps directly to the assembly you want:

Cheers,
-michael


Yes, so this is likely to be the final solution, and I will still have to write specialized code for different platforms.
That's OK; it shouldn't be a big problem.
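One common way to keep the platform-specific part small is a preprocessor guard with a scalar fallback. The sketch below is an assumption about how this could look, not code from the thread; `VEC3D_USE_SSE` is a hypothetical macro name, while `__SSE__`, `_M_X64` and `_M_IX86_FP` are the macros that GCC-compatible compilers and MSVC predefine when SSE is available.

```cpp
#include <cassert>   // assert, for checks by callers

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define VEC3D_USE_SSE 1   // hypothetical macro, used only in this sketch
#else
#define VEC3D_USE_SSE 0
#endif

struct alignas(16) Vector3D
{
    float x, y, z, t;

    Vector3D& operator+=(Vector3D const& ano)
    {
#if VEC3D_USE_SSE
        // One packed add over the four contiguous, 16-byte-aligned floats.
        _mm_store_ps(&x, _mm_add_ps(_mm_load_ps(&x), _mm_load_ps(&ano.x)));
#else
        // Portable scalar fallback for targets without SSE.
        x += ano.x;
        y += ano.y;
        z += ano.z;
        t += ano.t;
#endif
        return *this;
    }
};
```

Callers see the same `operator+=` on every platform; only the body changes.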
hpsmouse
Beginner
155 Views
And thanks for your help again~
jeff_keasler
Beginner
155 Views
Would it be possible for the compiler team to add a directive to work around this kind of problem when the dependencies can't be resolved explicitly? The example presented here was small, but in practice, this could have been hundreds of lines of code that would have been hard to convert to SSE intrinsics directly.
Dale_S_Intel
Employee
155 Views
Quoting - jeff_keasler
Would it be possible for the compiler team to add a directive to work around this kind of problem when the dependencies can't be resolved explicitly? The example presented here was small, but in practice, this could have been hundreds of lines of code that would have been hard to convert to SSE intrinsics directly.

Well, there may be one already. Can you be more specific about exactly what you want the pragma to do?

Dale

jeff_keasler
Beginner
155 Views

Quoting - Dale_S_Intel

Well, there may be one already. Can you be more specific about exactly what you want the pragma to do?

Dale


Well, what I'd really like to understand is why this doesn't work (sorry for the overkill declaration of Vector3D_aligned):

struct Vector3D
{
float x, y, z, t;
Vector3D& operator += (Vector3D const& ano);
} __attribute__((aligned (16)));

typedef Vector3D __attribute__ ((aligned (16))) Vector3D_aligned ;

Vector3D& Vector3D::operator += (Vector3D const& ano)
{
Vector3D_aligned * restrict surrogateThis = this ;
const Vector3D_aligned * restrict surrogateAno = &ano ;
__assume_aligned(surrogateThis, 16) ;
__assume_aligned(surrogateAno, 16) ;
surrogateThis->x += surrogateAno->x;
surrogateThis->y += surrogateAno->y;
surrogateThis->z += surrogateAno->z;
surrogateThis->t += surrogateAno->t;
return *this;
}

Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
Vector3D v2 = {2.0, 2.0, 2.0, -2.0};

int main(int argc, char** argv) {
v2 += v1;
return 0;
}

When I compile the above with -S -O3 -restrict, I get a subroutine generated that never gets used:

# -- Begin _ZN8Vector3DpLERKS_
# mark_begin;
.align 16,0x90
.globl _ZN8Vector3DpLERKS_
_ZN8Vector3DpLERKS_:
# parameter 1: %rdi
# parameter 2: %rsi
..B2.1: # Preds ..B2.0
..___tag_value__ZN8Vector3DpLERKS_.10: #11.1
movaps (%rdi), %xmm0 #16.3
addps (%rsi), %xmm0 #16.3
movaps %xmm0, (%rdi) #16.3
movq %rdi, %rax #20.10
ret #20.10
.align 16,0x90
..___tag_value__ZN8Vector3DpLERKS_.11: #
# LOE
# mark_end;
.type _ZN8Vector3DpLERKS_,@function
.size _ZN8Vector3DpLERKS_,.-_ZN8Vector3DpLERKS_
.data
# -- End _ZN8Vector3DpLERKS_

Thanks,
-Jeff
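For readers without ICC, the two promises made in the code above (no aliasing, 16-byte alignment) can be expressed with GCC/Clang extensions. This is a sketch under that assumption: `__restrict__` and `__builtin_assume_aligned` stand in for ICC's `restrict` and `__assume_aligned`, and `add4_aligned` is a hypothetical function name. Whether a given compiler then emits a single addps is not guaranteed.

```cpp
#include <cassert>   // assert, for checks by callers

// Adds four floats in place. The caller promises that both pointers are
// 16-byte aligned and that the two ranges do not overlap.
// __restrict__ and __builtin_assume_aligned are GCC/Clang extensions.
inline void add4_aligned(float* __restrict__ dst, const float* __restrict__ src)
{
    float* d = static_cast<float*>(__builtin_assume_aligned(dst, 16));
    const float* s = static_cast<const float*>(__builtin_assume_aligned(src, 16));
    for (int i = 0; i < 4; ++i)
        d[i] += s[i];
}
```

Breaking either promise at a call site is undefined behaviour, so this form only makes sense where the caller controls both allocations.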
Dale_S_Intel
Employee
155 Views
Quoting - jeff_keasler

Well, what I'd really like to understand is why this doesn't work (sorry for the overkill declaration of Vector3D_aligned):



Well, as near as I can tell it does work:

$ icpc -c -restrict -vec-report3 test3.cpp
test3.cpp(14): (col. 5) remark: BLOCK WAS VECTORIZED.
$

Where test3.cpp is the code above and in my case line 14 is the beginning of the adds in operator '+='. The reason for the unused function is that it gets inlined into main, but because it's visible outside of this object it needs to exist, in case you linked in another module that calls it.
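The linkage point can be sketched as follows, assuming the operator may be moved into the header. A member function defined inside the class body is implicitly inline, so the compiler only needs to emit a standalone copy where a call is not inlined or the function's address is taken. This concerns only the unused out-of-line copy; it makes no claim about whether the inlined body gets vectorized.

```cpp
#include <cassert>   // assert, for checks by callers

struct alignas(16) Vector3D
{
    float x, y, z, t;

    // Defined in the class body, hence implicitly inline: every translation
    // unit that includes the header sees the body, and no externally visible
    // copy has to be kept just in case another module calls it.
    Vector3D& operator+=(Vector3D const& ano)
    {
        x += ano.x;
        y += ano.y;
        z += ano.z;
        t += ano.t;
        return *this;
    }
};
```

The version in the thread, declared in test.hpp and defined in test.cpp, has external linkage instead, which is why its out-of-line copy must exist even when main inlines the call.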

Does that answer your question?

Dale

jeff_keasler
Beginner
155 Views

Quoting - Dale_S_Intel

Well, as near as I can tell it does work:

$ icpc -c -restrict -vec-report3 test3.cpp
test3.cpp(14): (col. 5) remark: BLOCK WAS VECTORIZED.
$

Where test3.cpp is the code above and in my case line 14 is the beginning of the adds in operator '+='. The reason for the unused function is that it gets inlined into main, but because it's visible outside of this object it needs to exist, in case you linked in another module that calls it.

Does that answer your question?

Dale


Dale,

I'm using icpc version 11.1.064

Change the -c to -S in your compile line to see the assembly language output. In the main routine, it is vectorizing to produce the code in response #2 (inefficient), yet creating the assembly subroutine as described in response #12 that never gets called. In other words, it generates efficient assembly for the C code in response #12, but never calls it or inlines it into the main routine.

I think this is a bug in the compiler. If it were fixed, it would address the original problem hpsmouse was seeing.

-Jeff
Dale_S_Intel
Employee
155 Views
Quoting - jeff_keasler

Dale,

I'm using icpc version 11.1.064

Change the -c to -S in your compile line to see the assembly language output. In the main routine, it is vectorizing to produce the code in response #2 (inefficient), yet creating the assembly subroutine as described in response #12 that never gets called. In other words, it generates efficient assembly for the C code in response #12, but never calls it or inlines it into the main routine.

I think this is a bug in the compiler. If it were fixed, it would address the original problem hpsmouse was seeing.

-Jeff

Ahh, I see. I had glanced at the asm, but not examined it closely enough. It looks like I'm seeing the same thing. Let me get back to you on that.

Dale
Dale_S_Intel
Employee
155 Views

BTW, I did end up filing an issue on this; you can refer to it as cq149191 for future reference. I'll try to post back here when I get any information on it.

Thanks!

Dale

hpsmouse
Beginner
155 Views

Thank you very much for following this topic for such a long time!

I now believe this is more likely a bug, but there is still something strange. As I said in #6, if I rename the function "main" to "fun" or any other name, even the code in operator += is no longer vectorized...

BTW, what does that "cq149191" mean? I'm new here.

jeff_keasler
Beginner
155 Views

Quoting - Dale_S_Intel

BTW, I did end up filing an issue on this; you can refer to it as cq149191 for future reference. I'll try to post back here when I get any information on it.

Thanks!

Dale

Thank you, and thanks to whoever works on this. This will be good stuff, especially if the simpler code in #6 can be recognized.
TimP
Black Belt
155 Views
Quoting hpsmouse

BTW, what does that "cq149191" mean?

That's the tracker number in the ClearQuest tracker application at Intel.