Solved: Re: Is it so hard to make ICC generate SSE packed arithmetic instructions?

hpsmouse · ‎12-27-2009

I wrote a class to perform basic arithmetic on 3-D vectors, and the SSE instructions seem to improve performance.
The class is simple (some parts of the code are omitted):

[cpp]struct Vector3D
{
  float x, y, z, t;
} __attribute__((aligned (16)));[/cpp]

I added an extra member "t" so that I can treat a Vector3D object as a single XMMWORD.
One of its member functions is simple, too:

[cpp]  Vector3D& Vector3D:: operator += (Vector3D const & ano)
  {
    x += ano.x;
    y += ano.y;
    z += ano.z;
    t += ano.t;
    return *this;
  }[/cpp]

To keep it bit-copyable, neither a copy constructor nor an assignment operator is declared.
When I tried it with ICC, I was disappointed to find a large block of "movss" and "addss" instructions instead of the expected "addps".
So what's the problem?
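
For reference, the layout assumption above (four floats filling one 16-byte-aligned XMMWORD) can be checked at compile time. This is only a minimal sketch, assuming a C++11 compiler and 4-byte floats: `Vec4`, `is_aligned16` and `demo_layout` are hypothetical names, and the standard `alignas(16)` stands in for the `__attribute__((aligned (16)))` used above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Vector3D above; alignas is the C++11 spelling.
struct alignas(16) Vec4
{
    float x, y, z, t;
};

// Compile-time checks of the layout that movaps/addps rely on.
static_assert(sizeof(Vec4) == 16, "four floats must fill exactly one XMMWORD");
static_assert(alignof(Vec4) == 16, "objects must start on a 16-byte boundary");

// Run-time check that a particular object is 16-byte aligned.
inline bool is_aligned16(const Vec4* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Usage check on a stack object and on an array element.
inline void demo_layout()
{
    Vec4 one = {1.0f, 2.0f, 3.0f, 0.0f};
    Vec4 many[3] = {};
    assert(is_aligned16(&one));
    assert(is_aligned16(&many[1]));
    (void)one;
    (void)many;
}
```

If either `static_assert` fires, the struct no longer maps onto a single XMMWORD and packed loads on it would not be safe.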

Michael_K_Intel2 · ‎12-28-2009

Quoting - hpsmouse

When I tried it with ICC, I was disappointed to find a large block of "movss" and "addss" instructions instead of the expected "addps".
So what's the problem?

Hi!

I had a look at the code and it looks like ICC should generate SSE-enabled code for the snippets you have posted. At least my ICC on my box does it:

[plain]00000000004009f0 <_ZN8Vector3DpLERKS_>:
  movaps (%rdi),%xmm0
  addps  (%rsi),%xmm0
  movaps %xmm0,(%rdi)
  mov    %rdi,%rax
  retq
  mov    %rsi,%rsi
[/plain]

As you only provided the definition of "operator+=", I suspect that you've declared the function as virtual. In that case, the floats in your struct are no longer 16-byte aligned (a vtable pointer sits at the beginning of the struct to support virtual dispatch). Hence, the compiler cannot generate packed SSE code for it. If there's a "virtual" on the declaration of "operator+=", please remove it and the compiler will do what you expect.
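
To illustrate the alignment point, here is a small sketch that measures where the first float lands with and without a virtual function. `PlainVec`, `VirtualVec`, `offset_of_x` and `demo_offsets` are made-up names for this example, and the non-zero offset in the virtual case is what the common ABIs produce (typically the size of one pointer), not something the C++ standard promises.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types for illustration; neither appears in the thread.
struct alignas(16) PlainVec
{
    float x, y, z, t;
};

struct alignas(16) VirtualVec
{
    virtual ~VirtualVec() {}
    float x, y, z, t;
};

// Byte offset of member x from the start of the object, measured at run time
// (offsetof is only conditionally supported for polymorphic types).
template <typename T>
std::size_t offset_of_x()
{
    T v;
    const char* base  = reinterpret_cast<const char*>(&v);
    const char* field = reinterpret_cast<const char*>(&v.x);
    assert(field >= base);
    return static_cast<std::size_t>(field - base);
}

// Usage check; the second assertion is an assumption about the ABI in use.
inline void demo_offsets()
{
    assert(offset_of_x<PlainVec>() == 0);         // x starts the object
    assert(offset_of_x<VirtualVec>() % 16 != 0);  // x is pushed off the boundary
}
```

With the floats starting at a non-multiple of 16, an aligned `movaps` on them would fault, so the compiler has to fall back to other code.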

Hope that helps.

Cheers,
-michael

hpsmouse · ‎12-28-2009

Quoting - Michael Klemm, Intel

As you only provided the definition of "operator+=", I suspect that you've declared the function as virtual. In that case, the floats in your struct are no longer 16-byte aligned (a vtable pointer sits at the beginning of the struct to support virtual dispatch). Hence, the compiler cannot generate packed SSE code for it. If there's a "virtual" on the declaration of "operator+=", please remove it and the compiler will do what you expect.

Hope that helps.

Cheers,
-michael

Hello,
Thanks for your help.
But the fact is that I designed it as a POD type, without any virtual functions, and it is bit-copyable.
Just now I ran a test that could not be simpler:

[shell]~/projects/atest$ cat test.hpp
struct Vector3D
{
  float x, y, z, t;
  Vector3D& operator += (Vector3D const& ano);
} __attribute__((aligned(16)));

~/projects/atest$ cat test.cpp
#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& ano)
{
  x += ano.x;
  y += ano.y;
  z += ano.z;
  t += ano.t;
  return *this;
}

~/projects/atest$ icpc -Wall -c test.cpp[/shell]

And this is the result:

[plain]Disassembly of section .text:

0000000000000000 <_ZN8Vector3DpLERKS_>:
   0:	f3 0f 10 07          	movss  xmm0,DWORD PTR [rdi]
   4:	f3 0f 58 06          	addss  xmm0,DWORD PTR [rsi]
   8:	f3 0f 10 4f 04       	movss  xmm1,DWORD PTR [rdi+0x4]
   d:	f3 0f 10 57 08       	movss  xmm2,DWORD PTR [rdi+0x8]
  12:	f3 0f 10 5f 0c       	movss  xmm3,DWORD PTR [rdi+0xc]
  17:	f3 0f 11 07          	movss  DWORD PTR [rdi],xmm0
  1b:	f3 0f 58 4e 04       	addss  xmm1,DWORD PTR [rsi+0x4]
  20:	f3 0f 11 4f 04       	movss  DWORD PTR [rdi+0x4],xmm1
  25:	f3 0f 58 56 08       	addss  xmm2,DWORD PTR [rsi+0x8]
  2a:	48 89 f8             	mov    rax,rdi
  2d:	f3 0f 11 57 08       	movss  DWORD PTR [rdi+0x8],xmm2
  32:	f3 0f 58 5e 0c       	addss  xmm3,DWORD PTR [rsi+0xc]
  37:	f3 0f 11 5f 0c       	movss  DWORD PTR [rdi+0xc],xmm3
  3c:	c3                   	ret    
  3d:	48 89 f6             	mov    rsi,rsi[/plain]

By the way, my system is 64-bit Ubuntu 9.04 on a Core 2 P8400, and the ICC version is 11.1.059. I installed both the IA-32 and the Intel 64 versions, and used the latter for the test.

hpsmouse · ‎12-28-2009

Sorry, the bad connection made me double-click the post button, resulting in two replies...

Michael_K_Intel2 · ‎12-28-2009

Hi!

OK, here we are... If I use the code as you supplied it in your last post, my version of ICC also generates the assembly code you've seen. If I, however, add a simple main function

[cpp]int main(int argc, char** argv) {
    Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
    Vector3D v2 = {2.0, 2.0, 2.0, -2.0};
    v2 += v1;
}
[/cpp]

I get the vectorized code. It looks like the compiler only optimizes the code if the operator is actually called somewhere in the code. I guess you have two options to work around this: (a) add a static dummy function that adds two vectors, or (b) use an intrinsic function to explicitly request an SSE addps in the code.

Cheers,
-michael

Dale_S_Intel · ‎12-28-2009

Actually, I think what's going on here is that the compiler is inlining the "+=" into main, and can therefore see that v1 and v2 do not overlap. With the "-vec-report3" option, the original case reports numerous assumed dependences. Adding the "restrict" keyword (and the requisite "-restrict" command line option) fixes the dependence problems, but the compiler then reports that vectorization may not be efficient. Adding "#pragma vector always" overrides that and converts the adds to a single addps:

[shell]$ cat test2.cpp

#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& restrict ano)
{
#pragma vector always
  x += ano.x;
  y += ano.y;
  z += ano.z;
  t += ano.t;
  return *this;
}

$ icc -S -restrict test2.cpp -vec-report3
test2.cpp(6): (col. 3) remark: BLOCK WAS VECTORIZED.
$ fgrep add test2.s
        addps     %xmm0, %xmm1                                  #6.3
$[/shell]
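
If the "restrict" extension is not an option, a portable way to remove the assumed dependences is to read every input before the first store. This is a different technique from the session above, shown only as a sketch: `Vec4`, `add_in_place` and `demo_add_in_place` are hypothetical names, and whether a given compiler then emits a single addps is not guaranteed.

```cpp
#include <cassert>

// Hypothetical stand-in for Vector3D.
struct alignas(16) Vec4
{
    float x, y, z, t;
};

inline Vec4& add_in_place(Vec4& dst, const Vec4& src)
{
    // Read every input before the first store, so no store to dst can
    // feed a later load from src even if the two references overlap.
    const float x = dst.x + src.x;
    const float y = dst.y + src.y;
    const float z = dst.z + src.z;
    const float t = dst.t + src.t;
    dst.x = x;
    dst.y = y;
    dst.z = z;
    dst.t = t;
    return dst;
}

// Usage check: distinct operands, then the self-aliasing case.
inline void demo_add_in_place()
{
    Vec4 a = {1.0f, 1.0f, 1.0f, -1.0f};
    Vec4 b = {2.0f, 2.0f, 2.0f, -2.0f};
    add_in_place(b, a);
    assert(b.x == 3.0f && b.t == -3.0f);
    add_in_place(b, b);
    assert(b.x == 6.0f && b.t == -6.0f);
}
```

Because all loads are complete before any store, the four additions are independent no matter how the arguments relate, which is the property the vectorizer was unable to prove in the original code.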

Dale

hpsmouse · ‎12-28-2009

Well... the problem seems to be getting complicated...
I tried renaming "main()" to "fun()", and the compiler no longer vectorizes the code.
To force it to do the computation, I moved v1 and v2 to global scope. Although the code in operator += is now vectorized, the copy inlined into main() still is not...

[cpp]#include "test.hpp"
Vector3D& Vector3D::operator += (Vector3D const& ano)
{
  x += ano.x;
  y += ano.y;
  z += ano.z;
  t += ano.t;
  return *this;
}

Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
Vector3D v2 = {2.0, 2.0, 2.0, -2.0};

int main(int argc, char** argv) {
  v2 += v1;
  return 0;
}[/cpp]

[plain]Disassembly of section .text:

0000000000000000 <main>:
   0:	55                   	push   rbp
   1:	48 89 e5             	mov    rbp,rsp
   4:	48 83 e4 80          	and    rsp,0xffffffffffffff80
   8:	48 81 ec 80 00 00 00 	sub    rsp,0x80
   f:	bf 03 00 00 00       	mov    edi,0x3
  14:	e8 00 00 00 00       	call   19 
			15: R_X86_64_PC32	__intel_new_proc_init-0x4
  19:	f3 0f 10 05 00 00 00 	movss  xmm0,DWORD PTR [rip+0x0]        # 21 
  20:	00 
			1d: R_X86_64_PC32	v2-0x4
  21:	f3 0f 58 05 00 00 00 	addss  xmm0,DWORD PTR [rip+0x0]        # 29 
  28:	00 
			25: R_X86_64_PC32	v1-0x4
  29:	f3 0f 10 0d 00 00 00 	movss  xmm1,DWORD PTR [rip+0x0]        # 31 
  30:	00 
			2d: R_X86_64_PC32	v2
  31:	f3 0f 58 0d 00 00 00 	addss  xmm1,DWORD PTR [rip+0x0]        # 39 
  38:	00 
			35: R_X86_64_PC32	v1
  39:	f3 0f 10 15 00 00 00 	movss  xmm2,DWORD PTR [rip+0x0]        # 41 
  40:	00 
			3d: R_X86_64_PC32	v2+0x4
  41:	f3 0f 58 15 00 00 00 	addss  xmm2,DWORD PTR [rip+0x0]        # 49 
  48:	00 
			45: R_X86_64_PC32	v1+0x4
  49:	f3 0f 10 1d 00 00 00 	movss  xmm3,DWORD PTR [rip+0x0]        # 51 
  50:	00 
			4d: R_X86_64_PC32	v2+0x8
  51:	f3 0f 58 1d 00 00 00 	addss  xmm3,DWORD PTR [rip+0x0]        # 59 
  58:	00 
			55: R_X86_64_PC32	v1+0x8
  59:	f3 0f 11 05 00 00 00 	movss  DWORD PTR [rip+0x0],xmm0        # 61 
  60:	00 
			5d: R_X86_64_PC32	v2-0x4
  61:	f3 0f 11 0d 00 00 00 	movss  DWORD PTR [rip+0x0],xmm1        # 69 
  68:	00 
			65: R_X86_64_PC32	v2
  69:	f3 0f 11 15 00 00 00 	movss  DWORD PTR [rip+0x0],xmm2        # 71 
  70:	00 
			6d: R_X86_64_PC32	v2+0x4
  71:	f3 0f 11 1d 00 00 00 	movss  DWORD PTR [rip+0x0],xmm3        # 79 
  78:	00 
			75: R_X86_64_PC32	v2+0x8
  79:	33 c0                	xor    eax,eax
  7b:	0f ae 1c 24          	stmxcsr DWORD PTR [rsp]
  7f:	81 0c 24 40 80 00 00 	or     DWORD PTR [rsp],0x8040
  86:	0f ae 14 24          	ldmxcsr DWORD PTR [rsp]
  8a:	48 89 ec             	mov    rsp,rbp
  8d:	5d                   	pop    rbp
  8e:	c3                   	ret    
  8f:	90                   	nop    

0000000000000090 <_ZN8Vector3DpLERKS_>:
  90:	0f 28 07             	movaps xmm0,XMMWORD PTR [rdi]
  93:	0f 58 06             	addps  xmm0,XMMWORD PTR [rsi]
  96:	0f 29 07             	movaps XMMWORD PTR [rdi],xmm0
  99:	48 89 f8             	mov    rax,rdi
  9c:	c3                   	ret    
  9d:	48 89 f6             	mov    rsi,rsi
[/plain]

It seems ICC does not dare to vectorize the code unless I force it with "#pragma vector always".

Michael_K_Intel2 · ‎12-29-2009

Quoting - hpsmouse

Well... the problem seems to become complicated...

It seems ICC does not dare to vectorize the code unless I force it with "#pragma vector always".

Hi,

Personal note: Compilers are amongst the most complicated parts of software out there. Engineering a good compiler needs a lot of black magic, pardon, experience :-).

But back to your problem: Dale seems to be right. If you would rather not add the pragma to your code, you can still use SSE intrinsics and write C code that maps directly to the assembly you want:

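[cpp]#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps

Vector3D& Vector3D::operator += (Vector3D const& ano)
{
    // Both objects are 16-byte aligned, so each can be viewed as one packed vector.
    __m128 *this128 = (__m128 *) this;
    const __m128 *ano128 = (const __m128 *) &ano;  // keep ano const
    *this128 = _mm_add_ps(*this128, *ano128);      // one packed add for all four floats
    return *this;
}
[/cpp]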

The compiler turns this into the following assembly fragment:

[cpp]0000000000400b60 <_ZN8Vector3DpLERKS_>:
  400b60:       0f 28 07                movaps (%rdi),%xmm0
  400b63:       0f 58 06                addps  (%rsi),%xmm0
  400b66:       0f 29 07                movaps %xmm0,(%rdi)
  400b69:       48 89 f8                mov    %rdi,%rax
  400b6c:       c3                      retq
  400b6d:       48 89 f6                mov    %rsi,%rsi
[/cpp]

Cheers,
-michael

hpsmouse · ‎12-29-2009

Quoting - Michael Klemm (Intel)

Hi,

Personal note: Compilers are amongst the most complicated parts of software out there. Engineering a good compiler needs a lot of black magic, pardon, experience :-).

But back to your problem: Dale seems to be right. If you would rather not add the pragma to your code, you can still use SSE intrinsics and write C code that maps directly to the assembly you want:

Cheers,
-michael

Yes, so this is likely to be the final solution, and I still have to write specialized code for different platforms.
That's OK. It can't be a big problem.
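
One way to keep the platform-specific part small is to hide the intrinsic behind a preprocessor check with a scalar fallback. The following is only a sketch under assumptions: `Vec4`, `add_in_place`, `demo_portable_add` and the `VEC4_USE_SSE` macro are hypothetical names, and the feature test relies on the `__SSE__`, `_M_X64` and `_M_IX86_FP` macros that GCC-compatible compilers and MSVC predefine.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define VEC4_USE_SSE 1   // hypothetical macro name, not from the thread
#else
#  define VEC4_USE_SSE 0
#endif

// Hypothetical stand-in for Vector3D.
struct alignas(16) Vec4
{
    float x, y, z, t;
};

inline Vec4& add_in_place(Vec4& dst, const Vec4& src)
{
#if VEC4_USE_SSE
    // Aligned 128-bit loads are valid because Vec4 is alignas(16).
    const __m128 a = _mm_load_ps(&dst.x);
    const __m128 b = _mm_load_ps(&src.x);
    _mm_store_ps(&dst.x, _mm_add_ps(a, b));
#else
    // Scalar fallback for targets without SSE.
    dst.x += src.x;
    dst.y += src.y;
    dst.z += src.z;
    dst.t += src.t;
#endif
    return dst;
}

// Usage check with the values from the thread's main().
inline void demo_portable_add()
{
    Vec4 a = {1.0f, 1.0f, 1.0f, -1.0f};
    Vec4 b = {2.0f, 2.0f, 2.0f, -2.0f};
    add_in_place(b, a);
    assert(b.x == 3.0f && b.y == 3.0f && b.z == 3.0f && b.t == -3.0f);
}
```

Only the body of `add_in_place` differs per platform, so callers stay the same whichever branch is compiled.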

hpsmouse · ‎12-29-2009

And thanks for your help again~

jeff_keasler · ‎01-01-2010

Would it be possible for the compiler team to add a directive to work around this kind of problem when the dependencies can't be resolved explicitly? The example presented here was small, but in practice, this could have been hundreds of lines of code that would have been hard to convert to SSE intrinsics directly.

Dale_S_Intel · ‎01-04-2010

Quoting - jeff_keasler

Would it be possible for the compiler team to add a directive to work around this kind of problem when the dependencies can't be resolved explicitly? The example presented here was small, but in practice, this could have been hundreds of lines of code that would have been hard to convert to SSE intrinsics directly.

Well, there may be one already. Can you be more specific about exactly what you want the pragma to do?

Dale

jeff_keasler · ‎01-07-2010

Quoting - Dale Schouten (Intel)

Well, there may be one already. Can you be more specific about exactly what you want the pragma to do?

Dale

Well, what I'd really like to understand is why this doesn't work (sorry for the overkill declaration of Vector3D_aligned):

[cpp]struct Vector3D
{
  float x, y, z, t;
  Vector3D& operator += (Vector3D const& ano);
} __attribute__((aligned (16)));

typedef Vector3D __attribute__ ((aligned (16))) Vector3D_aligned ;

Vector3D& Vector3D::operator += (Vector3D const& ano)
{
  Vector3D_aligned * restrict surrogateThis = this ;
  const Vector3D_aligned * restrict surrogateAno = &ano ;
  __assume_aligned(surrogateThis, 16) ;
  __assume_aligned(surrogateAno, 16) ;
  surrogateThis->x += surrogateAno->x;
  surrogateThis->y += surrogateAno->y;
  surrogateThis->z += surrogateAno->z;
  surrogateThis->t += surrogateAno->t;
  return *this;
}

Vector3D v1 = {1.0, 1.0, 1.0, -1.0};
Vector3D v2 = {2.0, 2.0, 2.0, -2.0};

int main(int argc, char** argv) {
  v2 += v1;
  return 0;
}[/cpp]

When I compile the above with -S -O3 -restrict, I get a generated subroutine that never gets used:

[plain]# -- Begin _ZN8Vector3DpLERKS_
# mark_begin;
        .align 16,0x90
        .globl _ZN8Vector3DpLERKS_
_ZN8Vector3DpLERKS_:
# parameter 1: %rdi
# parameter 2: %rsi
..B2.1: # Preds ..B2.0
..___tag_value__ZN8Vector3DpLERKS_.10: #11.1
        movaps (%rdi), %xmm0 #16.3
        addps (%rsi), %xmm0 #16.3
        movaps %xmm0, (%rdi) #16.3
        movq %rdi, %rax #20.10
        ret #20.10
        .align 16,0x90
..___tag_value__ZN8Vector3DpLERKS_.11: #
# LOE
# mark_end;
        .type _ZN8Vector3DpLERKS_,@function
        .size _ZN8Vector3DpLERKS_,.-_ZN8Vector3DpLERKS_
        .data
# -- End _ZN8Vector3DpLERKS_[/plain]

Thanks,
-Jeff

Dale_S_Intel · ‎01-07-2010

Quoting - jeff_keasler

Well, what I'd really like to understand is why this doesn't work (sorry for the overkill declaration of Vector3D_aligned):

Well, as near as I can tell it does work:

$ icpc -c -restrict -vec-report3 test3.cpp
test3.cpp(14): (col. 5) remark: BLOCK WAS VECTORIZED.
$

Where test3.cpp is the code above and in my case line 14 is the beginning of the adds in operator '+='. The reason for the unused function is that it gets inlined into main, but because it's visible outside of this object it needs to exist, in case you linked in another module that calls it.
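
To make the linkage point concrete: if the type lives in an unnamed namespace, its member functions have internal linkage, so no other module can call them and the compiler is free to drop the standalone copy once every call has been inlined. This is only a sketch with made-up names (`Vec4`, `demo_internal_add`); whether a particular compiler actually discards the function is up to its optimizer.

```cpp
#include <cassert>

namespace {  // everything in here has internal linkage

struct alignas(16) Vec4
{
    float x, y, z, t;
    Vec4& operator+=(const Vec4& other);
};

// Defined out of class, as in test.cpp, but invisible to other modules.
Vec4& Vec4::operator+=(const Vec4& other)
{
    x += other.x;
    y += other.y;
    z += other.z;
    t += other.t;
    return *this;
}

}  // namespace

// The only caller in this translation unit. Once the call is inlined here,
// nothing else can reference operator+=, so a standalone copy is not required.
static float demo_internal_add()
{
    Vec4 a = {1.0f, 1.0f, 1.0f, -1.0f};
    Vec4 b = {2.0f, 2.0f, 2.0f, -2.0f};
    b += a;
    assert(b.x == 3.0f && b.t == -3.0f);
    return b.x + b.y + b.z + b.t;
}
```

With external linkage, as in the code above, the out-of-line copy has to stay because the compiler cannot see who else might link against it.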

Does that answer your question?

Dale

jeff_keasler · ‎01-07-2010

Dale,

I'm using icpc version 11.1.064

Change the -c to -S in your compile line to see the assembly language output. In the main routine, it is vectorizing to produce the code in response #2 (inefficient), yet creating the assembly subroutine as described in response #12 that never gets called. In other words, it generates efficient assembly for the C code in response #12, but never calls it or inlines it into the main routine.

I think this is a bug in the compiler. If it were fixed, it would address the original problem hpsmouse was seeing.

-Jeff

Dale_S_Intel · ‎01-08-2010

Ahh, I see. I had glanced at the asm, but not examined it closely enough. It looks like I'm seeing the same thing. Let me get back to you on that.

Dale

Dale_S_Intel · ‎02-05-2010

BTW, I did end up filing an issue on this; you can refer to it as cq149191 for future reference. I'll try to post back here when I get any information on it.

Thanks!

Dale

hpsmouse · ‎02-07-2010

Thank you very much for following this topic for such a long time!

I now believe this is more likely a bug, but there is still something strange. As I said in #6, if I rename the function "main" to "fun", or to any other name, even the vectorization inside operator += is not performed...

BTW, what does that "cq149191" mean? I'm new here.

jeff_keasler · ‎02-07-2010

Quoting Dale Schouten (Intel)

BTW, I did end up filing an issue on this; you can refer to it as cq149191 for future reference. I'll try to post back here when I get any information on it.

Thanks!

Dale

Thank you, and thanks to whoever works on this. This will be good stuff, especially if the simpler code in #6 can be recognized.

TimP · ‎02-08-2010

Quoting hpsmouse

BTW, what does that "cq149191" mean?

That's the tracker number in the ClearQuest tracker application at Intel.
