zhangxiuxia
Beginner
248 Views

Does ICC support __attribute__((aligned(16))) in argument declarations?

I wrote a function, namely

int product(double *__attribute__ ((aligned (16))) A, double *__attribute__ ((aligned (16))) x, double *__attribute__ ((aligned (16))) y, int n)
{
    int i, j;
    for (j = 0; j < 100000; j++)
    {
        for (i = 0; i < n; i++)
        {
            y[i] = y[i] + A[i] * x[i];
        }
    }
    return 0;
}

It fails to compile with icc, but it compiles with gcc. Why?

When I declare a variable using __attribute__((aligned(16))) inside a function, it compiles with icc.





0 Kudos
8 Replies
Aubrey_W_
New Contributor I
248 Views

Hello,

I will move this to our Intel C++ Compiler forum where one of our engineers can assist you.

Best regards,

==
Aubrey W.
Intel Software Network Support
BradleyKuszmaul
Beginner
248 Views

Here are some ways I found to make icc understand that my arguments are aligned.
Here is some code:
[cpp]
void slowproduct (double *A, double *B, double * C) {
#pragma intel simd
    for (int i=0; i<4; i++) {
        C[i] += A[i] * B[i];
    }
}
[/cpp]
The slowproduct code vectorizes, but on my i7-2640M which has AVX it doesn't use the 256-bit registers probably because it doesn't know that A, B, and C are aligned. It produces
[plain]
vmovupd (%rdi), %xmm0              #4.18
vmulpd  (%rsi), %xmm0, %xmm1       #4.18
vaddpd  (%rdx), %xmm1, %xmm2       #4.2
vmovupd %xmm2, (%rdx)              #1.6
vmovupd 16(%rdi), %xmm3            #4.18
vmulpd  16(%rsi), %xmm3, %xmm4     #4.18
vaddpd  16(%rdx), %xmm4, %xmm5     #4.2
vmovupd %xmm5, 16(%rdx)            #1.6
[/plain]
But this version compiles to really good code.
[cpp]
struct d4 {
    double d[4] __attribute__((aligned(16)));
};
void product (struct d4 * A, struct d4 * B, struct d4 *__restrict__ C) {
    for (int i=0; i<4; i++) {
        C->d[i] += A->d[i] * B->d[i];
    }
}
[/cpp]
It produces this for the whole loop:
[plain]
vmovupd (%rdi), %ymm0              #15.24
vmulpd  (%rsi), %ymm0, %ymm1       #15.24
vaddpd  (%rdx), %ymm1, %ymm2       #15.2
vmovupd %ymm2, (%rdx)              #15.2
[/plain]
That is a vector load, a vector multiply, a vector add, and a vector store. I used the following to compile it:
[bash]icc -O2 -std=c99 -xHost -S -o slowprod.S slowprod.c[/bash]
There are several relevant issues:
1) By putting the array into a struct I was able to make the struct be properly aligned. If you want an array of 100000 you may want to declare A as "struct d4 A[25000]" and then write the doubly nested loop.
2) I declared C to be __restrict__ so that the compiler would understand that it can vectorize the code. You can also do
#pragma intel simd
which will tell the compiler to vectorize.
3) I used -xHost to make the compiler produce the fastest code it can for my particular machine. Sandy Bridge has AVX with 256-bit vector registers (4 doubles). Furthermore, Sandy Bridge can issue 8 floating-point operations per cycle per core, so my 2.8GHz laptop can peak at 2.8 x 8 x 2 = 44.8 GFLOPS (using both cores, but with turboboost disabled).
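Point 1 above suggests an array of struct d4 with a doubly nested loop. A minimal sketch of what that could look like follows; the function name product_blocks and the nblocks parameter are made up for illustration, and the struct is the one from the post.

```c
#include <stddef.h>

/* Four doubles grouped in a struct, as in the post, so that the
   compiler knows the alignment of each block. */
struct d4 {
    double d[4] __attribute__((aligned(16)));
};

/* y[b] += A[b] * x[b], element by element, for nblocks blocks of four
   doubles each. product_blocks is a hypothetical name, not from the post.
   __restrict__ promises that y does not overlap A or x. */
void product_blocks(const struct d4 *A, const struct d4 *x,
                    struct d4 *__restrict__ y, size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++) {
        for (int i = 0; i < 4; i++) {
            y[b].d[i] += A[b].d[i] * x[b].d[i];
        }
    }
}
```

For 100000 doubles the caller would declare something like "struct d4 A[25000]" and pass nblocks = 25000. Whether the compiler emits 256-bit instructions for this form still depends on the flags and compiler version.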
Here is another way to get the compiler to generate good code using pragmas.
[cpp]
void fastproduct (double *A, double *B, double * C) {
#pragma vector aligned
#pragma intel simd
    for (int i=0; i<4; i++) {
        C[i] += A[i] * B[i];
    }
}
[/cpp]
Putting it all together, if I write this code
[cpp]
void bigproduct(double *A, double *x, double * y, int n) {
#pragma vector aligned
#pragma intel simd
    for (int i=0; i<n; i++) {
        y[i] = y[i] + A[i]*x[i];
    }
}
[/cpp]
It produces really nice avx instructions for the inner loop, and it unrolls the loop 4 times, producing this inner loop:
[plain]
..B4.12:                                   # Preds ..B4.12 ..B4.11
vmovupd (%rdi,%rax,8), %ymm0               #32.17
vmulpd  (%rsi,%rax,8), %ymm0, %ymm1        #32.17
vaddpd  (%rdx,%rax,8), %ymm1, %ymm2        #32.17
vmovupd %ymm2, (%rdx,%rax,8)               #27.6
vmovupd 32(%rdi,%rax,8), %ymm3             #32.17
vmulpd  32(%rsi,%rax,8), %ymm3, %ymm4      #32.17
vaddpd  32(%rdx,%rax,8), %ymm4, %ymm5      #32.17
vmovupd %ymm5, 32(%rdx,%rax,8)             #27.6
vmovupd 64(%rdi,%rax,8), %ymm6             #32.17
vmulpd  64(%rsi,%rax,8), %ymm6, %ymm7      #32.17
vaddpd  64(%rdx,%rax,8), %ymm7, %ymm8      #32.17
vmovupd %ymm8, 64(%rdx,%rax,8)             #27.6
vmovupd 96(%rdi,%rax,8), %ymm9             #32.17
vmulpd  96(%rsi,%rax,8), %ymm9, %ymm10     #32.17
vaddpd  96(%rdx,%rax,8), %ymm10, %ymm11    #32.17
vmovupd %ymm11, 96(%rdx,%rax,8)            #27.6
addq    $16, %rax                          #31.5
cmpq    %rcx, %rax                         #31.5
jb      ..B4.12       # Prob 82%           #31.5
[/plain]
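As a possible alternative to "#pragma vector aligned", the alignment promise can be attached to each pointer individually. The sketch below is an assumption on my part rather than something tested in this thread: it uses the Intel compiler's __assume_aligned and, on GCC-compatible compilers, the __builtin_assume_aligned builtin. The ASSUME_ALIGNED macro and the function name bigproduct_hinted are hypothetical. The caller must really pass 32-byte-aligned pointers, otherwise the behavior is undefined.

```c
/* ASSUME_ALIGNED is a hypothetical helper macro, not from the thread.
   icc also defines __GNUC__, so the Intel check has to come first. */
#if defined(__INTEL_COMPILER)
#  define ASSUME_ALIGNED(p, n) __assume_aligned((p), (n))
#elif defined(__GNUC__)
   /* The builtin returns void *, which converts implicitly in C
      (C++ would need a cast). */
#  define ASSUME_ALIGNED(p, n) ((p) = __builtin_assume_aligned((p), (n)))
#else
#  define ASSUME_ALIGNED(p, n) ((void)0)
#endif

/* Same loop as bigproduct, with per-pointer alignment hints.
   All three arrays must be 32-byte aligned and y must not overlap
   A or x. */
void bigproduct_hinted(double *A, double *x, double *__restrict__ y, int n)
{
    ASSUME_ALIGNED(A, 32);
    ASSUME_ALIGNED(x, 32);
    ASSUME_ALIGNED(y, 32);
    for (int i = 0; i < n; i++) {
        y[i] = y[i] + A[i] * x[i];
    }
}
```

I have not compared the assembly this produces with the pragma version, so treat it as a starting point for experiments.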
I hope these ideas help.
-Bradley
TimP
Black Belt
248 Views

The 64-bit ABIs provide for default 16-byte alignment of objects large enough to need it (in contexts where the compiler is free to choose alignment). When icc doesn't give as strict alignments as gcc does, it seems to be a compatibility bug. The compiler does give 32-byte alignments already for some situations.
I've heard that a command line option will come with the 13.0 compiler which will allow specification of default alignments, at least up to the alignment required by future architectures, including cache line alignment.
The AVX code where the compiler chooses AVX-128 on account of not knowing alignment is better than the corresponding SSE4 code, but you may have to specify aligned(32) to get the best AVX alignment. Sandy Bridge has some severe performance issues at cache line boundaries with unaligned AVX-256 data. Ivy Bridge is supposed to correct these, but I haven't seen any distinction between them in the compiler.
In my experience, -xhost doesn't produce the fastest code for architectures prior to Sandy Bridge when it translates to -xSSE4.2. That's one of the reasons why some of us don't like -xhost.
BradleyKuszmaul
Beginner
248 Views

  1. I agree that the original post exposes an icc compatibility bug. Icc should accept aligned attributes on array arguments.
  2. Are you saying that I should align things on 32-byte boundaries to get the best avx-256 performance on sandy bridge?
  3. Can you recommend a better compiler flag than -xHost for Nehalem? Is -xHost good for Sandy Bridge, or is there a better choice there too?
-Bradley
TimP
Black Belt
248 Views

2. Yes, if the compiler can be assured of 32-byte alignment, it should give optimum AVX performance. Ivy Bridge should be less critical, but the current compiler tends not to differentiate between them.
3. Depending on the application, I've found either -xSSE4.1 or (less often) -xSSE2 (the default) giving best results on Nehalem and Westmere. -xSSE4.1 uses SSSE3 code in some places where it's beneficial (and SSE4.2 doesn't) on those platforms. According to the published hardware optimization guides, the compiler is doing the right thing with -xHost, but it doesn't always work out. So you may as well use an option which works better on a wider range of architectures.
On Sandy Bridge, -xHost has to be the same as -xAVX, which is the only option for generating AVX code. I haven't tracked all the situations where I've seen AVX code slower than SSE2 in the past; the 12.1 compilers made big improvements there. I still don't have direct access to any Sandy Bridge. Some people have noticed that Ivy Bridge options run OK on Sandy Bridge, but it's likely to be because the code happened to come out identical.
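On point 2, the compiler can only be assured of 32-byte alignment if the data really is allocated that way. A minimal sketch for heap arrays, assuming a POSIX system where posix_memalign is available; the helper names alloc_doubles_32 and is_aligned_32 are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* asks for the posix_memalign declaration */
#endif
#include <stdint.h>
#include <stdlib.h>

/* Allocate n doubles on a 32-byte boundary. Returns NULL on failure.
   alloc_doubles_32 is a hypothetical helper; release with free(). */
double *alloc_doubles_32(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, n * sizeof(double)) != 0) {
        return NULL;
    }
    return (double *)p;
}

/* Nonzero if p sits on a 32-byte boundary. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

Arrays obtained this way can then be passed to a loop compiled with an alignment promise. Static or stack arrays would use __attribute__((aligned(32))) on the declaration instead.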
Judith_W_Intel
Employee
248 Views


I entered this in our bug tracking database as DPD200283679.

Here is the test case I used:

// currently gets a compilation error with icpc but passes with g++

extern "C" int printf(const char*,...);

void foo(double *__attribute__ ((aligned (16)))A)
{
if (__alignof(A) == 16)
printf("PASSED\n");
else
printf("FAILED\n");
}

int main() {
double* A = new double;
foo(A);
return 0;
}

Thank you for reporting this.

Judy

jeff_keasler
Beginner
248 Views

Issue #672743 has been tested, and allows you to do this:

typedef double * __restrict__ __attribute__((align_value (32))) Real_ptr ;

Creating such a typedef in a header file allows vectorization without cluttering your core code with compiler directives and restrict keywords.
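A sketch of how such a header typedef might be used is below. The non-Intel fallback (restrict only, no align_value) is my assumption so that the example also compiles elsewhere, and the kernel name multiply_add is hypothetical.

```c
/* On the Intel compiler, attach restrict and a 32-byte align_value to
   the pointer type, as in the typedef above. The fallback branch is an
   assumption for other compilers and carries no alignment information. */
#if defined(__INTEL_COMPILER)
typedef double * __restrict__ __attribute__((align_value (32))) Real_ptr;
#else
typedef double * __restrict__ Real_ptr;
#endif

/* multiply_add is a hypothetical kernel. Callers must pass 32-byte-aligned,
   non-overlapping arrays; the loop body needs no pragmas or restrict
   keywords of its own. */
void multiply_add(Real_ptr y, Real_ptr A, Real_ptr x, int n)
{
    for (int i = 0; i < n; i++) {
        y[i] += A[i] * x[i];
    }
}
```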


If there is another standardized way of declaring this alignment attribute (as the rest of this thread implies), then perhaps that would be better syntax for this functionality? At any rate, I'm just happy that it works now. :)

BTW, fixing issue #682457 would extend the scope where optimizations for the new typedef introduced in issue #672743 would apply.
JenniferJ
Moderator
248 Views

The issue reported by zhangxiuxia (internal tracker DPD200283679) has been fixed in 13.0. It is available for download from Intel Registration Center.

Jennifer