Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

OpenMP and Vectorization Problem

Ashkan_T_
Beginner

Hi,

I am using a simple ikj triple loop to compute a matrix multiplication. The Intel compiler icpc (ICC) 14.0.2 20140120 is used.

Suppose that in the two following cases the number of threads is 1 (no parallel for is used yet!).

1- If I use a #pragma omp parallel, the compiled code seems to be vectorized; that is what -vec-report6 tells me. But the running time is equal to the non-vectorized case:

MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has aligned access
MATMUL.cc(71): (col. 4) remark: vectorization support: unroll factor set to 4
MATMUL.cc(71): (col. 4) remark: LOOP WAS VECTORIZED
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has unaligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: unaligned access used inside loop body
MATMUL.cc(71): (col. 4) remark: REMAINDER LOOP WAS VECTORIZED

 

2- On the other hand, if I simply remove the #pragma omp parallel, this message is printed out by -vec-report6:

MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has aligned access
MATMUL.cc(71): (col. 4) remark: vectorization support: unroll factor set to 4
MATMUL.cc(71): (col. 4) remark: LOOP WAS VECTORIZED
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has unaligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: unaligned access used inside loop body
MATMUL.cc(71): (col. 4) remark: REMAINDER LOOP WAS VECTORIZED
MATMUL.cc(71): (col. 4) remark: loop skipped: multiversioned

Although it says "loop skipped: multiversioned", and I am not sure what exactly that means, the running time is roughly 6x better, which implies proper vectorization. Using #pragma omp simd does not change the results.

void MatMul_Par(float* A, float* B, float* C) {
    //#pragma omp parallel shared(A,B,C)
    {
        for (int i = 0; i < N; i++) {
            for (int k = 0; k < N; k++) {
                float temp = A[i*N+k];
                //#pragma omp simd
                for (int j = 0; j < N; j++) {
                    C[i*N+j] += temp * B[k*N+j];
                }
            }
        }
    } // parallel
}

PS: The problem does not exist when using Intel Cilk Plus, etc. It seems to be related to the parallel pragma in OpenMP.

9 Replies
TimP
Honored Contributor III

There's no point in the omp parallel unless you make the outer loop an omp for (parallel) loop.  

As long as you get vectorization, there's little need to designate an omp simd loop. Versioning might be a response to your omission of the __restrict qualifier or other means of asserting non-overlap of input and output data. Are you relying on inter-procedural optimizations for that purpose?
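For example (a sketch, not the code posted above; the function name MatMul_Restrict is just for illustration), asserting non-overlap directly on the parameters would look like this:

// Sketch only: same loop nest as above, with __restrict added to the pointer
// parameters to assert that A, B and C never alias, so the compiler does not
// need to generate a multiversioned (aliasing vs. non-aliasing) loop.
void MatMul_Restrict(float* __restrict A, float* __restrict B, float* __restrict C) {
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            float temp = A[i*N+k];
            for (int j = 0; j < N; j++) {
                C[i*N+j] += temp * B[k*N+j];
            }
        }
    }
}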

Normally, you want an outer threaded loop and an inner vector (simd) loop.  Depending on what loop counts the compiler is optimizing for, it may choose some cache blocking.  However, you would use the MKL library provided with icc if your matrices are large enough to get full benefit from such optimizations.
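For reference, a sketch (not code from this thread; the function name MatMul_MKL is just for illustration) of the same C += A*B product done through MKL's cblas_sgemm, assuming row-major N x N matrices as in the code above; link with the -mkl compiler option:

#include <mkl.h>

// Sketch: C = 1.0*A*B + 1.0*C, i.e. accumulate the product into C.
void MatMul_MKL(const float* A, const float* B, float* C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,        // m, n, k
                1.0f, A, N,     // alpha, A, lda
                B, N,           // B, ldb
                1.0f, C, N);    // beta (accumulate into C), C, ldc
}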

As your vector report line numbers don't match your sample, and you haven't said much about your goals or compiler options,  I don't think there's much more to say.

Ashkan_T_
Beginner

Thanks! I have found the problem: using dynamic arrays inside a function that uses vectorization causes it.

There are definitely some issues with vectorization when using OpenMP. I will try to post it separately as a complete example.

Shenghong_G_Intel

Hi Ashkan,

Looks like your code has some issues; you should use "omp parallel for" instead of "omp parallel", as they have different meanings.

There should be no issue using parallelism together with vectorization. The Cilk Plus version of the code would be:

void MatMul_Par(float* A, float* B, float* C) {
    cilk_for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            float temp = A[i*N+k];
            #pragma simd
            for (int j = 0; j < N; j++) {
                C[i*N+j] += temp * B[k*N+j];
            }
        }
    }
}

And the equivalent OpenMP version will be:

void MatMul_Par(float* A, float* B, float* C) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            float temp = A[i*N+k];
            #pragma simd
            for (int j = 0; j < N; j++) {
                C[i*N+j] += temp * B[k*N+j];
            }
        }
    }
}

You may provide the complete example if you still have issues, as you mentioned.

Thanks,

Shenghong

Ashkan_T_
Beginner

Hi Shenghong,

Thanks a lot for your answer. The problem is that I cannot edit the original post, so it may be confusing for people; that's why I wanted to put this in a separate post. But let's discuss it here!

I know the difference between parallel and parallel for, but my point is that the code above (as is) should still be able to vectorize the innermost loop.

Now, let's forget about my code and explore your code. I want you to create the arrays dynamically and pass them to the MatMul_Par function. I have done so; this is what happens:

#include <stdio.h>
#include <omp.h>
#define N 4096  

void MatMul_Par(double* A, double* B, double* C) {
 #pragma omp parallel for
 for (int i=0;i<N;i++) {
  for(int k=0;k<N;k++) {
    double temp = A[i*N+k];
    #pragma simd
    for(int j=0;j<N;j++) {
        C[i*N+j] += temp * B[k*N+j];
    }
  }
 }
return;
}

int main(int argc, char* argv[]){
 double *a, *b, *c;
 a = new double[N*N];
 b = new double[N*N];
 c = new double[N*N];

 for(int i=0;i<N;i++)
  for(int j=0;j<N;j++) {
      a[i*N+j]=double(i+j);
      b[i*N+j]=double(i-j);
      c[i*N+j]=0.00;
  }
 omp_set_num_threads(240);
 MatMul_Par(a,b,c);
 delete[] a; delete[] b; delete[] c;
return 0;
}

This is the result of compilation.

icpc -mmic -openmp -no-offload -vec-report2 -Wall -O2 -std=c++0x intel_q.cc -o intel_q
intel_q.cc(26): (col. 3) remark: LOOP WAS VECTORIZED
intel_q.cc(25): (col. 2) remark: loop was not vectorized: not inner loop
intel_q.cc(32): (col. 2) remark: SIMD LOOP WAS VECTORIZED
intel_q.cc(32): (col. 2) remark: loop was not vectorized: not inner loop
intel_q.cc(32): (col. 2) remark: loop was not vectorized: not inner loop
intel_q.cc(11): (col. 5) remark: SIMD LOOP WAS VECTORIZED
intel_q.cc(8): (col. 3) remark: loop was not vectorized: not inner loop
intel_q.cc(7): (col. 2) remark: loop was not vectorized: not inner loop

 

As you can see, it says the loop at line 32 is vectorized (or not vectorized!). The thing is that there is no loop at line 32: it is the function call!

As a result, the code is not vectorized properly. How do I know? By measuring the runtime on the Xeon Phi. The problem can be resolved if I do not use the function call and instead put the code inline after the initialization. You can try it!
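To make "inline" concrete, this is roughly the variant I mean (a sketch, not the exact file I timed; same allocation and initialization, with the loop nest placed directly in main() instead of behind the function call):

#include <omp.h>
#define N 4096

int main() {
    double *a = new double[N*N];
    double *b = new double[N*N];
    double *c = new double[N*N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i*N+j] = double(i+j);
            b[i*N+j] = double(i-j);
            c[i*N+j] = 0.0;
        }

    omp_set_num_threads(240);

    // Loop nest moved directly into main() -- no call to MatMul_Par().
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double temp = a[i*N+k];
            #pragma simd
            for (int j = 0; j < N; j++)
                c[i*N+j] += temp * b[k*N+j];
        }

    delete[] a; delete[] b; delete[] c;
    return 0;
}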

What do you think?

jimdempseyatthecove
Honored Contributor III

Line 32 has an inlined function call. The entire inlined function is attributed to line number 32. This function is also compiled out-of-line.

In both cases, the #pragma simd loop (innermost) was vectorized. When in doubt, look at the disassembly window within the debugger or via VTune. Note that producing a disassembly listing is not sufficient when full IPO is enabled.

Jim Dempsey

Ashkan_T_
Beginner

Hi Jim,

Thanks, but this is my point: why is the runtime equal to that of the non-vectorized code, while this is not the case for plain code (without the function call)?

When I use the Cilk Plus version, no information is reported about that line (the function call), and it produces properly vectorized code.

You can just try the code I have posted with this function call and without it to see the difference.

This is very basic OpenMP code, and the fact that it is not performing as expected means that there are some issues.

TimP
Honored Contributor III
(Accepted solution)

icpc 14.0 still depends on -ansi-alias -O3 for full optimization; -ansi-alias will become the default on Linux in 15.0.

Even with an effort to avoid questions on whether a vectorized branch is taken, you would see only a fraction of MKL library performance, so it seems counter-productive to spend much time on this.  The beta version of MKL is supposed to improve performance on smaller cases where source code might have been considered in the past.
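For example, based on the compile line posted earlier in this thread, that would be (a sketch, not a verified command):

icpc -mmic -openmp -no-offload -vec-report2 -Wall -O3 -ansi-alias -std=c++0x intel_q.cc -o intel_q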

Ashkan_T_
Beginner

Thanks Tim!

The problem can be resolved by using the -ansi-alias flag.

TimP
Honored Contributor III

-opt-assume-safe-padding is more helpful than other typical options which you omitted. Scraping some rust off my 2-year-old MIC, VTune rates your inner-loop vectorization utilization as 6.6 (82% efficient), so it is clearly vectorized, although not at all efficient in terms of the fraction of peak floating-point performance.
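Combined with the earlier suggestion, the compile line would become something like (again a sketch, not a verified command):

icpc -mmic -openmp -no-offload -vec-report2 -Wall -O3 -ansi-alias -opt-assume-safe-padding -std=c++0x intel_q.cc -o intel_q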
