Hi,
I am using a simple ikj triple loop to compute a matrix multiplication. The Intel compiler icpc (ICC) 14.0.2 20140120 is used.
Suppose that in the two following cases the number of threads is 1 (no parallel for is used yet).
1- If I use a #pragma omp parallel, the compiled code seems to be vectorized; that is what -vec-report6 tells me. But the running time is equal to the non-vectorized case:
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has aligned access
MATMUL.cc(71): (col. 4) remark: vectorization support: unroll factor set to 4
MATMUL.cc(71): (col. 4) remark: LOOP WAS VECTORIZED
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has unaligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: unaligned access used inside loop body
MATMUL.cc(71): (col. 4) remark: REMAINDER LOOP WAS VECTORIZED
2- On the other hand, if I simply remove the #pragma omp parallel, this message is printed out by -vec-report6:
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has aligned access
MATMUL.cc(71): (col. 4) remark: vectorization support: unroll factor set to 4
MATMUL.cc(71): (col. 4) remark: LOOP WAS VECTORIZED
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has unaligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: unaligned access used inside loop body
MATMUL.cc(71): (col. 4) remark: REMAINDER LOOP WAS VECTORIZED
MATMUL.cc(71): (col. 4) remark: loop skipped: multiversioned
Although it says "loop skipped: multiversioned", which I am not sure what it means exactly, the running time is roughly 6X better, which implies proper vectorization. Using #pragma omp simd does not change the results.
void MatMul_Par(float* A, float* B, float* C) {
  //#pragma omp parallel shared(A,B,C)
  {
    for (int i=0;i<N;i++) {
      for(int k=0;k<N;k++) {
        float temp = A[i*N+k];
        //#pragma omp simd
        for(int j=0;j<N;j++) {
          C[i*N+j] += temp * B[k*N+j];
        }
      }
    }
  } //parallel
}
PS: The problem does not exist when using Intel Cilk Plus, etc. It seems to be related to the parallel pragma in OpenMP.
icpc 14.0 still depends on -ansi-alias -O3 for full optimization; -ansi-alias will be on by default for Linux in 15.0.
Even with an effort to avoid questions about whether a vectorized branch is taken, you would see only a fraction of MKL library performance, so it seems counterproductive to spend much time on this. The beta version of MKL is supposed to improve performance on smaller cases where hand-written source code might have been considered in the past.
There's no point in the omp parallel unless you make the outer loop an omp for (parallel) loop.
As long as you get vectorization, there's little need to designate an omp simd loop. Versioning might be a response to your omission of a __restrict qualifier or other means of asserting non-overlap of the input and output data. Are you relying on inter-procedural optimization for that purpose?
Normally, you want an outer threaded loop and an inner vector (simd) loop. Depending on what loop counts the compiler is optimizing for, it may choose some cache blocking. However, you would use the MKL library provided with icc if your matrices are large enough to get the full benefit of such optimizations.
As your vector report line numbers don't match your sample, and you haven't said much about your goals or compiler options, I don't think there's much more to say.
Thanks! I have found the problem. Passing dynamically allocated arrays to a function whose loops are vectorized causes the problem.
There are definitely some issues with vectorization when using OpenMP. I will try to post it separately as a complete example.
Hi Ashkan,
It looks like your code has some issues; you should use "omp parallel for" instead of "omp parallel", as they have different meanings.
There should be no issue using parallelism with vectorization. The Cilk Plus version of the code would be:
void MatMul_Par(float* A, float* B, float* C) {
  cilk_for (int i=0;i<N;i++) {
    for(int k=0;k<N;k++) {
      float temp = A[i*N+k];
      #pragma simd
      for(int j=0;j<N;j++) {
        C[i*N+j] += temp * B[k*N+j];
      }
    }
  }
}
And the equivalent OpenMP version will be:
void MatMul_Par(float* A, float* B, float* C) {
  #pragma omp parallel for
  for (int i=0;i<N;i++) {
    for(int k=0;k<N;k++) {
      float temp = A[i*N+k];
      #pragma simd
      for(int j=0;j<N;j++) {
        C[i*N+j] += temp * B[k*N+j];
      }
    }
  }
}
You may provide the complete example if you still have issues, as you mentioned.
Thanks,
Shenghong
Hi Shenghong,
Thanks a lot for your answer. The problem is that I cannot edit the post, so it may be confusing for people; that's why I wanted to make it a separate post. But let's discuss it here!
I know the difference between parallel and parallel for, but my point is that the code above (as is) should be able to vectorize the innermost loop.
Now, let's forget about my code and explore yours. I want you to create the arrays dynamically and pass them to the MatMul_Par function. I have done so, and this is what happens:
#include <stdio.h>
#include <omp.h>
#define N 4096

void MatMul_Par(double* A, double* B, double* C) {
  #pragma omp parallel for
  for (int i=0;i<N;i++) {
    for(int k=0;k<N;k++) {
      double temp = A[i*N+k];
      #pragma simd
      for(int j=0;j<N;j++) {
        C[i*N+j] += temp * B[k*N+j];
      }
    }
  }
  return;
}

int main(int argc, char* argv[]) {
  double *a, *b, *c;
  a = new double[N*N];
  b = new double[N*N];
  c = new double[N*N];
  for(int i=0;i<N;i++)
    for(int j=0;j<N;j++) {
      a[i*N+j]=double(i+j);
      b[i*N+j]=double(i-j);
      c[i*N+j]=0.00;
    }
  omp_set_num_threads(240);
  MatMul_Par(a,b,c);                    // this call is what the vec-report attributes to line 32
  delete[] a; delete[] b; delete[] c;   // arrays from new[] must be released with delete[], not free()
  return 0;
}
This is the result of compilation.
icpc -mmic -openmp -no-offload -vec-report2 -Wall -O2 -std=c++0x intel_q.cc -o intel_q
intel_q.cc(26): (col. 3) remark: LOOP WAS VECTORIZED
intel_q.cc(25): (col. 2) remark: loop was not vectorized: not inner loop
intel_q.cc(32): (col. 2) remark: SIMD LOOP WAS VECTORIZED
intel_q.cc(32): (col. 2) remark: loop was not vectorized: not inner loop
intel_q.cc(32): (col. 2) remark: loop was not vectorized: not inner loop
intel_q.cc(11): (col. 5) remark: SIMD LOOP WAS VECTORIZED
intel_q.cc(8): (col. 3) remark: loop was not vectorized: not inner loop
intel_q.cc(7): (col. 2) remark: loop was not vectorized: not inner loop
As you can see, it says the loop at line 32 is vectorized (or not vectorized!). The thing is that there is no loop at line 32: it is the function call!
And as a result the code is not vectorized properly. How do I know? By measuring the runtime on the Xeon Phi. The problem can be resolved if I do not use the function call and put the code inline after the initialization. You can try it!
What do you think?
Line 32 has an inlined function call. The entire inlined function is attributed with line number 32. This function is also compiled out-of-line.
In both cases, the #pragma simd (innermost) loop was vectorized. When in doubt, look at the disassembly window within the debugger or via VTune. Note that producing a disassembly listing is not sufficient when full IPO is enabled.
Jim Dempsey
Hi Jim,
Thanks, but this is my point: why is the runtime equal to that of the non-vectorized code, while this is not the case for the plain code (without the function call)?
When I use the Cilk Plus version, no information is reported about that line (the function call), and it produces properly vectorized code.
You can try the code I posted with and without the function call to see the difference.
It is very basic OpenMP code, and the fact that it is not functioning as expected means that there are some issues.
Thanks Tim!
The problem can be resolved by using the -ansi-alias flag.
-opt-assume-safe-padding is more helpful than the other typical options you omitted. Scraping some rust off my 2-year-old MIC, VTune rates your inner-loop vectorization utilization as 6.6 (82% efficient), so it is clearly vectorized, although not at all efficient in terms of the fraction of peak floating-point performance.