Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7943 Discussions

A simple loop cannot be parallelized, why??

heavenbird
Beginner
999 Views
Hello,

I have a very simple loop; it assigns a row of a matrix to an empty double array:
#pragma parallel
#pragma ivdep
for(unsigned int ii=0; ii<num_col; ii++)
{
temp_row[ii]=data[ii*num_row+which_row];
}

where "data" is an array contains the matrix elements, and this matrix is column-majored.
num_row is number of rows in the matrix, num_col is number of columns in the matrix.
And I always got this message telling me that this simple loop cannot be parallelized:
remark: loop was not parallelized: existence of parallel dependence.
parallel dependence: assumed OUTPUT dependence between data line 158 and data line 158.
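To be concrete about the layout (a small sketch, not my real code; the helper name is just for illustration), each iteration of the loop touches a different element of both arrays:

[cpp]// Column-major layout: element (r, c) of a num_row x num_col matrix is
// stored at index c * num_row + r. The loop above therefore reads
// data[ii * num_row + which_row] and writes temp_row[ii], i.e. a distinct
// element of each array on every iteration.
inline unsigned int col_major_index(unsigned int r, unsigned int c, unsigned int num_row)
{
    return c * num_row + r;
}[/cpp]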


Can anyone tell me why, and how I can rewrite this piece of code so that it gets parallelized?
Thanks !!
Haining Wong
0 Kudos
1 Solution
11 Replies
Milind_Kulkarni__Int
New Contributor II
999 Views
Could you be more specific about the test code, so that the data types are clear?

There are certainly many cases where unsigned short/int arithmetic restrains vectorization. Also, try removing unsigned from the loop index and the index arithmetic to check for any progress.

You may also try #pragma vector always, and check the compiler documentation for the /Qrestrict (-restrict) option and the restrict keyword.

Please provide the piece of code and the command-line options that you are using.
0 Kudos
heavenbird
Beginner
999 Views
Thank you for your reply Milind.
I have removed unsigned as you suggested, but the compiler gives the same complaint.
Here is the piece of my testing code:
void row(double* __restrict matrix, double* __restrict filled_row,int col_num, int row_num, int which_row)
{
int col_num_local=col_num;
#pragma parallel
#pragma ivdep
#pragma vector always
for(int ii=0; ii<col_num_local; ii++)
{ filled_row[ii]=matrix[ii*row_num+which_row];} // This is line 22
}
here is my command-line options:
-O3 -ip -inline-level=2 -parallel -par-threshold0 -vec-threshold0 -mkl=parallel -I/opt/intel/Compiler/11.1/072/mkl/include -openmp -openmp-report1 -par_report3 -vec-report2 -fp-model strict -axSSE4.2 -xSSE4.2 -zp16
And this is what the compiler reports:
(col. 2) remark: loop was not parallelized: existence of parallel dependence.
(col. 4) remark: parallel dependence: assumed OUTPUT dependence between matrix line 22 and matrix line 22.
(col. 4) remark: parallel dependence: assumed OUTPUT dependence between matrix line 22 and matrix line 22.
(col. 4) remark: loop was not vectorized: statement cannot be vectorized.
Thank you !
Haining Wong
0 Kudos
Milind_Kulkarni__Int
New Contributor II
1,000 Views

Please try using an OpenMP pragma here, since auto-parallelization does not seem to be working here for reasons I do not know. Do something like #pragma omp parallel for; in that case, the OpenMP region gets parallelized. You may still keep the -parallel option for other parts of the program. Then, with -openmp -parallel, you would get the OpenMP region parallelized while still getting the report messages.

In this case, since there is no inner (fine-grained) loop, the already-parallelized loop does not get vectorized, as the message shows when you have the -openmp option. But I hope the OpenMP pragma will be easier to work with here, so that the loop gets parallelized. So try the OpenMP pragma instead of the auto-parallel feature, even though the compiler should have been able to auto-parallelize this loop.


Sample code:

void row(double* restrict matrix, double* restrict filled_row, int col_num, int row_num, int which_row)
{
    int col_num_local=col_num;
    #pragma omp parallel for
    #pragma ivdep
    // #pragma vector always
    for(int ii=0; ii<col_num_local; ii++)
    { filled_row[ii]=matrix[ii*row_num+which_row]; } // This is line 22
}

int main()
{
    double matrix[2000]={0};
    double filled_row[1500]={0};
    row(matrix, filled_row, 1000, 2, 1); // keep which_row below row_num so the index stays inside matrix[2000]
    return 1;
}

Please give it a try and let me know.

0 Kudos
heavenbird
Beginner
999 Views
Thanks Milind!
Your "omp" solution worked.
Really appreciate your help!
Haining Wong
0 Kudos
Milind_Kulkarni__Int
New Contributor II
999 Views
You are welcome, Mr. Wong!
0 Kudos
kfsone
New Contributor I
999 Views
It may be unable to parallelize because it isn't 100% certain that row_num and which_row aren't changing.
The original code may work if you mark row_num and which_row as const in the function parameters:
[cpp]void row(const double* const restrict matrix, double* const restrict filled_row, const size_t col_num, const size_t row_num, const size_t which_row)
{
  #pragma ivdep
  #pragma vector always
  for ( size_t ii = 0 ; ii < col_num ; ++ii )
  {
    const size_t matrix_ii = (ii * row_num) + which_row ;
    filled_row[ii] = matrix[matrix_ii] ;
  }
}
[/cpp]
0 Kudos
TimP
Honored Contributor III
999 Views
It looks like parallelization would be problematical unless which_row < row_num. In the case where which_row is a multiple of row_num, a clear race condition exists. You would expect ivdep to rule that out unless the compiler is fairly certain that the race condition will appear.
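For example, the precondition could be spelled out along these lines (a sketch only, not code from this thread; the assert and the function name are mine):

[cpp]#include <cassert>
#include <cstddef>

// Sketch: the same copy loop with the precondition made explicit. When
// which_row < row_num, every iteration reads a distinct matrix element and
// writes a distinct filled_row element.
void row_checked(const double* matrix, double* filled_row,
                 std::size_t col_num, std::size_t row_num, std::size_t which_row)
{
    assert(which_row < row_num);   // precondition: a valid row index
    for (std::size_t ii = 0; ii < col_num; ++ii)
        filled_row[ii] = matrix[ii * row_num + which_row];
}[/cpp]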
0 Kudos
heavenbird
Beginner
999 Views
Hello kfsone,
Thanks for your tip.
I've tried your code and found that it works in a regular function. However, once this piece of code is used in a class member function to initialize a row of a matrix, it is very difficult to get it parallelized. I guess the Intel C++ compiler is very conservative when dealing with classes.
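For reference, the member-function version I am trying looks roughly like this (a sketch with made-up class and member names, not my actual code):

[cpp]// Sketch only -- illustrative class, not my real code.
class Matrix
{
public:
    // Copy row which_row of the column-major storage into temp_row.
    void get_row(int which_row, double* __restrict temp_row) const
    {
        const int cols = num_col;   // local copies of the members,
        const int rows = num_row;   // in the spirit of col_num_local above
        #pragma parallel
        #pragma ivdep
        for (int ii = 0; ii < cols; ++ii)
            temp_row[ii] = data[ii * rows + which_row];
    }

private:
    int num_row, num_col;
    double* data;
};[/cpp]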
Do you have a similar experience ?
Thanks
Haining
0 Kudos
heavenbird
Beginner
999 Views
Thank you Tim,
There is certainly a lot for me to learn, even about auto-parallelization.
Being a PhD student, I've been trying my best to convert myself from Matlab to Intel MKL and the C++ compiler.
After hundreds of hours invested, I find that the real weakness of the Intel C++ compiler is the DOCUMENTATION. For example, there should be a chapter on how to write auto-vectorizable/parallelizable code and how to interpret the compiler remarks.
Haining
0 Kudos
Om_S_Intel
Employee
999 Views

I think the information on vectorization and parallelization is available in the Intel Compiler User and Reference Guide, in the chapter "Optimizing Applications".

0 Kudos
TimP
Honored Contributor III
999 Views
Intel Parallel Studio is aimed at diagnosing parallelization problems. The question you raised doesn't appear to be specific to any single vendor's compiler.
The need for automated tools for checking parallel code was recognized years ago; the lineage of Parallel Studio goes back through Intel Thread Checker to KAI Assure.
0 Kudos
Reply