Hello,
I have a kernel function that I would like to vectorize. The compiler report (icpc (ICC) 18.0.2 20180210) lists some dependencies that I know I can ignore. However, the code contains two other dependencies that are not listed in the report. (For reference, the code and the vectorization report are included below.) Putting a #pragma ivdep in front of the loop produces vectorized code, but I am not sure whether it handles those dependencies correctly. (In a slightly more specialized variant, auto-vectorization resolved them without any pragma or warning.)
My question is: does the report always list all relevant dependencies, so that ignoring the listed ones with ivdep (when they are irrelevant) does not also drop the treatment of any dependencies the report skipped?
The vectorization report for my example (icpc (ICC) 18.0.2 20180210):
LOOP BEGIN at /home/hpc/pr27be/ga92gac3/lrr_repo/scrimppp/src/ScrimpDistribPar.cpp(231,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed ANTI dependence between s_hor
remark #15346: vector dependence: assumed FLOW dependence between prof_rowmin (241:5) and s_hor
LOOP END
While those should not be a problem (it is actually strange that the const s_hor is reported as a dependency at all), lines 240 and 241 as well as 251 and 252 contain dependencies which need some treatment but are not listed in the report.
So, will they be handled properly if I put the #pragma ivdep before the loop?
201 void ScrimpDistribPar::eval_diag_block_triangle(
202     tsa_dtype prof_colmin[],
203     idx_dtype idx_colmin[],
204     tsa_dtype prof_rowmin[],
205     idx_dtype idx_rowmin[],
206     tsa_dtype tmpQ[],
207     const int blocklen,
208     const int trianglen,
209     const tsa_dtype A_hor[],
210     const tsa_dtype A_vert[],
211     const int windowSize,
212     const tsa_dtype s_hor[],
213     const tsa_dtype mu_hor[],
214     const tsa_dtype s_vert[],
215     const tsa_dtype mu_vert[],
216     const idx_dtype baserow,
217     const idx_dtype basecol
218 )
219 {
220   EXEC_TRACE("evaluate block of diagonals in triangle. Triangle length: " << trianglen << " blocklen " << blocklen);
221
222   //iteration in diagonal direction for all of the blocked diagonals.
223   //the loop is expressed in terms of the column-coordinate
224   for (idx_dtype j=0; j<trianglen; j++)
225   {
226     tsa_dtype profile_j = prof_colmin;
227     idx_dtype index_j = idx_colmin ;
228
229     //iteration over all diagonals in the block. Handling incomplete blocks with the "iterlimit"
230     const int iterlimit = j+std::min(blocklen, trianglen-j);
231     for (idx_dtype i=j; i < iterlimit; ++i)
232     {
233       const idx_dtype diag = i-j;
234       const tsa_dtype corrScore = tmpQ[diag]* (s_vert * s_hor ) - mu_vert * mu_hor ;
235       EXEC_TRACE ("eval i: " << j+basecol << " j: " << i+baserow << " lastz " << tmpQ[diag] << " mu_h_j " << mu_hor ); //logging for debugging
236
237       tmpQ[diag] += A_vert[i+windowSize]*A_hor[j+windowSize] ; //- A_vert*A_hor ;
238       tmpQ[diag] -= A_vert*A_hor ;
239
240       if (corrScore > prof_rowmin) {
241         prof_rowmin = corrScore;
242         idx_rowmin = j+basecol;
243       }
244
245       if (corrScore > profile_j) {
246         profile_j = corrScore;
247         index_j = i+baserow;
248       }
249     }
250     //integration of the result in i direction into memory
251     if (profile_j > prof_colmin ) {
252       prof_colmin = profile_j;
253       idx_colmin = index_j;
254     }
255   }
256 }
- Tags:
- CC++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
You may find it useful to try less extreme pragmas than ivdep. For example, the #pragma vector and #pragma omp simd families of pragmas suspend the compiler's attempt to judge whether vectorization will gain performance, without ignoring all dependencies; omp simd additionally ignores aliasing dependencies. profile_j should be detected as a max reduction (don't use omp simd without declaring the reduction), but in such a complicated context it may cause the compiler to give up on assessing performance gains. index_j appears to have a firstprivate/lastprivate requirement, which in general can't be vectorized correctly: what has worked for a given example with one compiler version has failed after a version upgrade. In principle, it might be handled with an omp simd user-defined reduction, but I can't demonstrate that in practice.
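To illustrate the point about declaring the reduction, here is a minimal sketch, not the original kernel. It assumes float scores and uses two hypothetical helpers, max_score and first_index_of, that do not appear in the posted code. The maximum is computed with an explicitly declared max reduction, and the index is recovered in a separate scalar pass, which sidesteps the firstprivate/lastprivate problem at the cost of a second sweep over the data.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original post): returns the larger of
// `init` and the largest element of scores[0..n).
float max_score(const float* scores, std::size_t n, float init) {
    assert(scores != nullptr || n == 0);
    float best = init;
    // The reduction clause tells the compiler that `best` is combined with
    // max across SIMD lanes; without it, `omp simd` would be incorrect here.
    #pragma omp simd reduction(max:best)
    for (std::size_t i = 0; i < n; ++i) {
        if (scores[i] > best) {
            best = scores[i];
        }
    }
    return best;
}

// Hypothetical helper: scalar second pass that returns the first position
// holding `value`, or -1 if no element matches. Taking the first match
// mirrors the strict `>` comparison in a sequential max-with-index loop.
std::ptrdiff_t first_index_of(const float* scores, std::size_t n, float value) {
    for (std::size_t i = 0; i < n; ++i) {
        if (scores[i] == value) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

The pragma only takes effect when OpenMP SIMD support is enabled (for example -qopenmp-simd with icpc, or -fopenmp-simd with GCC and Clang); otherwise it is ignored and the loop remains a correct scalar loop. A return value of -1 from first_index_of means the initial value was never exceeded, so the previously stored index should be kept.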
Defining local scalar copies of all the
