ivdep: safe to ignore a possibly incomplete list of dependencies?

Pfeilschifter__Gabri · ‎11-28-2018

Hello,

I have a kernel function that I would like to vectorize. The compiler report (icpc (ICC) 18.0.2 20180210) lists some dependencies that I know I can ignore. However, the code contains two other dependencies that are not listed in the report. (For reference, the code and the vectorization report follow below.) Putting a #pragma ivdep in front of the loop produces vectorized code, but I am not sure whether it handles those dependencies correctly. (In a slightly more specialized variant, the auto-vectorizer resolved them without any pragma or warning.)
My question is: does the report always list all relevant dependencies, so that ignoring the listed ones with ivdep (where they are irrelevant) does not also remove the treatment of any dependencies the report skipped?

The vectorization report for my example (icpc (ICC) 18.0.2 20180210):
LOOP BEGIN at /home/hpc/pr27be/ga92gac3/lrr_repo/scrimppp/src/ScrimpDistribPar.cpp(231,3)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed ANTI dependence between s_hor (234:82) and prof_rowmin (241:5)
      remark #15346: vector dependence: assumed FLOW dependence between prof_rowmin (241:5) and s_hor (234:82)
   LOOP END

While those should not be a problem (it is actually strange that the const s_hor is reported as a dependency at all), lines 240 and 241 as well as 251 and 252 contain dependencies that need some treatment but are not listed in the report.
So, will they be handled properly if I put the #pragma ivdep before the loop?

201 void ScrimpDistribPar::eval_diag_block_triangle(
 202         tsa_dtype prof_colmin[],
 203         idx_dtype idx_colmin[],
 204         tsa_dtype prof_rowmin[],
 205         idx_dtype idx_rowmin[],
 206         tsa_dtype tmpQ[],
 207         const int blocklen,
 208         const int trianglen,
 209         const tsa_dtype A_hor[],
 210         const tsa_dtype A_vert[],
 211         const int windowSize,
 212         const tsa_dtype s_hor[],
 213         const tsa_dtype mu_hor[],
 214         const tsa_dtype s_vert[],
 215         const tsa_dtype mu_vert[],
 216         const idx_dtype baserow,
 217         const idx_dtype basecol
 218     )
 219 {
 220         EXEC_TRACE("evaluate block of diagonals in triangle. Triangle length: " << trianglen << " blocklen " << blocklen);
 221 
 222         //iteration in diagonal direction for all of the blocked diagonals.
 223             //the loop is expressed in terms of the column-coordinate
 224         for (idx_dtype j=0; j<trianglen; j++)
 225         {
 226                 tsa_dtype profile_j = prof_colmin[j];
 227                 idx_dtype index_j = idx_colmin[j];
 228 
 229                 //iteration over all diagonals in the block. Handling incomplete blocks with the "iterlimit"
 230                 const int iterlimit = j+std::min(blocklen, trianglen-j);
 231                 for (idx_dtype i=j; i < iterlimit; ++i)
 232                 {
 233                         const idx_dtype diag = i-j;
 234                         const tsa_dtype corrScore = tmpQ[diag]* (s_vert[i] * s_hor[j]) - mu_vert[i] * mu_hor[j];
 235                         EXEC_TRACE ("eval i: " << j+basecol << " j: " << i+baserow  << " lastz " << tmpQ[diag] << " mu_h_j " << mu_hor[j]); //logging for debugging
 236 
 237                         tmpQ[diag] += A_vert[i+windowSize]*A_hor[j+windowSize]  ; //- A_vert[i]*A_hor[j];
 238                         tmpQ[diag] -= A_vert[i]*A_hor[j];
 239 
 240                         if (corrScore > prof_rowmin[i]) {
 241                                 prof_rowmin[i] = corrScore;
 242                                 idx_rowmin[i] = j+basecol;
 243                         }
 244 
 245                         if (corrScore > profile_j) {
 246                                 profile_j = corrScore;
 247                                 index_j = i+baserow;
 248                         }
 249                 }
 250                 //integration of the result in i direction into memory
 251                 if (profile_j > prof_colmin[j]) {
 252                         prof_colmin[j] = profile_j;
 253                         idx_colmin[j] = index_j;
 254                 }
 255         }
 256 }

TimP · ‎11-28-2018

You may find it useful to try less extreme pragmas than ivdep. For example, the #pragma vector and #pragma omp simd families of pragmas suspend the compiler's attempt to judge whether vectorization will gain performance, without ignoring all dependencies; omp simd additionally ignores aliasing dependencies. profile_j should be detected as a max reduction (don't use omp simd without declaring the reduction), but in such a complicated context the compiler may give up on assessing performance gains. index_j appears to have a firstprivate/lastprivate requirement, which can't in general be vectorized correctly: what has worked for a given example with one compiler version has failed after a version upgrade. In principle, it might be handled with an omp simd user-defined reduction, but I can't demonstrate that in practice.
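
As a rough illustration of declaring the reduction explicitly, here is a minimal sketch. The names block_max and score are made up for this example and are not taken from the kernel above. It covers only the running maximum (the role of profile_j), not the accompanying index.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper (not from the original kernel): largest value in score[0..n).
double block_max(const double* score, std::size_t n) {
    assert(score != nullptr || n == 0);
    double best = -std::numeric_limits<double>::infinity();
    // The reduction clause declares that `best` is a max reduction, so the
    // loop-carried update of `best` is combined across SIMD lanes instead of
    // being ignored, as it could be with a bare ivdep or omp simd.
    #pragma omp simd reduction(max : best)
    for (std::size_t i = 0; i < n; ++i) {
        if (score[i] > best) {
            best = score[i];
        }
    }
    return best;
}
```

The pragma only takes effect when OpenMP SIMD support is enabled (typically -qopenmp-simd with the Intel compiler or -fopenmp-simd with GCC). Without it the pragma is ignored and the loop still computes the same result sequentially.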

Defining local scalar copies of all the elements may help in avoiding false dependencies; your compiler report may be identifying some.
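
A minimal sketch of what a local scalar copy could look like, on a simplified stand-in for the inner loop. The function update_rowmax and its parameter q are hypothetical, and the sketch assumes that s_hor is indexed by the outer loop variable j.

```cpp
#include <cassert>
#include <cstddef>

// Simplified, hypothetical stand-in for the inner loop of the kernel.
// Assumption: s_hor is indexed by the outer (column) index j.
void update_rowmax(double* prof_rowmin, const double* q, const double* s_vert,
                   const double* s_hor, std::size_t j, std::size_t n) {
    assert(prof_rowmin && q && s_vert && s_hor);
    // Local scalar copy: s_hor[j] is loaded once, before any store to
    // prof_rowmin, so the loop body no longer reads from s_hor at all.
    const double s_hor_j = s_hor[j];
    for (std::size_t i = 0; i < n; ++i) {
        const double score = q[i] * (s_vert[i] * s_hor_j);
        if (score > prof_rowmin[i]) {
            prof_rowmin[i] = score;
        }
    }
}
```

This only takes the s_hor load out of the loop body. The compiler may still assume aliasing between prof_rowmin and the other arrays, so it is not guaranteed to vectorize the loop without further hints.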