"#pragma" ivdep not removing asumed vector dependance

Ryo_A_ · ‎02-25-2016

Hi,

We have a question on the behavior of "#pragma ivdep", multi-versioning and assumed vector dependence. We have a workload (LU decomposition) that contains an assumed vector dependence. I did not want to post the whole code here, so I created a short reproducer that has the same behavior.

  const int n = 128;
  float* data = (float*) malloc(sizeof(float)*n*n);
  data[0:n*n] = 1.0f;

  for(int i = 0 ; i < n; i++) {
    for(int j = 0 ; j < n; j++) {
      //#pragma ivdep
      //#pragma vector always
      //#pragma simd
      for(int k = 0 ; k < n; k++) {
        data[i*n+k] += data[j*n+k];
      }
    }
  }

There is an assumed vector dependence here because 'n' could be smaller than the vector length. The compiler recognizes this and applies multi-versioning, as the optimization report shows. However, neither of the two versions it creates is vectorized. The following is the relevant snippet from the optimization report generated with -qopt-report=5:

      LOOP BEGIN at reproducer.cc(15,7)
      <Multiversioned v1>
         remark #25228: Loop multiversioned for Data Dependence
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between data line 16 and data line 16
         remark #15346: vector dependence: assumed ANTI dependence between data line 16 and data line 16
         remark #25438: unrolled without remainder by 2  
      LOOP END

      LOOP BEGIN at reproducer.cc(15,7)
      <Multiversioned v2>
         remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
         remark #25438: unrolled without remainder by 2  
      LOOP END

Multi-version v1 reports that this is an "assumed" dependence, which is what we had expected. Furthermore, adding "#pragma ivdep" does remove the multi-versioning, but the loop is still not vectorized:

      LOOP BEGIN at reproducer.cc(15,7)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between data line 16 and data line 16
         remark #15346: vector dependence: assumed ANTI dependence between data line 16 and data line 16
         remark #25438: unrolled without remainder by 2  
      LOOP END

Finally, we were able to vectorize this workload by forcing vectorization with "#pragma simd" (and we indeed got the correct result, along with significant speedup). For some reason "#pragma vector always" refused to vectorize this loop.

We are currently using the C++ compiler v16.0.1 for Linux, and the problem also occurs with v16.0.0. With earlier compilers, however, multi-versioning of the same code produced one vectorized and one non-vectorized version, and "#pragma ivdep" removed the assumed vector dependence.

Is this a change in behavior with the 16 compiler? If so, is the proper remedy to replace "#pragma ivdep" with "#pragma simd", or is there a different pragma for ignoring this type of assumed vector dependence?

Thanks in Advance!

Ryo

TimP · ‎02-25-2016

#pragma vector always doesn't deal with this dependency. It is surprising that #pragma ivdep no longer has the vectorizing effect.

With my 16.0.2 compiler, ivdep and the 2 simd alternatives suppress multi-versioning. One might think that if the compiler goes to the trouble of making 2 versions in the absence of ivdep, it would vectorize one of them.

#pragma omp simd (with the corresponding compile option) appears to have the same effect as #pragma simd.
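
For anyone who wants to try the OpenMP form and check the result against a scalar run, here is a minimal sketch. The function names (accumulate_rows_scalar, accumulate_rows_simd, simd_matches_scalar) are made up for illustration and are not from the original workload. The array is filled through std::vector rather than array notation so the code also builds with other compilers. The pragma only takes effect when OpenMP SIMD support is enabled (for example -qopenmp-simd with the Intel compiler or -fopenmp-simd with GCC); otherwise it is ignored and both functions run the same scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: the reproducer's loop nest with no vectorization hints.
void accumulate_rows_scalar(std::vector<float>& data, int n) {
    assert(n > 0 && data.size() == static_cast<std::size_t>(n) * n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                data[i * n + k] += data[j * n + k];
            }
        }
    }
}

// Same loop nest with the OpenMP SIMD hint on the innermost loop.
// Row i and row j are either the same row or disjoint, and iteration k
// touches only element k of each, so no iteration depends on another.
void accumulate_rows_simd(std::vector<float>& data, int n) {
    assert(n > 0 && data.size() == static_cast<std::size_t>(n) * n);
    float* base = data.data();
    for (int i = 0; i < n; i++) {
        float* dst = base + static_cast<std::size_t>(i) * n;
        for (int j = 0; j < n; j++) {
            const float* src = base + static_cast<std::size_t>(j) * n;
            #pragma omp simd
            for (int k = 0; k < n; k++) {
                dst[k] += src[k];
            }
        }
    }
}

// Runs both variants on identical input and compares every element.
bool simd_matches_scalar(int n) {
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<float> expected(count, 1.0f);
    std::vector<float> actual(count, 1.0f);
    accumulate_rows_scalar(expected, n);
    accumulate_rows_simd(actual, n);
    return expected == actual;
}
```

Calling simd_matches_scalar with a small n is a quick sanity check that forcing vectorization did not change the result. Note that with n = 128 and an all-ones start, the values grow quickly enough to overflow float in the later rows, so a small n gives a more meaningful comparison.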

Yuan_C_Intel · ‎02-26-2016

Hi, Ryo

Thank you for raising the issue.

This appears to be a regression in 16.0 for #pragma ivdep. I just checked that the 15.0 compiler does generate a vectorized loop when #pragma ivdep is specified:

>icl /c /O2 /Qopt-report5 reproducer.cc
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.4.221 Build 20150407
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
reproducer.cc

Optimization report:

LOOP BEGIN at C:\...\reproducer.cc(10,1)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at C:\...\reproducer.cc(11,3)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at C:\...\reproducer.cc(14,5)
      <Peeled>
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5)
         remark #15388: vectorization support: reference data has aligned access   [ C:\...\reproducer.cc(15,7) ]
         remark #15388: vectorization support: reference data has aligned access   [ C:\...\reproducer.cc(15,7) ]
         remark #15388: vectorization support: reference data has aligned access   [ C:\...\reproducer.cc(15,7) ]
         remark #15399: vectorization support: unroll factor set to 2
         remark #15300: LOOP WAS VECTORIZED
         remark #15442: entire loop may be executed in remainder
         remark #15448: unmasked aligned unit stride loads: 2 
         remark #15449: unmasked aligned unit stride stores: 1 
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 15 
         remark #15477: vector loop cost: 1.250 
         remark #15478: estimated potential speedup: 6.760 
         remark #15479: lightweight vector operations: 5 
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=16
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5)
      <Alternate Alignment Vectorized Loop>
         remark #25015: Estimate of max trip count of loop=16
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5)
      <Remainder>
         remark #15388: vectorization support: reference data has aligned access   [ C:\...\reproducer.cc(15,7) ]
         remark #15388: vectorization support: reference data has aligned access   [ C:\...\reproducer.cc(15,7) ]
         remark #15389: vectorization support: reference data has unaligned access   [ C:\...\reproducer.cc(15,7) ]
         remark #15381: vectorization support: unaligned access used inside loop body
         remark #15301: REMAINDER LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5)
      <Remainder>
      LOOP END
   LOOP END
LOOP END

I am entering this in our problem tracking system. We will try to resolve this issue as soon as we can. I will let you know when I have an update on this issue.

Thanks.

Ryo_A_ · ‎02-26-2016

Hi Tim and Yolanda,

Thank you for the quick responses. We will use "#pragma simd" for now then, and wait for the update of the compiler.

Thanks!

Ryo