Hi,
We have a question on the behavior of "#pragma ivdep", multi-versioning and assumed vector dependence. We have a workload (LU decomposition) that contains an assumed vector dependence. I did not want to post the whole code here, so I created a short reproducer that has the same behavior.
const int n = 128;
float* data = (float*) malloc(sizeof(float)*n*n);
data[0:n*n] = 1.0f;
for(int i = 0 ; i < n; i++) {
  for(int j = 0 ; j < n; j++) {
    //#pragma ivdep
    //#pragma vector always
    //#pragma simd
    for(int k = 0 ; k < n; k++) {
      data[i*n+k] += data[j*n+k];
    }
  }
}
There is an assumed vector dependence here because 'n' could be smaller than the vector length; the optimization report recognizes this, and the compiler implements multi-versioning. However, neither of the two versions it creates is vectorized. The following is the snippet from the optimization report with -qopt-report=5:
LOOP BEGIN at reproducer.cc(15,7) <Multiversioned v1>
   remark #25228: Loop multiversioned for Data Dependence
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between data line 16 and data line 16
   remark #15346: vector dependence: assumed ANTI dependence between data line 16 and data line 16
   remark #25438: unrolled without remainder by 2
LOOP END

LOOP BEGIN at reproducer.cc(15,7) <Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
   remark #25438: unrolled without remainder by 2
LOOP END
Multi-version v1 reports that this is an "assumed" dependence, which is what we had expected. Furthermore, adding "#pragma ivdep" does resolve the multi-versioning, but the loop is still unvectorized.
LOOP BEGIN at reproducer.cc(15,7)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between data line 16 and data line 16
   remark #15346: vector dependence: assumed ANTI dependence between data line 16 and data line 16
   remark #25438: unrolled without remainder by 2
LOOP END
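For anyone who wants to try this outside the Intel toolchain, here is a minimal sketch of the reproducer as a compilable function. It is not the code from the post: the name `run_reproducer` is made up for illustration, `std::vector` replaces the `malloc` call and the array-notation initialization, and the pragma is guarded so that compilers that do not define `__INTEL_COMPILER` simply skip it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): runs the reproducer's
// triple loop on an n-by-n matrix filled with ones and returns the result.
std::vector<float> run_reproducer(int n) {
    assert(n > 0);
    std::vector<float> data(static_cast<std::size_t>(n) * n, 1.0f);
    float* p = data.data();
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            // Ask the Intel compiler to ignore assumed (unproven) dependences
            // in the loop that follows. Other compilers skip this block.
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
            for (int k = 0; k < n; k++) {
                p[i * n + k] += p[j * n + k];
            }
        }
    }
    return data;
}
```

Compiling a file that contains this function with -qopt-report=5 should show whether the innermost loop was vectorized for a given compiler version.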
Finally, we were able to vectorize this workload by forcing vectorization with "#pragma simd" (and we indeed got the correct result, along with significant speedup). For some reason "#pragma vector always" refused to vectorize this loop.
We are currently using C++ compiler v16.0.1 for Linux, and the problem also occurred with v16.0.0. With earlier compilers, the same code with multi-versioning produced one vectorized and one non-vectorized version, and "#pragma ivdep" removed the assumed vector dependence.
Is this a change in behavior with the 16 compiler? If so, is the proper remedy to replace "#pragma ivdep" with "#pragma simd", or is there a different pragma for ignoring this type of assumed vector dependence?
Thanks in advance!
Ryo
#pragma vector always doesn't deal with this dependency. It is surprising that #pragma ivdep no longer has the vectorizing effect.
With my 16.0.2 compiler, ivdep and the 2 simd alternatives suppress multi-versioning. One might think that if the compiler goes to the trouble of making 2 versions in the absence of ivdep, it would choose one of them to vectorize.
#pragma omp simd (with the corresponding compile option) appears to have the same effect as #pragma simd.
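As a rough illustration of the `#pragma omp simd` route, the sketch below runs the reproducer's kernel with and without the pragma and checks that both give identical results. The function names are invented for this example and are not from the thread. The pragma only takes effect when OpenMP SIMD support is enabled (for example `-qopenmp-simd` with the Intel compiler, or `-fopenmp-simd` with GCC and Clang); otherwise it is ignored and both functions compile to the same scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference: the reproducer's kernel with no pragma.
void add_rows_scalar(float* data, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                data[i * n + k] += data[j * n + k];
}

// Same kernel, with the innermost loop marked for SIMD execution.
void add_rows_simd(float* data, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            // Tells the compiler the k iterations may run in SIMD lanes.
            // That holds here: rows i and j are either the same row
            // (element-wise self-add) or do not overlap at all.
#pragma omp simd
            for (int k = 0; k < n; k++) {
                data[i * n + k] += data[j * n + k];
            }
        }
    }
}

// Runs both kernels on identical input and compares the results exactly.
bool simd_matches_scalar(int n) {
    assert(n > 0);
    std::vector<float> a(static_cast<std::size_t>(n) * n, 1.0f);
    std::vector<float> b = a;
    add_rows_scalar(a.data(), n);
    add_rows_simd(b.data(), n);
    return a == b;
}
```

Exact comparison is reasonable here because vectorizing over k does not reorder the additions applied to any single element, so each element sees the same sequence of operations in both versions.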
Hi, Ryo
Thank you for raising the issue.
This appears to be a regression in 16.0 for #pragma ivdep. I just checked that the 15.0 compiler did generate a vectorized loop with #pragma ivdep specified.
>icl /c /O2 /Qopt-report5 reproducer.cc
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.4.221 Build 20150407
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
reproducer.cc
Optimization report:
LOOP BEGIN at C:\...\reproducer.cc(10,1)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at C:\...\reproducer.cc(11,3)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at C:\...\reproducer.cc(14,5) <Peeled>
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5)
         remark #15388: vectorization support: reference data has aligned access [ C:\...\reproducer.cc(15,7) ]
         remark #15388: vectorization support: reference data has aligned access [ C:\...\reproducer.cc(15,7) ]
         remark #15388: vectorization support: reference data has aligned access [ C:\...\reproducer.cc(15,7) ]
         remark #15399: vectorization support: unroll factor set to 2
         remark #15300: LOOP WAS VECTORIZED
         remark #15442: entire loop may be executed in remainder
         remark #15448: unmasked aligned unit stride loads: 2
         remark #15449: unmasked aligned unit stride stores: 1
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 15
         remark #15477: vector loop cost: 1.250
         remark #15478: estimated potential speedup: 6.760
         remark #15479: lightweight vector operations: 5
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=16
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5) <Alternate Alignment Vectorized Loop>
         remark #25015: Estimate of max trip count of loop=16
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5) <Remainder>
         remark #15388: vectorization support: reference data has aligned access [ C:\...\reproducer.cc(15,7) ]
         remark #15388: vectorization support: reference data has aligned access [ C:\...\reproducer.cc(15,7) ]
         remark #15389: vectorization support: reference data has unaligned access [ C:\...\reproducer.cc(15,7) ]
         remark #15381: vectorization support: unaligned access used inside loop body
         remark #15301: REMAINDER LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at C:\...\reproducer.cc(14,5) <Remainder>
      LOOP END
   LOOP END
LOOP END
I am entering this in our problem tracking system. We will try to resolve this issue as soon as we can. I will let you know when I have an update on this issue.
Thanks.
Hi Tim and Yolanda,
Thank you for the quick responses. We will use "#pragma simd" for now then, and wait for the update of the compiler.
Thanks!
Ryo