Hi there,
I am trying to use ICC to compile the following benchmark:
#include <stdio.h>

int main (void) {
  int sum=0;
  for (int i = 0; i < 100; i++) {
    sum += 1;
  }
  printf ("sum=%d\n",sum);
  return 0;
}

Looking at the report provided by the compiler, ICC marked the variable “sum” as firstlastprivate; however, it should be a reduction. The induction variable “i” is reported as firstprivate, even though it is declared inside the loop. The output of this program is correct, but the report does not seem to describe what actually happened inside the compiler. Here is the report I get for this loop:
LOOP BEGIN at test.c(5,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ } private={ } firstprivate={ i } lastprivate={ } firstlastprivate={ sum } reduction={ }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END
I was expecting something like the following report:
LOOP BEGIN at test.c(5,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ } private={ } firstprivate={ } lastprivate={ } firstlastprivate={ } reduction={ sum }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END
My machine specifications are:
Lenovo Legion Y7000, 16 GB RAM, 8th-gen Intel Core i7, Ubuntu 18.04.
- Tags:
- CC++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
The compiler's optimizer should not have parallelized this loop: it knew the trip count and the computational weight of each loop iteration. Does examining the generated code show a parallel region for this loop? If not, then the report is in error but the code is correct.
A reduction (+:sum) should have occurred.
Jim Dempsey
Thanks for the feedback. I have set the threshold to 0 to certify that the compiler has optimized the program. I used the following flags:
icc -w -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=3
I am sending the report also:
Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.4.243 Build 20190416

Compiler options: -w -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=3 -o test.out

Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

Begin optimization report for: main(void)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main(void)) [1] reduction.c(3,17)
  -> EXTERN: (8,3) printf(const char *__restrict__, ...)

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at reduction.c(5,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ } private={ } firstprivate={ i } lastprivate={ } firstlastprivate={ sum } reduction={ }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at reduction.c(5,3)
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

    Report from: Code generation optimizations [cg]

reduction.c(3,17):remark #34051: REGISTER ALLOCATION : [main] reduction.c:3

    Hardware registers
        Reserved     : 2[ rsp rip]
        Available    : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  : 6[ rbx rbp r12-r15]
        Assigned     : 12[ rax rdx rcx rbx rsi rdi r8-r13]

    Routine temporaries
        Total        : 80
            Global   : 10
            Local    : 70
        Regenerable  : 40
        Spilled      : 0

    Routine stack
        Variables    : 28 bytes*
            Reads    : 6 [3.00e+02 ~ 32.6%]
            Writes   : 8 [0.00e+00 ~ 0.0%]
        Spills       : 40 bytes*
            Reads    : 10 [5.00e+00 ~ 0.5%]
            Writes   : 10 [0.00e+00 ~ 0.0%]

    Notes

        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.

===========================================================================
The report looks buggy to me, but the program itself is correct.
souza diniz mendonca, gleison wrote:
>> I have set the threshold to 0 to certify that the compiler has optimized the program.
Optimization is controlled by the -O flags. With -par-threshold0 you force the compiler to generate a parallel version even though the loop does almost no computation. That is not an optimization; it is a slowdown.
souza diniz mendonca, gleison wrote:
>> The report looks buggy to me, but the program itself is correct.
Your code is simple, and it looks like the result can be pre-computed during compilation. Try this more complicated code:
#include <stdio.h>
#include <math.h>

int main (int q, char**) {
  int sum=0;
  int len=q*100;
  for (int i = 0; i < len; i++) {
    sum += q*exp(len);
  }
  printf ("sum=%d\n",sum);
  return 0;
}
and you can get what you expect
LOOP BEGIN at test2.cpp(7,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ .2 } private={ } firstprivate={ len q i } lastprivate={ } firstlastprivate={ } reduction={ sum }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25439: unrolled with remainder by 8
LOOP END
>> I have set the threshold to 0 to certify that the compiler has optimized the program.
All parallelized loops have overhead associated with resuming (or, on first use, initializing) the threads of the thread team. This may include waking up suspended threads or, at a minimum, interacting with other threads by means of spin-wait condition variables. The overhead of this coordination must be taken into consideration when deciding whether to parallelize a loop. Take the loop listed above.
When not optimized, it has 200 increments + 100 tests + 100 branches. The serial runtime of this loop is about 1/1000th of the parallelization overhead (which depends on the number of threads).
When your original loop is optimized, I suspect the Intel compiler determines the result of the loop at compile time. If you nevertheless force parallelization of this loop, its runtime is effectively 1/infinity of the overhead.
You should only set the threshold to 0 when you know (or suspect, or have determined) that the compiler will not parallelize a loop that you know ought to be parallelized. ...AND you should know the expense of parallelization, and as such know that your loop ought not to be parallelized.
Jim Dempsey
