
souza_diniz_mendonca · ‎10-01-2019

Hi there,

I am trying to use ICC to compile the following benchmark:

#include <stdio.h>
int main (void)
{
  int sum=0;
  for (int i = 0; i < 100; i++)
  {
    sum += 1;
  }
  printf ("sum=%d\n",sum);
  return 0;
}

In the report produced by the compiler, ICC lists the variable “sum” as firstlastprivate; however, it should be a reduction. The induction variable “i” is listed as firstprivate, even though it is declared inside the loop. The output of the program is correct, but the report does not seem to describe what actually happened inside the compiler. Here is the report I get for this loop:

LOOP BEGIN at test.c(5,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ } private={ } firstprivate={ i } lastprivate={ } firstlastprivate={ sum } reduction={ }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

I was expecting something like the following report:

LOOP BEGIN at test.c(5,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ } private={ } firstprivate={ } lastprivate={ } firstlastprivate={ } reduction={ sum }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END
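
For readers comparing the two reports, the clause set in the expected report corresponds roughly to an explicit OpenMP reduction. The following is only a minimal sketch under that assumption: the function name count_with_reduction is hypothetical and not part of the original benchmark, and it says nothing about what ICC generates internally.

```c
/* Hypothetical helper (not from the original benchmark): counts n
   iterations using an explicit OpenMP reduction on sum. If OpenMP is not
   enabled at compile time, the pragma is ignored and the loop simply
   runs serially, producing the same result. */
int count_with_reduction(int n)
{
  int sum = 0;
  /* Each thread accumulates into a private copy of sum that starts at 0;
     the private copies are added together when the loop finishes. */
#pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; i++)
  {
    sum += 1;
  }
  return sum;
}
```

Calling count_with_reduction(100) returns 100 regardless of the number of threads. In OpenMP terms, a variable that was only firstprivate and lastprivate would instead end up holding just the count from the thread that ran the final iteration, which is why the firstlastprivate label in the actual report looks suspicious.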

My machine specifications are:

Lenovo Legion Y7000, 16 GB RAM, 8th-gen i7, Ubuntu 18.04.

jimdempseyatthecove · ‎10-02-2019

The optimizer should not have parallelized this loop: it knew the trip count and the computational weight of each iteration. Does examining the generated code show a parallel region for this loop? If not, then the report is in error but the code is correct.

A reduction should have occurred (+:sum).

Jim Dempsey

souza_diniz_mendonca · ‎10-08-2019

Thanks for the feedback. I have set the threshold to 0 to certify that the compiler has optimized the program. I used the following flags:

icc -w -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=3

I am sending the report also:

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.4.243 Build 20190416

Compiler options: -w -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=3 -o test.out

    Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000


Begin optimization report for: main(void)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main(void)) [1] reduction.c(3,17)
  -> EXTERN: (8,3) printf(const char *__restrict__, ...)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at reduction.c(5,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ } private={ } firstprivate={ i } lastprivate={ } firstlastprivate={ sum } reduction={ }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at reduction.c(5,3)
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

    Report from: Code generation optimizations [cg]

reduction.c(3,17):remark #34051: REGISTER ALLOCATION : [main] reduction.c:3

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :   12[ rax rdx rcx rbx rsi rdi r8-r13]
        
    Routine temporaries
        Total         :      80
            Global    :      10
            Local     :      70
        Regenerable   :      40
        Spilled       :       0
        
    Routine stack
        Variables     :      28 bytes*
            Reads     :       6 [3.00e+02 ~ 32.6%]
            Writes    :       8 [0.00e+00 ~ 0.0%]
        Spills        :      40 bytes*
            Reads     :      10 [5.00e+00 ~ 0.5%]
            Writes    :      10 [0.00e+00 ~ 0.0%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

The report looks buggy to me, but the program itself is correct.

Vladimir_P_1234567890 · ‎01-19-2020

souza diniz mendonca, gleison wrote:
I have set the threshold to 0 to certify that the compiler has optimized the program.

Optimization flags are the -O flags. With -par-threshold0 you force the compiler to generate a parallel version even though there is almost no computation. That is not an optimization; it is a slowdown.

souza diniz mendonca, gleison wrote:
The report looks buggy to me, but the program itself is correct.

Your code is simple, and it looks like it can be precomputed at compile time. Try this more complicated code:

#include <stdio.h>
#include <math.h>
int main (int q, char**)
{
  int sum=0;
  int len=q*100;
  for (int i = 0; i < len; i++)
  {
    sum += q*exp(len);
  }
  printf ("sum=%d\n",sum);
  return 0;
}

and you get the report you expect:

LOOP BEGIN at test2.cpp(7,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ .2 } private={ } firstprivate={ len q i } lastprivate={ } firstlastprivate={ } reduction={ sum }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25439: unrolled with remainder by 8
LOOP END

jimdempseyatthecove · ‎01-20-2020

>> I have set the threshold to 0 to certify that the compiler has optimized the program.

All parallelized loops have overhead associated with the resumption (or initialization on first use) of the threads in the thread team. This may include waking suspended threads or, at a minimum, interacting with other threads by means of spin-wait condition variables. The overhead of this coordination must be taken into account when deciding whether to parallelize a loop. Take the loop listed above.

When not optimized, it has 200 increments + 100 tests + 100 branches. The serial runtime of this loop is about 1/1000th of the parallelization overhead (which depends on the number of threads).

When your original loop is optimized, I suspect the Intel compiler determines the result of the loop at compile time. If you then force parallelization of this loop, its runtime is effectively 1/infinity of the overhead.

You should only set the threshold to 0 when you know (or suspect, or have determined) that the compiler is not parallelizing a loop that ought to be parallelized. ...AND you should know the expense of parallelization, and as such know that your loop ought not to be parallelized.
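
The trade-off described above can be sketched as a toy cost model. This is only an illustration: the function worth_parallelizing and its unit-free cost numbers are hypothetical, and it is not the heuristic ICC uses. It assumes the loop body divides evenly across threads and that the fork/join overhead is a fixed cost.

```c
/* Toy cost model (an assumption for illustration, not ICC's heuristic).
   serial_cost: estimated cost of running the whole loop on one thread
   overhead:    fixed cost of waking and coordinating the thread team
   threads:     number of threads in the team
   Returns 1 if the modeled parallel cost beats the serial cost, else 0. */
int worth_parallelizing(double serial_cost, double overhead, int threads)
{
  if (threads < 2)
  {
    return 0; /* a single thread cannot gain anything */
  }
  /* Assume perfect division of the work plus the fixed overhead. */
  double parallel_cost = overhead + serial_cost / threads;
  return parallel_cost < serial_cost;
}
```

Using the figures from this post, the unoptimized loop costs about 400 operations and the overhead is about 1000 times that, so worth_parallelizing(400.0, 400000.0, 4) returns 0. Under the same model, a loop roughly 10,000 times heavier (4.0e6 operations) would be worth parallelizing on 4 threads.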

Jim Dempsey

ICC 19.0.4.243 report contains variables denoted as firstprivate when they appear to be reductions.