souza_diniz_mendonca · ‎01-16-2020

Hi there,

I am compiling the code below with ICC using the following command line:

icc -w -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=5

  1 #include <stdio.h>
  2 int a[100];
  3
  4 int main(int argc, char *argv[])
  5 {
  6   int len=argc;
  7   int i,x=10;
  8
  9   for (i=0;i<len;i++)
 10   {
 11     a[x] = i;
 12     x=i;
 13   }
 14
 15   for (i = 0; i < len; i++)
 16     printf("%d ", a[i]);
 17   printf("x=%d",x);
 18   return 0;
 19 }

The code is a modification of the following program in AutoParBench:

https://github.com/LLNL/dataracebench/blob/master/micro-benchmarks/DRB016-outputdep-orig-yes.c

The loop pattern in the code has two pairs of dependences:

1. loop carried output dependence

x = ..;

2. loop carried true dependence due to:

.. = x; // the index read in a[x]

x = ..;

Below I am showing you the report produced by ICC. It seems that ICC tried to parallelize the loop at lines 9-13.

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.4.243 Build 20190416

Compiler options: -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=5 -o test.out

    Report from: Interprocedural optimizations [ipo]

  WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
  WHOLE PROGRAM (SEEN) [TABLE METHOD]: true
  WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.
      The compiler generally limits the amount a routine can grow by having
      routines inlined into it.

Begin optimization report for: main(int, char **)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main(int, char **)) [1/1=100.0%] modified_clean_DRB016-outputdep-orig-yes.c(5,1)
  -> EXTERN: (16,5) printf(const char *__restrict__, ...)
  -> EXTERN: (17,3) printf(const char *__restrict__, ...)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ .2 } private={ } firstprivate={ argc } lastprivate={ } firstlastprivate={ i } reduction={ }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25439: unrolled with remainder by 2  
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
<Remainder>
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(15,3)
   remark #17104: loop was not parallelized: existence of parallel dependence
   remark #15382: vectorization support: call to function printf(const char *__restrict__, ...) cannot be vectorized   [ modified_clean_DRB016-outputdep-orig-yes.c(16,5) ]
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25439: unrolled with remainder by 2  
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
<Remainder>
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=100
LOOP END

    Report from: Code generation optimizations [cg]

modified_clean_DRB016-outputdep-orig-yes.c(5,1):remark #34051: REGISTER ALLOCATION : [main] modified_clean_DRB016-outputdep-orig-yes.c:5

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :   14[ rax rdx rcx rbx rsi rdi r8-r15]
        
    Routine temporaries
        Total         :     125
            Global    :      33
            Local     :      92
        Regenerable   :      46
        Spilled       :       1
        
    Routine stack
        Variables     :      32 bytes*
            Reads     :       6 [0.00e+00 ~ 0.0%]
            Writes    :       9 [0.00e+00 ~ 0.0%]
        Spills        :      48 bytes*
            Reads     :      11 [5.00e+00 ~ 0.6%]
            Writes    :      11 [0.00e+00 ~ 0.0%]
    
    Notes
    
        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.
    

===========================================================================

However, Intel Inspector reports a data race in the loop parallelized by ICC. The contents of “log/realtime_mode.log”, generated by Intel Inspector, follow below.

<?xml version="1.0" encoding="UTF-8"?>
<feedback>
 <message severity="verbose">Analysis started...</message>
 <nop/>
 <message severity="info">Collection started. To stop the collection, either press CTRL-C or enter from another console window: inspxe-cl -r /home/gleison/Desktop/Fernando_modifed_example/r005ti3 -command stop.</message>
 <nop/>
 <message severity="verbose">Result file: /home/gleison/Desktop/Fernando_modifed_example/r005ti3/r005ti3.inspxe </message>
 <nop/>
 <message severity="verbose">Found target process /home/gleison/Desktop/Fernando_modifed_example/test.out (PID = 20895). Analysis started... </message>
 <nop/>
 <message severity="verbose">Loaded module: /home/gleison/Desktop/Fernando_modifed_example/test.out. </message>
 <nop/>
 <message severity="verbose">Loaded module: /lib64/ld-linux-x86-64.so.2. </message>
 <nop/>
 <message severity="verbose">Loaded module: [vdso]. </message>
 <nop/>
 <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libm.so.6. </message>
 <nop/>
 <message severity="verbose">Loaded module: /usr/lib/x86_64-linux-gnu/libiomp5.so. </message>
 <nop/>
 <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libgcc_s.so.1. </message>
 <nop/>
 <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libpthread.so.0. </message>
 <nop/>
 <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libc.so.6. </message>
 <nop/>
 <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libdl.so.2. </message>
 <nop/>
 <message severity="verbose">Loaded module: /opt/intel/inspector_2019.4.0.597413/lib64/runtime/libittnotify.so. </message>
 <nop/>
 <message severity="warning">One or more threads in the application accessed the stack of another thread. This may indicate one or more bugs in your application. Setting the Inspector to detect data races on stack accesses and running another analysis may help you locate these and other bugs.</message>
 <nop/>
 <message severity="verbose">Unloaded module: /home/gleison/Desktop/Fernando_modifed_example/test.out. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /lib64/ld-linux-x86-64.so.2. </message>
 <nop/>
 <message severity="verbose">Unloaded module: [vdso]. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libm.so.6. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /usr/lib/x86_64-linux-gnu/libiomp5.so. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libgcc_s.so.1. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libpthread.so.0. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libc.so.6. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libdl.so.2. </message>
 <nop/>
 <message severity="verbose">Unloaded module: /opt/intel/inspector_2019.4.0.597413/lib64/runtime/libittnotify.so. </message>
 <nop/>
 <message severity="verbose">Process /home/gleison/Desktop/Fernando_modifed_example/test.out (PID = 20895) has terminated. </message>
 <nop/>
 <message severity="verbose">Application exit code: 0 </message>
 <nop/>
 <message severity="verbose">Result file: /home/gleison/Desktop/Fernando_modifed_example/r005ti3/r005ti3.inspxe </message>
 <nop/>
 <message severity="verbose">Analysis completed</message>
 <nop/>
 <message severity="info">  </message>
 <nop/>
 <message severity="info">1 new problem(s) found </message>
 <nop/>
 <message severity="info">    1 Data race problem(s) detected </message>
 <nop/>
</feedback>

The loop has a race condition on a[10], which, if the loop runs in parallel, can receive either the integer 0 (first iteration, i==0) or 11 (twelfth iteration, i==11).

Regards,

Gleison

jimdempseyatthecove · ‎01-16-2020

The above loop does not conform to the requirements of auto-parallelization because x is shared (though it is more than likely registerized, and so benign by accident).

** also, at issue is if the line 11 statement remains as-is, there is no possible way that more than one thread, each with would insert the transient value of 0 into a[10].

I do not think it is meaningful to post a query of a problem using non-conformant code.

Note, #pragma parallel firstprivate(x) would not correct the programming problem due to the different starting point value for i by each thread.

Jim Dempsey

jimdempseyatthecove · ‎01-16-2020

The loop could potentially be properly constructed (same result as serial code) at -O3 provided that a[10]=0 gets lifted out of the loop (lifting of loop invariant code) and then (optionally) being elided due to results not used before being replaced. The compiler writers are very smart and it is possible that this degree of optimization could be done at -O3 optimization level.

Jim Dempsey

Pereira__Fernando · ‎01-17-2020

Dear Jim,

Thank you for looking into this rather long report. I was working with Gleison on this program when we saw that ICC was parallelizing the loop.

> also, at issue is if the line 11 statement remains as-is, there is no possible way that more than one thread, each with would insert the transient value of 0 into a[10].

I did not understand this part of your answer. The original loop that was parallelized (according to the ICC report) was:

// len == 100
for (i=0;i<len;i++) {
  a[x] = i;
  x=i;
}

This code would give us the following assignments at different iterations, where each iteration is determined by 'i':

i==0) a[10] = 0; x = 0;
i==1) a[0] = 1; x = 1;
i==2) a[1] = 2; x = 2;
...
i==11) a[10] = 11; x = 11;
...

There would be an assignment of '0' and '11' at 'a[10]' if the code runs in parallel. By using the command in the bug report (icc -w -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=5), are we forcing ICC to parallelize a loop that is not parallelizable? I mean to say: are we breaking some pre-conditions expected by ICC, or would it be fair to assume that the compiler would refuse to parallelize the code?

Kind regards,

Fernando

souza_diniz_mendonca · ‎01-17-2020

Dear Jim,

thank you for looking into this. I had not used -O3 to compile that program.

Kind regards,

Gleison

Vladimir_P_1234567890 · ‎01-17-2020

A data race is detected on the loop index i (line 9); it looks to be a false positive.

If you look at the assembly, the compiler checks whether x is within the loop range for a[10] = i; if x is not within the range (len<10), it assigns a[10] = 0. The compiler also sees the correlation between x and i. The same holds for the vector version. So there is nothing wrong with parallelization or vectorization of this loop besides low amount of compute.

Vladimir

jimdempseyatthecove · ‎01-18-2020

By setting x=10 outside the loop, .AND. should the loop get parallelized, then the initial i for each thread will differ, and thus the statement a[x] = i, the first time each thread executes it (provided that x is registerized, as it can be with -O3), has a race as to what value gets inserted into a[10].

Note, the loop control variable i, is implicitly attributed as private. However your x is not, it is implicitly shared. In Debug mode, it is (likely) memory located, with optimization (of this simple loop) it most likely will be registerized, however, by not being explicitly marked as firstprivate either all or all but the master thread will have junk value. So either arbitrary (random) locations get written to with different i together with the race condition .OR. with firstprivate the desired location (a[10]) gets written to with different i together with the race condition.

IOW your non-compliant code will not produce the intended result except by fortuitous accident.

>>So there is nothing wrong with parallelization or vectorization of this loop besides low amount of compute.

The loop (and code) as written has issues:

Is x to be shared or private or firstprivate?
Is x permitted to be registerized?
If x is neither private nor registerized (iow shared), which thread's x value is to be used in a[x]=i?

Jim Dempsey

jimdempseyatthecove · ‎01-18-2020

The parallel version of your program could:

Attempt to write to invalid address and SegFault
Successfully write to unintended address (outside of bounds of the array) and corrupt the program in a fatal way
Successfully write to unintended address (outside of bounds of the array) and corrupt the program in a benign/unobservable way
Result in a[10] being written with the value of i of the first iteration of the last thread to perform the statement (the i of all other threads having advanced past a[10])
Or, at end of loop you might receive the expected value in a[10].

Jim Dempsey

jimdempseyatthecove · ‎01-18-2020

The problems with this loop, once you fully understand the problems and potential consequences, should instill in you the ability of insight to mentally detect these issues and .NOT. solely rely on the compiler detecting such issues .AND. .NOT. rely on the observation that your simple verification test program never observed an error. Too often, errors are never observed under test, and potentially observed when someone else is using the code. Ascertaining the problem from a problem report is difficult, especially when the problem resides in code that you are "assured" is bug free.

Jim Dempsey

Vladimir_P_1234567890 · ‎01-18-2020

jimdempseyatthecove (Blackbelt) wrote:
>>So there is nothing wrong with parallelization or vectorization of this loop besides low amount of compute.
The loop (and code) as written has issues:
Is x to be shared or private or firstprivate?
Is x permitted to be registerized?
If x is neither private nor registerized (iow shared), which thread's x value is to be used in a[x]=i?
Jim Dempsey

Do you think the compiler can't recognize such simple patterns and precompute x? All your questions become valid once you add complexity to the loop, for example by changing x=i to x=i*len. Auto-parallelization fails in this case with the following diagnostics:

LOOP BEGIN at test.cpp(9,9)
   remark #17104: loop was not parallelized: existence of parallel dependence
   remark #17106: parallel dependence: assumed OUTPUT dependence between a (11:17) and a (11:17)
   remark #17106: parallel dependence: assumed OUTPUT dependence between a (11:17) and a (11:17)
   remark #17106: parallel dependence: assumed ANTI dependence between x (11:17) and x (12:22)
   remark #17106: parallel dependence: assumed FLOW dependence between x (12:22) and x (11:17)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed ANTI dependence between x (11:17) and x (12:22)
   remark #15346: vector dependence: assumed FLOW dependence between x (12:22) and x (11:17)
LOOP END

jimdempseyatthecove · ‎01-19-2020

You ask what the compiler should do with an improper (ambiguous) piece of code. The answer to this is: ambiguous results. Also note that your serial code (as written) leaves a[len-1] undefined.

There can be something to be said about the ambiguous results:

With serial code that is optimized, I would expect both i and x to be registerized, giving the same behavior as memory based copies of those variables.

With parallel code, and no disambiguation of what to do with x, for both unoptimized and optimized code, i will be registerized and x will be implicitly shared and thus located in memory. Shared use of x, with multiple threads' executions of the loop overlapping, will produce a race condition on x but not on i. Because of this race, thread M may (re)set x between the time thread N sets x on one iteration and the time thread N uses x on the next iteration. Should this occur, the value of i for thread N gets written into a[x] with x having been replaced by the value of i from thread M. IOW it is possible that some cells of a[...] do not get written while other cells of a[...] receive incorrect values. *** It is also possible that the threads' start times are skewed enough that each thread's initialization of x and run of the loop occur independently of all other threads, and thus produce the expected result.

*** Observing the correct results during testing is no confirmation that your code is correct.

Additionally, the ambiguity of your code is such that x was not declared with the attribute volatile or as an atomic<int>. Therefore, the compiler has some liberties as to how it uses x, in particular, considering modern CPUs have small vector instructions, the compiler may be free to generate a small vector of a subsection of a[...] and then store it at either the intended location of a[...] or at some substituted location within a[...] and potentially exceeding a[...] within a small vector width-1.

If you want (require) this loop to be parallelized, then correct your code to produce the results you desire, and do so in a manner such that it can be parallelized. Otherwise be thankful that the compiler did not parallelize this loop such that it does no worse than your code as written (keep in mind of a[len-1] being undefined). The compiler is somewhat like Alice and your code is somewhat like Humpty Dumpty where you want the compiler to generate code as you intend it to be, and not as you actually say it to be.

Jim Dempsey

souza_diniz_mendonca · ‎01-20-2020

Dear Jim and Vladimir,

Thank you for the clarification. I have modified the program to ensure the absence of out-of-bounds memory accesses, regardless of its inputs:

#include <stdio.h>
#include <stdlib.h>
int *a;
int main(int argc, char *argv[]) {
  int len = argc;
  int i;
  int x = argc > 2 ? len - 2 : 0;
  a = (int*)malloc(len * sizeof(int));
  for (i=0;i<len;i++) {
    a[x] = i;
    x=i;
  }
  for (i = 0; i < len - 1; i++)
    printf("%d ", a[i]);
  printf("x=%d",x);
  return 0;
}

This program is still parallelized by ICC with the following command:

icc -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=5

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(13,3)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ .2 } private={ } firstprivate={ argc a } lastprivate={ } firstlastprivate={ i } reduction={ }
   remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
   remark #25439: unrolled with remainder by 2  
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
LOOP END

The resulting binary leads to a race condition, which can be detected by Intel Inspector. Below I am showing the output of Inspector with the sequential program, where argc == 16:

Collection started. To stop the collection, either press CTRL-C or enter from another console window: inspxe-cl -r /home/gleison/Desktop/Fernando_modifed_example/r002ti3 -command stop.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 x=15Warning: Failed to find more than one thread. Ensure that your program is multithreaded.
 
0 new problem(s) found

And now the output of the Inspector with the binary produced by ICC after parallelization:

icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
Collection started. To stop the collection, either press CTRL-C or enter from another console window: inspxe-cl -r /home/gleison/Desktop/Fernando_modifed_example/r003ti3 -command stop.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 x=15Warning: One or more threads in the application accessed the stack of another thread. This may indicate one or more bugs in your application. Setting the Inspector to detect data races on stack accesses and running another analysis may help you locate these and other bugs.
 
1 new problem(s) found
    1 Data race problem(s) detected

Would it be possible to have ICC running some analysis to prevent the parallelization of this loop? Or perhaps issuing a warning about the potential data race that the parallelization introduces?

Kind regards,

Gleison

jimdempseyatthecove · ‎01-21-2020

The second report states:

>>Warning: One or more threads in the application accessed the stack of another thread. This may indicate one or more bugs in your application.

This is not an error statement, rather it is a warning of potential error. You must look closer. The warning said so:

>>Setting the Inspector to detect data races on stack accesses and running another analysis may help you locate these and other bugs.

Your code did not specifically state the disposition of variable x. As such it is implicitly shared, however nothing is stated as to how it is to be shared. IOW can it be registerized, is it to be programmed as if it were volatile, is it to be programmed as if it were atomic<int>. The inspector is simply stating:

There is a potential for a programming error. It could have been a little more descriptive, but nonetheless it is sufficient for you to take a look.

BTW with such a short loop run, it is likely that the Inspector would not be able to detect the data race.

Jim Dempsey

jimdempseyatthecove · ‎01-21-2020

Also note, should the thread dispatched to run the last segment of a[...] complete the loop prior to any of the other threads starting the loop .AND. if the last thread updated the shared x, then the first-next thread will bugger a[len]. One past the end of the array.

You have a fundamental programming error with respect to parallelizing this loop. You, not the compiler, must fix this.

Jim Dempsey

Thompson__Max · ‎01-28-2020

Hi everyone. I am new to this forum. Interesting thread, thanks for the information.

ICC 19.0.4.243 parallelized the loop, with the race condition confirmed, on a Lenovo Legion Y7000 (16 GB RAM, 8th-gen i7, Ubuntu 18.04).