Software Archive
Read-only legacy content

Faulty code with icc and OpenMP (-O2)?

Simon_H_2
Beginner
557 Views

Hi,
please consider the example below, which is the minimal crashing example I could create. This is seems issue which we observe in at least 2 of our applications on the MIC. I did not manage to create a single-file reproducer, but nevertheless it is quite short and simple.

  • shows up only with -O2 (dedault), not with -O0, -O1, and -O3
  • only on the MIC (-mmic)
  • a "#pragma novector" seems to fix (or hide?) the issue
  • I can reproduce this with icc13, icc14, and icc15
  • I have another (very complex) code, where I don't get a crash, but wrong results with a "#pragma omp parallel for". Due to the complexity I could neither show that this is the same issue, nor produce a small example, but it looks related.

Best,

Simon

 

Compile with:

icc -mmic -c helpers.cc -o helpers.o
icc -mmic -openmp -c crash.cc -o crash.o
icc -mmic -openmp test.cc crash.o helpers.o -o test

test.cc:

#include "crash.h"

int main() {
    double *data[1];
    data[0] = new double[16];

    IndexTable table;
    Crash crash(table);

    crash.working_wrapper_for_crash(data); // prints "success"
    crash.crash(data); // gives "Bus error" (or Segmentation fault, with some small changes)
}

crash.h:

#ifndef CRASH_H
#define CRASH_H

#include "helpers.h"


class Crash {
    public:
        Crash(const IndexTable &table);

        void working_wrapper_for_crash(double **data);
        void crash(double **data);

    private:
        double *buffer[1];
        const IndexTable &table;
};


#endif // CRASH_H

crash.cc:

#include "crash.h"
#include <stdio.h>


Crash::Crash(const IndexTable &table) : table(table) {
    buffer[0] = new double[size()];
}


void Crash::working_wrapper_for_crash(double **data) {
    double *tmp[1];
    tmp[0] = data[0];

    crash(tmp);
}


void Crash::crash(double **data) {
    int mu=0;

#pragma omp parallel for num_threads(1)
    for(int i=0; i<size(); i++) {
        int site_index = table.index(2*mu+1, i)*components();

        for(int c=0; c<components(); c++)
            buffer[mu] = data[mu][site_index+c];
    }
    printf("success\n");
}

helpers.h:

#ifndef HELPERS_H
#define HELPERS_H

int components();
int size();

struct IndexTable {
    int index(int i, int j) const;
};

#endif // HELPERS_H


helpers.cc:

#include "helpers.h"

int components() {
    return 1;
}

int size() {
    return 4;
}

int IndexTable::index(int i, int j) const {
    return 0;
}

 

0 Kudos
6 Replies
Kevin_D_Intel
Employee
557 Views

Hi Simon - Thank you for the efforts to isolate/reduce this as far as you have. I will try reproducing what is described and direct this to the appropriate Development team.

0 Kudos
Kevin_D_Intel
Employee
557 Views

I was successful reproducing the failure and reported it to Development (see internal tracking id below) and will keep this post updated as I learn more.

I could not determine where to place the novector you mentioned to avoid the error. For me, placing a pragma noprefetch before the nested for inside the parallel for (in crash.cc) avoided the error, or compiling that source file at -O1.

(Internal tracking id: DPD200365736)

0 Kudos
Simon_H_2
Beginner
557 Views
Hi Kevin, are there any updates regarding this issue? Thanks, Simon
0 Kudos
Kevin_D_Intel
Employee
557 Views

At present, the last update indicates having identified the root cause related to inserting prefetch instructions along with a suggestion for a potential fix. I requested a current status update and will pass along what I hear back.

0 Kudos
Kevin_D_Intel
Employee
557 Views

Simon, I summarized below the update that I received from Development regarding this issue.
 
They indicate there is a combination of two issues the test case exposed and that the fix will be to turnoff prefetching for the concerned loop. However, they offered some source code modifications that can help avoid losing the prefetching optimization for the concerned loop. It is unclear if/how these changes might be usable in your original source code, hopefully they will be.
 
The first change is to direct use inside an OpenMP code of a constant variable defined outside it. In the case of the crash function, the variable is “mu”. With OpenMP, the variable mu becomes a memory dereference that inhibits optimizations. The introduction of a local copy to the OpenMP code ensures the compiler knows this variable is indeed constant. Adding the line “int mu_local = mu;” at the beginning of the outer for loop on "i" will solve this issue. This is a best practice for OpenMP.
 
The second issue relates to using functions as the upper bound of loops. As per the language standard, functions have to be evaluated at each loop iteration. This also inhibits some optimizations. If those functions do not have side effects in the original code, they should be removed from the ‘for’ declaration.  In this case, assuming the function “components” returns the same result throughout the execution of the loop, it can be rewritten as follows:
 
int bound = components();
for(in c = 0 ; c < bound ; c++) { …
 
I did not try either suggestion myself; however, they indicated both of the mentioned changes to the source code should work with the current 15.0 compiler.
 
I will keep you updated on the final fix as learn more about that.

0 Kudos
Simon_H_2
Beginner
557 Views

Hi Kevin,

You suggestions are a bit unclear. In the beginning you are talking about issues, but later only about best practices and things preventing optimization. Is there something wrong with my code, or is it a bug in icc?

Regarding your first suggestion of copying a variable into a local one. Are you suggesting to do this for every single variable that is used inside any loop? What if it is a larger type, such as a struct or a class?

Regarding the second suggestion, I don't see why this is related to the OpenMP issue. Of course you are right that there are cases where the loop bound can be made fixed, but in general it is not. If you would be talking about the loop over "i" I might understand that there can be issues. But this is a nested loop, so why should it be able to influence the OpenMP behavior?

As it stands, this seems to leave thousands of potential issues in many code-bases. If this is really an issue with my code, can you point me to some reference regarding where I violate the standard?

Thanks, Simon

0 Kudos
Reply