When I compile the following small program with -O2 or -O3 and -parallel and run it with OMP_NUM_THREADS>1, it crashes with signal 11, either immediately (-O2) or when trying to print res (-O3). Running the parallel binary with OMP_NUM_THREADS=1 works. When the last printf statement is commented out, the program runs with more than one thread if compiled with -O3 -parallel.
I first observed this with Compiler XE for applications running on Intel(R) 64, Version 13.1.2.183 Build 20130514. Older versions back to 10.1 show the same or similar problems. C++ Compiler for Intel(R) EM64T-based applications, Version 9.1 Build 20060925, is the last version that produces a working parallel binary. All compilers were run under Linux SLES11SP3.
#include <stdio.h>
#define ARDIM 2000
int main (int argc, char **argv) {
double a[ARDIM][ARDIM], b[ARDIM][ARDIM], c[ARDIM][ARDIM];
double di=0.0,dj=0.0,res=0.0 ;
int i,j,k;
for (i=0;i<ARDIM;i++) {
di+=1.0e0;
for (j=0;j<ARDIM;j++) {
dj+=1.0e0;
a
b
c
}
dj=0.0;
}
for (i=0;i<ARDIM;i++) {
for (k=0;k<ARDIM;k++) {
for (j=0;j<ARDIM;j++) {
c
}
}
}
for (i=0;i<ARDIM;i++) {
for (j=0;j<ARDIM;j++) {
res+=c[i][j];
}
}
printf("\n c[1][2] = %f\n",c[1][2]);
/* printf("\n res = %f\n",res); */
}
I don't know why Build 20060925 worked, but your program runs fine with current compilers as long as you increase the main stack size and the thread stack size.
Your perfectly nested loop does get auto-parallelized, and it has large stack requirements: each double array requires 8*2000*2000 = 32M bytes, so with three 32M arrays each thread needs at least 96M for its private stack. 100M works fine for both the main stack and the threads:
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120
Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
$ icc -parallel U509827.cpp -par-report -O3
U509827.cpp(51): (col. 3) remark: LOOP WAS AUTO-PARALLELIZED
$ ulimit -s 100000000
$ export OMP_STACKSIZE=100M
$ ./a.out
c[1][2] = 9.866605
res = 6584316148511.074219
$
Patrick
Thanks, defining OMP_STACKSIZE does help; we use ulimit -s unlimited anyway. The stack size must be set to at least OMP_STACKSIZE*OMP_NUM_THREADS.
However, I do not understand why each thread requires its own copy of the three arrays, and I see a risk of running into the vmem limit on manycore systems.
Axel
The excessive stack size requirement can be avoided by simply moving the array declarations outside of main().
In both cases the memory used (resident set size) is slightly more than the size of the three arrays; only the virtual address space required grows by a factor of OMP_NUM_THREADS. It looks as if the thread-private stack is allocated but (essentially) not used.
