Segfault for array with both firstprivate and lastprivate clauses

Van_Veen__Lennaert · 02-26-2016

In a computational code that manipulates arrays, I want to have OMP sections. One section will write an array to disk, while another section sorts it. To avoid a race condition, I give the array the firstprivate and lastprivate clause, expecting that the threads that execute the sections will both copy the array into a private variable and at the end the thread that sorts the array copies the result into the global, shared variable. This results in a segfault when I use ifort, but not when I use gfortran.

As far as I could find, there is no restriction in the OMP specs on the use of both clauses on one array. Please clarify what the issue is and how to prevent it. I have attached a very short program that segfaults when compiled with "ifort -openmp ...". Valgrind and gdb flag "invalid read" and "invalid write" at the lines !$omp sections and !$omp end sections.
This happens on Ubuntu 14.04 with ifort 12.1.0.

Thanks, Lennaert.
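
The attachment is not reproduced in this thread. A minimal sketch of the pattern described above might look like the following; the program name, the array size of 800, the output file name, and the insertion sort are all assumptions made for illustration, not the original code.

```fortran
! Hypothetical reconstruction of the pattern described in the post.
! Names, sizes, and the sort routine are assumptions.
program fp_lp_sections
  implicit none
  integer, parameter :: n = 800
  double precision, allocatable :: a(:)
  integer :: i

  allocate(a(n))
  do i = 1, n
     a(i) = dble(n - i + 1)          ! descending data, so sorting changes it
  end do

  !$omp parallel
  !$omp sections firstprivate(a) lastprivate(a)
  !$omp section
  ! Writer: dumps the private copy held by the executing thread.
  open(unit=10, file='unsorted.dat', status='replace', action='write')
  write(10, '(es24.16)') a
  close(10)
  !$omp section
  ! Sorter: lexically last section, so lastprivate copies its result back.
  call insertion_sort(a)
  !$omp end sections
  !$omp end parallel

  if (any(a(2:n) < a(1:n-1))) error stop 'array is not sorted'
  print *, 'sorted: a(1) =', a(1), ' a(n) =', a(n)

contains

  subroutine insertion_sort(x)
    double precision, intent(inout) :: x(:)
    double precision :: key
    integer :: j, k
    do j = 2, size(x)
       key = x(j)
       k = j - 1
       do while (k >= 1)
          if (x(k) <= key) exit      ! tested inside the loop to avoid reading x(0)
          x(k+1) = x(k)
          k = k - 1
       end do
       x(k+1) = key
    end do
  end subroutine insertion_sort

end program fp_lp_sections
```

Building this with "gfortran -fopenmp" or "ifort -openmp" should exercise the same combination of clauses on an allocatable array.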

TimP · 02-26-2016

This would be more topical on the forum section for Intel Fortran on Linux.

There is also a post on software.intel.com about diagnosing segfaults with Intel compilers. My first guess would be that you have exhausted stack space and will need to try OMP_STACKSIZE and the shell stack limit. In my experience, it's worth some effort to avoid private arrays.

Van_Veen__Lennaert · 02-26-2016

If my issue is more relevant to the ifort/Linux thread, perhaps a moderator can move it?

Thanks for your comments. I tried reducing the size of the array, and when it is very small (less than 500) the error disappears.
For larger arrays I can get a hangup (the program must be killed from a command line), a seg fault or this:

test.x: pthread_create.c:409: start_thread: Assertion `freesize < pd->stackblock_size' failed.

so that does sound like there is insufficient memory for the private copies of the arrays. What puzzles me, though, is that the extra space I request is small: two threads will each create an array of, say, 800 doubles, which is about 12 KB in total. I ran "ulimit -s unlimited" and set OMP_STACKSIZE to 1M, but that does not change the outcome at all. What hidden limit do I exceed here? It seems very low.

In the actual code, there is a trade-off between memory (extra private variables) and speed-up, so there is a good reason to try getting this to work.

Andrey_C_Intel1 · 02-29-2016

This is a compiler bug. As a workaround you can replace the allocatable array with an explicit-shape array (if possible). I don't know of any other workaround.

I have submitted a bug report to our compiler developers and hope they will fix it soon. I am also moving the post to the Fortran compiler forum, just in case.

Regards,
Andrey
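
A rough sketch of that workaround, assuming the array size is known at compile time. The program name and the size are hypothetical, and the array reversal is only a stand-in for a real sort (it sorts this particular descending test data).

```fortran
! Sketch of the explicit-shape workaround; names and sizes are assumptions.
program fp_lp_explicit_shape
  implicit none
  integer, parameter :: n = 800
  double precision :: a(n)           ! explicit shape instead of allocatable
  integer :: i

  a = [(dble(n - i + 1), i = 1, n)]  ! descending test data

  !$omp parallel
  !$omp sections firstprivate(a) lastprivate(a)
  !$omp section
  ! Writer stand-in: only reads the private copy.
  print *, 'writer section: sum(a) =', sum(a)
  !$omp section
  ! Sorter stand-in: reversing descending data yields ascending order.
  a = a(n:1:-1)
  !$omp end sections
  !$omp end parallel

  ! lastprivate copies the lexically last section's array back to the original.
  if (any(a(2:n) < a(1:n-1))) error stop 'array is not sorted'
  print *, 'after sections: a(1) =', a(1)
end program fp_lp_explicit_shape
```

The clauses are unchanged; only the declaration of the array differs from the allocatable version.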

Steven_L_Intel1 · 02-29-2016

This sounds a lot like the issue discussed in https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/611023

Andrey_C_Intel1 · 03-01-2016

>>This sounds a lot like the issue discussed in https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux...

No, actually. The link you mentioned looks like a stack size problem (the GNU OpenMP runtime allocates more stack for its threads by default than the Intel runtime currently does).

This problem is specific to firstprivate+lastprivate on an allocatable array, which causes memory corruption regardless of the stack size.

Regards,
Andrey

TimP · 03-01-2016

The Intel libiomp5 default for OMP_STACKSIZE on x86_64 remains at 4M, as far as I know. Setting it down to 1M might be useful for a large number of threads each using little stack, to avoid consuming much shell stack.

I haven't been able to easily find a reference on what libgomp sets nowadays in the absence of OMP_STACKSIZE.

Van_Veen__Lennaert · 03-01-2016

As I mentioned, setting OMP_STACKSIZE does not seem to change anything. I tried setting it to 1MB just to make sure it was much larger than the amount I requested through the firstprivate/lastprivate clauses. Maybe the compiler developers can confirm this is a bug.

In the meantime, I have worked around the issue by allocating an extra array on the shared stack before entering the omp parallel region. I then make a copy of the array that had the firstprivate/lastprivate clauses before entering the omp sections. The extra memory I use now should be regulated by 'ulimit' instead of per-thread memory. Not elegant, but it works.

As before, thank you for the quick and accurate assessment.
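
One way the workaround described above could look, as a sketch only. The name a_copy, the use of "omp single" for the copy, the output file name, and the reversal used in place of a real sort are assumptions, not the actual course code.

```fortran
! Sketch of a shared-copy workaround; a_copy and the file name are hypothetical.
program shared_copy_workaround
  implicit none
  integer, parameter :: n = 800
  double precision, allocatable :: a(:), a_copy(:)
  integer :: i

  ! Both arrays are allocated before the parallel region and stay shared.
  allocate(a(n), a_copy(n))
  a = [(dble(n - i + 1), i = 1, n)]  ! descending test data

  !$omp parallel
  !$omp single
  a_copy = a                         ! one thread copies; end single is a barrier
  !$omp end single
  !$omp sections
  !$omp section
  ! Writer works on the shared copy, which nobody modifies.
  open(unit=10, file='unsorted.dat', status='replace', action='write')
  write(10, '(es24.16)') a_copy
  close(10)
  !$omp section
  ! Sorter works on the original in place (reversal stands in for a sort).
  a = a(n:1:-1)
  !$omp end sections
  !$omp end parallel

  if (any(a(2:n) < a(1:n-1))) error stop 'array is not sorted'
  if (a_copy(1) /= dble(n)) error stop 'copy was modified'
  print *, 'sorted a(1) =', a(1), ' unsorted copy a_copy(1) =', a_copy(1)
end program shared_copy_workaround
```

Because no array appears in a firstprivate or lastprivate clause, this version avoids the clause combination that triggers the problem.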

jimdempseyatthecove · 03-02-2016

>>In a computational code that manipulates arrays, I want to have OMP sections. One section will write an array to disk, while another section sorts it. To avoid a race condition, I give the array the firstprivate and lastprivate clause, expecting that the threads that execute the sections will both copy the array into a private variable and at the end the thread that sorts the array copies the result into the global, shared variable.

From the above description and the sketch code provided above it appears that:

a) outside the parallel region you have an unsorted array (in your global array)
b) You enter a parallel region where (in sections) you make 2 copies of the unsorted array
c) One thread writes the (copy of the) unsorted array to the disk
d) Another thread sorts the (different copy of the) unsorted array
e) Assuming the sort is performed in the last section, the sorted array is written back to the global array (else the last section's, and unsorted, array is written back to the global array).

Unstated is whether your intentions were as above, or whether you intended to write the sorted array to the file, and to do so as the sorted results are produced.

If your intention was to write the unsorted array, then you have 3 copies of the array. In this case it would be more beneficial (memory conservative) to have the writer section write the global array in place, and have the sort section allocate a temp array, make a copy, sort the copy, then copy it back to the global after the file write completes (or after sections are written). Or better yet, use the global array as input (first pass) then switch to temp array. IOW sorted output is different array than input.

If your intention is to write the sorted array as it sorts, then the selection of sort method would be critical. An in-place Quicksort would be a poor choice as you couldn't begin writing until after the sort completes. A different sort technique will provide better concurrency of writes while sorting.

Jim Dempsey
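
A sketch of the first variant suggested above (the writer reads the global array in place, while the sorter works on a temporary and copies back afterwards). The name tmp, the file name, and the reversal used as a stand-in for a real sort are assumptions for illustration.

```fortran
! Sketch: write the global array in place, sort a temporary copy.
! tmp and the file name are hypothetical.
program sort_temp_copy
  implicit none
  integer, parameter :: n = 800
  double precision, allocatable :: a(:), tmp(:)
  integer :: i

  allocate(a(n))
  a = [(dble(n - i + 1), i = 1, n)]  ! descending test data

  !$omp parallel
  !$omp sections
  !$omp section
  ! Writer: reads the shared array directly, no extra copy.
  open(unit=10, file='unsorted.dat', status='replace', action='write')
  write(10, '(es24.16)') a
  close(10)
  !$omp section
  ! Sorter: the only extra copy; a is only read here, so there is no race.
  allocate(tmp(size(a)))
  tmp = a
  tmp = tmp(n:1:-1)                  ! stand-in for a real sort
  !$omp end sections
  !$omp end parallel

  ! Both sections have finished, so the copy back cannot disturb the writer.
  a = tmp
  deallocate(tmp)

  if (any(a(2:n) < a(1:n-1))) error stop 'array is not sorted'
  print *, 'sorted: a(1) =', a(1), ' a(n) =', a(n)
end program sort_temp_copy
```

This keeps two copies of the data in memory instead of three.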

jimdempseyatthecove · 03-02-2016

Also note that the sorted output can be "returned" to the global array faster using MOVE_ALLOC.

Jim Dempsey
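
A small sketch of what that could look like, assuming both arrays are allocatable. The names a and tmp and the four-element data are hypothetical; tmp simply stands in for an already sorted result.

```fortran
! Sketch: hand the sorted temporary over to the global array with MOVE_ALLOC.
program move_alloc_demo
  implicit none
  double precision, allocatable :: a(:), tmp(:)

  allocate(a(4), tmp(4))
  a   = [4.0d0, 3.0d0, 2.0d0, 1.0d0] ! unsorted global array
  tmp = [1.0d0, 2.0d0, 3.0d0, 4.0d0] ! stands in for the sorted copy

  ! a's old storage is released and a takes over tmp's allocation;
  ! no element-by-element copy is performed, and tmp ends up deallocated.
  call move_alloc(from=tmp, to=a)

  if (allocated(tmp)) error stop 'tmp should be deallocated'
  if (any(a(2:) < a(:size(a)-1))) error stop 'array is not sorted'
  print *, a
end program move_alloc_demo
```

Compared with an assignment such as a = tmp, this avoids copying the data and frees the temporary in the same step.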

Van_Veen__Lennaert · 03-03-2016

Thanks, Jim. My intention was to have the code execute a-e as you outlined. The order of the writing is not important. Indeed, it would be more memory efficient to have only one extra copy of the array, owned by the sorting thread.
I was programming this code in a graduate course to demonstrate the options OMP offers to make variables private in sections. Therefore, I was not necessarily trying to minimize memory usage or optimize efficiency. We were surprised to see the seg fault come up in a code that follows the OMP specifications. I suppose this taught me and the students a different but equally important lesson.

jimdempseyatthecove · 03-04-2016

Since you are a teacher, I suggest that you explicitly include seg fault examples into your curricula. It is an excellent learning experience, well actually the part of recognizing what it is and how to work around it. If you peruse this forum, you'd be surprised at the number of people that do not know what a seg fault is, nor how to fathom the cause, and then generate a coding work around.

While you are at it, ask your students to explain what "unlimited stack" means in a multi-threaded application.

Jim Dempsey

Van_Veen__Lennaert · 03-11-2016

Thanks for your suggestion, and do not worry. My students see plenty of seg faults in this course. In fact, I got quite a few wry smiles when I showed your comment in class. Thanks also for the ongoing support.

jimdempseyatthecove · 03-11-2016

What was their answer to the "unlimited stack" question?

Jim Dempsey