Intel® Fortran Compiler

Code with OpenMP works on Windows but not on Linux

yang_l_3
Beginner
789 Views

Dear all,

I am struggling with my code on Linux and need your help. My code works fine with OpenMP in Visual Studio with Intel Fortran 15, after I increased the stack size. However, it does not work on the cluster (Linux).

The error says: "forrtl: severe (174): SIGSEGV, segmentation fault occurred"

I have tried ulimit and -heap-arrays, but neither helps. I added the -g -traceback options, but I didn't get much information except "KMP_STACKSIZE: overrides OMP_STACKSIZE specified before" (and yes, I also tried changing KMP_STACKSIZE and OMP_STACKSIZE). I am pretty sure that my stack size limit on Linux (200 GB) is much larger than on my own computer, and I have fixed the number of threads at 6 on both Windows and Linux. Where should I start debugging?

Many thanks

0 Kudos
8 Replies
TimP
Honored Contributor III

The default OMP_STACKSIZE is the same on Linux as on Windows, unless you drop back to 32-bit mode. Your stack limit (ulimit -s) must accommodate OMP_STACKSIZE times the number of threads, as well as the stack you would use in serial mode. I've never heard of OMP_STACKSIZE needing to increase beyond 40 MB, nor should more than a slight increase over 4 MB be needed if that works on Windows.

yang_l_3
Beginner

In Visual Studio, I set Linker > System > Stack Reserve Size to 200 MB (with less than 100 MB I still get a stack overflow).

Steven_L_Intel1
Employee

Also, a segfault is not necessarily a stack overflow.

yang_l_3
Beginner

Thank you Tim and Steve for the hints. I use default(private) with shared(...), and I double-checked that it works fine under Windows with Intel Fortran 15. It does not work on the cluster with ifort 11.0, where it fails with "severe (174)". I looked for large array variables declared private, but changing them did not help. Also, if I parallelize a smaller loop, or use default(shared) on Linux, the code runs (although of course with wrong results in the shared case). I feel confused...

jimdempseyatthecove
Honored Contributor III

Try this:

Preparation: on your Windows system, where the program works in a debug build, add a harmless parallel region at the start of the program if it does not already have one (e.g. a simple call to omp_get_num_threads() whose result you discard). Also add an !$omp barrier at the top of the (suspected) problem parallel region, and build for debugging. Make sure you build with interface checking and all runtime checks enabled. Bear in mind that the fact that the program runs on Windows does not mean it is free of errors (an out-of-bounds index or an argument type mismatch will not necessarily cause a crash).
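The preparation step might look roughly like the sketch below. It is only an illustration: the program name, the array a(100) and the loop bounds are placeholders rather than the original poster's code, and it assumes compilation with OpenMP enabled so that the omp_lib module is available.

```fortran
program warmup_sketch
   use omp_lib                    ! provides omp_get_num_threads
   implicit none
   integer :: nthreads            ! discarded; only forces thread-pool creation
   integer :: i
   double precision :: a(100)     ! placeholder for the real private work array

   ! Harmless first parallel region: its only job is to create the thread pool,
   ! so a breakpoint after it shows the footprint with all threads started.
   !$omp parallel default(none) shared(nthreads)
   !$omp single
   nthreads = omp_get_num_threads()
   !$omp end single
   !$omp end parallel

   ! Stand-in for the suspected problem region. The barrier at the top gives
   ! a place to break once every thread has its private copies set up.
   !$omp parallel default(none) private(i, a)
   !$omp barrier
   !$omp do
   do i = 1, 10
      a = 1.d0
   end do
   !$omp end do
   !$omp end parallel
end program warmup_sketch
```

Breakpoints can then go before the first region, after it, before the second region, and just after the barrier, matching the measurement points described in the steps that follow.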

First, use the debugger and place a break point before the first parallel region. Start and run to this pre-first-parallel region. Then use the Windows Task Manager to examine the application's memory usage. Write this down (e.g. jot into a Notepad window). This tells us the base footprint size of the application.

Next, place a break point after the first parallel region, continue to it, and get the memory footprint. Running through this region establishes the OpenMP thread pool, so the figure tells us the total usage after initial thread pool creation (with the initial stack size). The change in memory footprint divided by the number of threads gives the incremental (per-thread) _initial_ memory requirement.

Next, place a break point on a statement before the (suspected) problem parallel region (note: on earlier versions of IVF a break point could not be reliably placed on the statement immediately before a parallel region; if necessary, go one statement earlier). Get the memory footprint here. This number gives the memory requirement just before the problem parallel region, in other words after any potential allocations.

Next, place a break point after the newly inserted !$omp barrier inside the problem region. Get the memory requirements of the application. The change in memory requirements from the prior memory requirement, divided by the thread count, will tell us the per-thread private data requirements.

Next, place a break point after this problem region and see if the program completes the region. If the problem parallel region is a loop, check the memory footprint for a few iterations to see if there is a creep in memory size, and note the size change per iteration. You can then remove the break point after the barrier and continue to the end of the parallel region. Get the memory size and write it down.

Lastly, continue, to verify that the program runs to completion. If it loops and returns to the break point before the problem parallel region, record the memory requirements before and after the region. Do this for a few iterations to check whether the memory requirements keep increasing. If necessary, remove those break points, place one just before the program end, run to completion, and get the memory requirements.

Save the log of memory requirements (if need be print it out for comparison with your Linux run).

On your Linux system, compile the program (add the harmless parallel region and barrier if necessary). Insert the same break points. Make a run creating a log of memory requirements. Use whatever Linux tool you prefer to get the process size of your application.

With some luck, a discrepancy will show up and the program will crash at some point. The memory footprint discrepancies and crash point will give us some insight as to what is going on.

As an aside, it is usually not good practice to use default(private), as this often results in unnecessary and/or errant memory allocations. The best practice is default(none), with each variable then listed explicitly in the appropriate clause.
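As a minimal sketch of what default(none) looks like in practice (the names n, m, a and total are made up for illustration and are not from the code under discussion):

```fortran
program default_none_sketch
   implicit none
   integer, parameter :: n = 100, m = 8   ! named constants need no clause
   integer :: i, j
   double precision :: a(n)               ! scratch array, one copy per thread
   double precision :: total(m)           ! one result slot per outer iteration

   ! With default(none) the compiler rejects any variable used in the region
   ! that is not listed in a clause, so a forgotten variable is a compile-time
   ! error rather than a silent race or an unexpected private copy.
   !$omp parallel do default(none) shared(total) private(i, j, a)
   do i = 1, m
      a = 1.d0
      do j = 1, n
         a(j) = a(j) * dble(i)
      end do
      total(i) = sum(a)                   ! each iteration writes its own element
   end do
   !$omp end parallel do

   print *, total
end program default_none_sketch
```

Here the only shared variable is total, and each iteration writes a distinct element of it, so the threads do not interfere with each other.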

Jim Dempsey

yang_l_3
Beginner

Thank you Jim for the detailed instructions. I followed them and now I suspect that it is indeed a stack overflow problem.

Here is what I did:

First, I used default(none) with proper shared(..) and private(..) clauses, which runs properly on Windows. The parallel loop is something like this:

do i
    print *, "flag"
    a = 1.d0
    print *, a(2)
    pause
    do j
        [.....]
    end do
end do

Let's assume variable a has length 100. This works fine on Windows, but on Linux it gives the severe (174) error: "flag" is shown on the screen but a(2) is not printed. When I changed a from private to shared, the code ran as far as the pause, and the severe (174) error came up after I pressed Enter. Because I suspected the allocation behavior, I also tried another thing: I declared all the private variables except the loop counters as firstprivate. Then there is no severe (174) error, although the results are very different from those of the same code on Windows. Furthermore, Linux gives me different results on every run, while Windows always gives me the same ones.

It may sound stupid, but the code with OpenMP was running on Linux last night. Then I wanted to move the parallel DO to an outer loop. After I changed the shared/private attributes of some variables, it still had the severe (174) problem. Then I made some variables allocatable and maybe changed some other small things. Now I have switched back to the initial parallelization and it no longer works. :(

yang_l_3
Beginner

Thanks for all your help! I found one main difference between my Windows ifort and the Linux ifort:

In my parallel loops, I have an allocatable variable that may suffer from a subtle race problem. This problem is prevented automatically by ifort 15 on Windows, but not by ifort 11 on Linux (I guess there was an update from OpenMP v3.0 to v4.0). After I changed that variable from allocatable to a declaration with explicit dimensions, the problem was solved. I have very little knowledge of virtual memory, but I guess there is a difference.

jimdempseyatthecove
Honored Contributor III

If the array a(:) "lives" entirely within the parallel region, then when using the older Intel compiler use something like

real, allocatable :: a(:)
...
! a is not allocated at this point
!$omp parallel firstprivate(a) ... ! other clauses here
! firstprivate copies the unallocated array descriptor to the local thread context (on stack)
allocate(a(yourSizeHere))
!$omp do
do ...
...
end do
!$omp end do
...
deallocate(a)
!$omp end parallel

The firstprivate(a) was required on earlier Intel compilers: private(a) would have instantiated the array descriptor but would not have initialized it to be empty. This is fixed in the newer compilers.

Jim Dempsey
