We developed a finite-difference/pseudospectral CFD model and parallelized it with OpenMP. During development the primitive-variable arrays were sized 64x64x150; now we want to use larger arrays. I compared the sequential and parallel results at 64x64x150, 128x128x150, and 256x256x150, and they were consistent. The sequential and parallel versions share the same code base and are simply compiled with or without the -openmp flag.
The problem is that when I run the model at 512x512x150, I have to add the -mcmodel=medium -shared-intel flags, and I get a segmentation fault. I tried to diagnose it following this article (http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/), but that did not help. The sequential 512x512x150 run seems to work (I only ran 100 time steps), but the parallel 512x512x150 run does not complete even one time step. The model calls FFT routines from MKL; I do not know whether those calls cause the problem. Here are my compiler version and compile command:
compiler version: l_cprof_p_11.1.046
compiling command:
ifort -r8 -mcmodel=medium -shared-intel $(my source codes) $(MKL_LIB)
$(MKL_LIB) = -L${LIBRARY_PATH} -Wl,--start-group -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread
The attachment is our model. The array size is set by nnx, nny, and nnz in const_mod.f90; type make to build it. The number of time steps is set by nt in par.txt.
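Before suspecting MKL, it may also help to check the limits the program inherits from the shell, since a too-small stack or virtual-memory limit produces exactly this kind of SIGSEGV at large array sizes. A minimal check:

```shell
# Limits the runtime inherits from the launching shell:
ulimit -s   # main-thread stack limit (in kB, or "unlimited")
ulimit -v   # virtual-memory limit
```

If `ulimit -s` reports a small value, raising it (or recompiling with heap arrays) is worth trying before anything else.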
1 Solution
Have you considered the thread stack size requirement? The default limit should be 4 MB on x86_64, so you may need to set the KMP_STACKSIZE environment variable or use the corresponding function call.
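As a sketch of the environment-variable route (the 1g value is just an example; size it to your actual per-thread needs, and note OMP_STACKSIZE is the standard-OpenMP equivalent in newer runtimes):

```shell
# Per-worker-thread stack size for Intel's OpenMP runtime
export KMP_STACKSIZE=1g
echo "KMP_STACKSIZE=$KMP_STACKSIZE"
# ./model   # then launch the program as usual (executable name hypothetical)
```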
6 Replies
Have you considered the thread stack size requirement? The default limit should be 4 MB on x86_64, so you may need to set the KMP_STACKSIZE environment variable or use the corresponding function call.
After I export KMP_STACKSIZE=1g, it works! The problem was insufficient stack memory per thread, which is limited by KMP_STACKSIZE. Thank you!
Quoting - shiming.chen
After I export KMP_STACKSIZE=1g, it works! The problem was insufficient stack memory per thread, which is limited by KMP_STACKSIZE. Thank you!
Hi,
It is unusual to need such a large KMP_STACKSIZE. Are you sure the model will not run with a smaller setting?
Since most OpenMP programs share work, most large variables are usually declared shared and are not allocated on the individual thread stacks.
I would suggest looking carefully at where large variables are declared private or firstprivate and checking whether they can be declared shared instead; then they will not need to be allocated on each thread's stack.
regards
Mike
As the 128x128x... case worked with the default 4 MB thread stack limit, 256x256x... should run with a 16 MB limit. I agree with Mike about the desirability of avoiding excessive thread-local data, if the shared arrays can be defined so that OpenMP partitions them effectively.
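For a rough sense of scale, here is the per-array memory at each grid size discussed in this thread, assuming one double-precision array (the -r8 flag makes default REALs 8 bytes):

```shell
# Bytes for one 8-byte-real array of n x n x 150 at each tested grid size
for n in 64 128 256 512; do
  echo "$n x $n x 150: $(( n * n * 150 * 8 / 1024 / 1024 )) MiB"
done
```

The 512x512x150 case needs about 300 MiB per such array, so even one private copy per thread blows far past a 4 MB default stack.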
Is there a performance impact from using too much thread-private stack? I read in a book that OpenMP programs can perform better with more private variables because of cache effects, so I have been trying to use private variables rather than shared ones.
When each thread allocates large transient temporaries from (large) stack space, the virtual-memory footprint grows. When threads allocate them from heap space instead, the footprint can be smaller, provided only a few threads need those large temporaries at the same time. When your virtual-memory footprint exceeds physical memory, the application starts paging, and paging overhead is several orders of magnitude greater than heap-allocation overhead.
Your actual circumstances will determine whether heap or stack is more effective.
Note that using a scalable allocator significantly reduces the heap-allocation overhead.
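One way to try a scalable allocator without relinking is Intel TBB's malloc proxy, which intercepts malloc/free at load time; a hypothetical invocation (`./model` stands in for the program's executable):

```shell
# TBB's proxy library; preloading it routes C-runtime allocations through tbbmalloc
preload=libtbbmalloc_proxy.so.2
echo "would run: LD_PRELOAD=$preload ./model"
```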
Jim Dempsey