Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Stack Size and OpenMP

Martin_D_1
Beginner
1,945 Views

Dear All,

There are various threads here, as well as in other forums, dealing with segmentation faults related to stack size and OpenMP. Unfortunately, the information is (at least to my understanding) not consistent, and because some threads are rather old, it might also be outdated. My primary source is https://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/, where the recommendation for OpenMP applications is to remove the operating system's stack size limit via

ulimit -s unlimited

or to set it to a high value, rather than giving the compiler a threshold (via -heap-arrays) for the size of arrays it may keep on the stack (i.e., all arrays known at compile time to be larger are put on the heap). Unfortunately, another thread linked from the aforementioned article (https://software.intel.com/en-us/articles/intel-fortran-compiler-increased-stack-usage-of-80-or-higher-compilers-causes-segmentation-fault) recommends using "-heap-arrays" regardless of whether the application uses OpenMP. Also, https://software.intel.com/en-us/forums/topic/501500#comment-1779157 recommends not setting the stack size to unlimited.
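For reference, the two remedies can be contrasted in a short shell sketch. The compile line and the sizes are assumptions for illustration, not recommendations; note that `ulimit -s` governs only the main thread, while OpenMP worker threads take their stack size from OMP_STACKSIZE:

```shell
# Option A: lift the OS stack limit before launching the OpenMP binary.
ulimit -s unlimited 2>/dev/null || true
export OMP_STACKSIZE=512M          # per-worker stack size (assumed value)

# Option B: keep the default limit and instead move large arrays to the
# heap at compile time (hypothetical compile line):
#   ifort -qopenmp -heap-arrays 8 mesh.f90 -o mesh

echo "main-thread stack limit: $(ulimit -s)"
echo "worker stack size: ${OMP_STACKSIZE}"
```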

I tried setting the stack size to a larger value (the default on Ubuntu is 8 MiB), but this sometimes causes problems with other software that uses threads. It is also hard for me to estimate the required stack size.

A critical part of my code looks like this:

function mesh_build_cellnodes(nodes,Ncellnodes)

 implicit none
 integer(pInt),                         intent(in) :: Ncellnodes
 real(pReal), dimension(3,mesh_Nnodes), intent(in) :: nodes
 real(pReal), dimension(3,Ncellnodes) :: mesh_build_cellnodes

 integer(pInt) :: &
   e,t,n,m, &
   localCellnodeID
 real(pReal), dimension(3) :: &
   myCoords

 mesh_build_cellnodes = 0.0_pReal
!$OMP PARALLEL DO PRIVATE(e,localCellnodeID,t,myCoords)
 do n = 1_pInt,Ncellnodes
   e = mesh_cellnodeParent(1,n)
   localCellnodeID = mesh_cellnodeParent(2,n)
   t = mesh_element(2,e)
   myCoords = 0.0_pReal
   do m = 1_pInt,FE_Nnodes(t)
     myCoords = myCoords + nodes(1:3,mesh_element(4_pInt+m,e)) &
                         * FE_cellnodeParentnodeWeights(m,localCellnodeID,t)
   enddo
   mesh_build_cellnodes(1:3,n) = myCoords / sum(FE_cellnodeParentnodeWeights(:,localCellnodeID,t))
 enddo
!$OMP END PARALLEL DO

end function mesh_build_cellnodes

where "mesh_Nnodes" can be in the range of 1,000 to several million. According to https://software.intel.com/en-us/forums/topic/301590#comment-1524955, I understand that the "-heap-arrays" option will force the compiler to allocate "nodes" on the heap rather than the stack, independently of the value given, because its size is not known at compile time. A possible solution is to give

-heap-arrays 8

as a compiler option, since this seems to be a reasonable value given the stack size. In fact, on Ubuntu the default is

ulimit -s 8192

My application runs fine with that (but crashes when I do not use the compiler option). The other possibility is still to remove the stack size limit (and omit the compiler option).
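As a sanity check on the units involved (an illustrative sketch): both `ulimit -s` and the `-heap-arrays` threshold are expressed in KiB, so 8192 from `ulimit` is 8 MiB of stack, while `-heap-arrays 8` is an 8 KiB threshold:

```shell
# `ulimit -s` reports the soft stack limit in KiB (8192 KiB = 8 MiB on a
# default Ubuntu shell); `-heap-arrays n` takes its threshold n in KiB too.
soft=$(ulimit -s)
echo "soft stack limit: ${soft} KiB"
echo "heap-arrays threshold: 8 KiB (larger arrays of known size go to the heap)"
```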

Since removing the stack size limit (and omitting the -heap-arrays option) significantly improves performance at slightly higher memory consumption (according to the time command; see the attached file), I would rather use this approach.

Therefore, my question is whether this approach has any disadvantages for an application running with 1 to 32 threads.

Thanks in advance for your contributions and apologies for raising this question again.

4 Replies
jimdempseyatthecove
Honored Contributor III

Ask yourself: What does unlimited stack mean for a single threaded application with a fixed, and finite, virtual memory address space?

How might you implement this without knowing how much virtual address space to reserve for the heap?

One possibility is:

Half (or large portion) of address space is reserved for system (kernel) code.

From the remainder, subtract the code, static data, and initialized data, then take (reserve) half for the heap and the other half for the unlimited stack.

That takes care of the main thread.

Now your program initializes the OpenMP thread pool. While these threads can share the reserved heap, where does their "unlimited" stack come from?

One option is for the runtime system to look at the lowest address ever touched by the main thread and take the size between that and the lowest possible address that the prior (master) thread could have used. IOW steal half the unused "unlimited" stack of the first thread. Half of the whole is better than none.

Now then, ask yourself: what happens for the next thread? Does it get half of the 2nd thread's unused address space? With 32 threads, you might end up with the last thread obtaining 1/4,000,000,000 the stack of the first thread. IOW, not workable.

An alternative is, at the start of the OpenMP thread pool, for the OpenMP runtime system to look at the number of threads it will create and then evenly divvy up the unused portion of the main thread's stack among all initial threads. For many situations this may be suitable. But then how do you expect to handle nested OpenMP? And what about non-OpenMP threads?
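Explicit per-thread sizing sidesteps the divvy-up problem entirely. A sketch with assumed, illustrative values (not recommendations):

```shell
# Give the main thread and each OpenMP worker an explicit, known stack size
# instead of relying on "unlimited" (all numbers are illustrative):
export OMP_NUM_THREADS=32
export OMP_STACKSIZE=16M              # 16 MiB per worker thread
ulimit -s 65532 2>/dev/null || true   # main thread: ~64 MiB, if permitted

# Upper bound on total worker stack reservation:
echo "workers reserve up to $(( OMP_NUM_THREADS * 16 )) MiB of stack"
```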

You should be able to see by now that "unlimited" is not a good choice. You, as the developer, should know how much stack space to provide. Using the heap for the very large arrays is still recommended.

Jim Dempsey

Steven_L_Intel1
Employee

"unlimited" really means "set it to the limit defined in the kernel".
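This is visible from the shell itself (a small sketch, assuming bash): the soft limit can only be raised as far as the hard cap, which `ulimit -H` reports:

```shell
# Raise the soft stack limit as far as the hard cap allows, then show both:
ulimit -s "$(ulimit -Hs)" 2>/dev/null || true
echo "soft stack limit: $(ulimit -Ss)"
echo "hard stack limit: $(ulimit -Hs)"
```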

jimdempseyatthecove
Honored Contributor III

Bad choice of words. "Maximum" would have been a better choice.

Jim

Steven_L_Intel1
Employee

Well, it's Linux. You're lucky it's a word at all.
