Intel® Fortran Compiler

Ifort 11.1 bug (?) segfault with nested OpenMP and large private arrays

jonathanvincent
Beginner
Seen this on two different computer systems.

Seems to be a compiler issue.

Program that segfaults:
*****************************************************
program tomp
implicit none

integer,external:: omp_get_num_threads
integer,external:: omp_get_thread_num
logical,external :: omp_get_nested

integer nthread1,m1

call omp_set_nested(.true.)

!$omp parallel private (m1) num_threads(4)

call omp_set_nested(.true.)

nthread1 = omp_get_num_threads()
m1 = omp_get_thread_num()

write(*,*) 'outer: Running on nthread1=',nthread1,m1
write(*,*) 'outer: ',omp_get_nested()

call inner(m1)

write(*,*) 'outer: done ',m1

!$omp end parallel
end program tomp

subroutine inner(m1)

integer,parameter :: n=1000
! increasing n to 1000 will give a segmentation fault with ifort


real a(n,n)
integer,external:: omp_get_num_threads
integer,external:: omp_get_thread_num
logical,external:: omp_get_nested

integer m1,i,j,m2,nthread2,k

!$omp parallel private(m2,i,j,k,a) num_threads(3)

call omp_set_nested(.true.)

nthread2 = omp_get_num_threads()
m2 = omp_get_thread_num()

write(*,*) 'inner: Running on nthread2=',nthread2,m2,m1
write(*,*) 'inner: ',omp_get_nested()

a=0.
do k=1,1000
do i=1,n
do j=1,n
a(i,j) = sin(a(i,j))**2.
end do
end do
end do

write(*,*) 'inner: done ',m2
!$omp end parallel

end subroutine inner
*********************************************************

Program that seems to work.

**********************************************************
program tomp
implicit none

integer, parameter :: othreads=4
integer, parameter :: n=1000

integer,external:: omp_get_num_threads
integer,external:: omp_get_thread_num
logical,external :: omp_get_nested

integer nthread1,m1

real :: a(n,n,othreads)

call omp_set_nested(.true.)

!$omp parallel private (m1) num_threads(othreads)

nthread1 = omp_get_num_threads()
m1 = omp_get_thread_num()

write(*,*) 'outer: Running on nthread1=',nthread1,m1
write(*,*) 'outer: ',omp_get_nested()

call inner(m1,n,a(:,:,m1+1))

write(*,*) 'outer: done ',m1

!$omp end parallel
end program tomp

subroutine inner(m1,n,a)

real a(n,n)
integer,external:: omp_get_num_threads
integer,external:: omp_get_thread_num
logical,external:: omp_get_nested

integer m1,i,j,m2,nthread2,k
logical nested

!$omp parallel private(m2,i,j,k,a) num_threads(3)

nthread2 = omp_get_num_threads()
m2 = omp_get_thread_num()
nested = omp_get_nested()

write(*,*) 'inner: Running on nthread2=',nthread2,m2,m1
write(*,*) 'inner: ',omp_get_nested()

a=0.
do k=1,1000
do i=1,n
do j=1,n
a(i,j) = sin(a(i,j))**2.
end do
end do
end do

write(*,*) 'inner: done ',m2
!$omp end parallel

end subroutine inner



5 Replies
jimdempseyatthecove
Honored Contributor III
First, I suggest you use

USE OMP_LIB

to declare the OpenMP library interfaces.
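
A minimal sketch of what that looks like (an illustration of mine, not your code; the program name is arbitrary):

program omp_lib_demo
use omp_lib    ! the module provides the interfaces, so the integer/logical external declarations are not needed
implicit none
call omp_set_nested(.true.)
!$omp parallel num_threads(2)
write(*,*) 'thread',omp_get_thread_num(),'of',omp_get_num_threads(),' nested=',omp_get_nested()
!$omp end parallel
end program omp_lib_demo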

Second, declaring in inner: real a(n,n)
with n as a parameter (=1000) may do one of the following:

allocate a(n,n) as SAVE
allocate a(n,n) on the stack
allocate a(n,n) on the heap

As written, it is ambiguous which will happen.

For OpenMP you would want inner's a to be local to the thread calling inner, IOW .not. SAVE.
To assure .not. SAVE:

recursive subroutine inner(m1)
...
real a(n,n)                  ! RECURSIVE guarantees a is an automatic (stack) array, not SAVEd
.or.
real, allocatable :: a(:,:)
...
allocate(a(n,n))             ! each thread allocates its own copy
...
deallocate(a)
end subroutine inner

However, creating a on the stack will consume 4 MB or 8 MB of stack space per thread. To avoid this, consider enabling heap arrays .or. using the real, allocatable :: a(:,:) technique.
Note, though, that use of the heap-arrays option is not visible in the code itself. The next person supporting your code might not be aware of it, neglect to include the compiler option, and therefore inadvertently introduce a bug into their code design (the code is correct but not performing as expected/required).
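
For reference, heap arrays here refers to a compile-time option, so nothing in the source records it; a sketch of how it might be enabled (my recollection of the ifort spelling, check the documentation):

ifort -openmp -heap-arrays tomp.f90 -o tomp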

Keep in mind that inside the parallel region within inner, m2 is the 0-based team member number of the team established by team member m1 of the thread team calling inner (each calling thread from the outer team becomes member 0 of its own inner team). IOW, assuming all threads are granted, you will have 12 threads,

and write(*,*) 'inner: done ',m2

will write 4 sets of m2 = 0,1,2 (interleaved arbitrarily).

You might need to insert !$OMP critical around your writes.
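
A sketch of that, using the two writes from the inner parallel region:

!$omp critical
write(*,*) 'inner: Running on nthread2=',nthread2,m2,m1
write(*,*) 'inner: ',omp_get_nested()
!$omp end critical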

Jim Dempsey
jonathanvincent
Beginner
Hi,

Thanks for the input, I agree with the use of the module.

I am not quite sure what your point is with the rest of it, though. I guess I could have been clearer: this was intended as a small example. The first one generates a segmentation violation if n is large enough (1000 on our systems), but does not if n is small (around 10). The segmentation fault also happens if, for example, you allocate the array instead.

There seems to be a problem with nested OpenMP, and large private arrays.

If you take the outer omp parallel region out, there is no segfault. If you leave it in but select 1 thread (which should be pretty much equivalent), then there is a segfault.

There does seem to be a problem when you have nested parallel regions where a large array is private in the first parallel region and then also private in the second. (In our case that would give 12 copies of the array and take up about 48 MB of memory.) As I understand it, this is allowed by the standard, but a segmentation fault occurs.

When you have a shared array in the first parallel region and then a private array in the second (which still gives a 1000,1000,12 array in effect), you do not get the segmentation violation at runtime.

So what I was more interested in is: is nesting private variables legal, is there anything else in there which could potentially cause a segmentation violation, or is there a compiler bug?

Jon
jonathanvincent
Beginner
Ok cut the example down even more.

If othreads=1 or n=10, it works fine. For large n with othreads >= 2, we get a segmentation violation.

My understanding is that the end result of othreads=1, ithreads=4 and othreads=2, ithreads=2 should be pretty much the same, except the second one results in a segfault and the first one does not. Interestingly, othreads=4, ithreads=1 also gives a segfault.

I am happy to be shown to be wrong, but it does look like something is not working as it should with the compiled code.

program tomp
use OMP_LIB
implicit none

integer, parameter :: othreads=2
integer, parameter :: ithreads=2
integer, parameter :: n=1000

integer i,j

real :: a(n,n)

call omp_set_nested(.true.)

!$omp parallel private(i,j,a) num_threads(othreads)
!$omp parallel private(i,j,a) num_threads(ithreads)

a=0.
do i=1,n
do j=1,n
a(i,j) = sin(a(i,j))**2.0d0
end do
end do

!$omp end parallel
!$omp end parallel

end program tomp
jimdempseyatthecove
Honored Contributor III
program tomp
use OMP_LIB
implicit none

integer, parameter :: othreads=2
integer, parameter :: ithreads=2
integer, parameter :: n=1000

integer i,j
! declare array descriptor only
real, ALLOCATABLE:: a(:,:)

call omp_set_nested(.true.)

!$omp parallel private(i,j,a) num_threads(othreads)
! here, each outer level thread has private (stack located) unallocated array descriptor
! (less than 100 bytes of stack consumed per thread)
!$omp parallel private(i,j,a) num_threads(ithreads)
! here, each inner level thread has private (stack located) unallocated array descriptor
! (less than 100 bytes of stack consumed per thread)
! now allocate separate/private memory blocks per thread
ALLOCATE(a(n,n))
a=0.
do i=1,n
do j=1,n
a(i,j) = sin(a(i,j))**2.0d0
end do
end do
! each thread deallocates its separate/private memory block
DEALLOCATE(a)
!$omp end parallel
!$omp end parallel

end program tomp
Martyn_C_Intel
Employee
When an application is built with -openmp, local arrays are placed on the stack so that each thread can have a private copy, should that be necessary. Since the default maximum stack size is quite small on many Linux distributions, the maximum often needs to be increased for OpenMP applications to avoid a segfault when it is exceeded. Try ulimit -s unlimited (or limit stacksize unlimited for the C shell).

If the array is actually made private in an OpenMP parallel region, the thread stack size may also need to be increased, either with an environment variable (OMP_STACKSIZE or KMP_STACKSIZE) or with a corresponding RTL call. These result in actual memory allocations and are not upper limits (unlike the shell stack limit), so the values specified should not be arbitrarily large.
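
For example (a sketch; 16M is just a placeholder, to be sized for the private arrays actually in use):

ulimit -s unlimited        # raises the shell limit on the main thread's stack (bash/sh)
export OMP_STACKSIZE=16M   # stack actually allocated for each OpenMP worker thread
./a.out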

See also the following:
http://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/
and
http://software.intel.com/en-us/articles/openmp-option-no-pragmas-causes-segmentation-fault/
and
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/optaps/common/optaps_par_var.htm