Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Segmentation fault with MPI, runs properly with OpenMP

bellman
Beginner
Hi,

I am trying to run a program on a computer cluster. The structure of the program is the following:

PROGRAM something
...
CALL subroutine1(...)
...
END PROGRAM

SUBROUTINE subroutine1(...)
...
DO i=1,n
CALL subroutine2(...)
ENDDO
...
END SUBROUTINE

SUBROUTINE subroutine2(...)
...
CALL subroutine3(...)
CALL subroutine4(...)
...
END SUBROUTINE

The idea is to parallelize the loop that calls subroutine2. The main program essentially just declares the arguments and makes the call to subroutine1. I use two alternatives: on the one hand, I put OpenMP directives around the loop; on the other hand, I put an IF branch around the call (so that each MPI process only handles its own iterations) and use MPI to share the results.
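To make the two alternatives concrete: in the OpenMP case I simply put !$omp parallel do / !$omp end parallel do around the loop, while in the MPI case every process runs the loop but skips the iterations that are not its own, and the results are shared afterwards. Here is a toy, self-contained sketch of the MPI variant (the names, the round-robin split and the mpi_allreduce are only for illustration, not my actual code):

[fortran]program loop_split_sketch
! Toy illustration of splitting a loop over MPI ranks and sharing the results.
use mpi
implicit none
integer, parameter :: n = 8
integer :: i, id, nproc, ierror
real(8) :: results(n)

call mpi_init(ierror)
call mpi_comm_rank(mpi_comm_world, id, ierror)
call mpi_comm_size(mpi_comm_world, nproc, ierror)

results = 0.d0
do i = 1, n
   if (mod(i-1, nproc) == id) then        ! this rank owns iteration i
      results(i) = dummy_subroutine2(i)   ! stands in for the real subroutine2
   end if
end do

! Combine the pieces so that every rank ends up with the full results array.
call mpi_allreduce(MPI_IN_PLACE, results, n, mpi_double_precision, &
                   mpi_sum, mpi_comm_world, ierror)

if (id == 0) print *, results
call mpi_finalize(ierror)

contains

real(8) function dummy_subroutine2(i)
   integer, intent(in) :: i
   dummy_subroutine2 = real(i, 8)**2
end function dummy_subroutine2

end program loop_split_sketch[/fortran]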

In the OpenMP case, I add CALL KMP_SET_STACKSIZE(402653184) at the beginning of the main program and I can run it with 8 threads on an 8-core machine. When I run it on the same machine with MPI (with either 8 processes or 1), it crashes with a segmentation fault (signal 11) exactly at the call to subroutine3. If I comment out subroutine4, it doesn't crash (note that the crash happens at the call to subroutine3, yet it goes away when I comment out subroutine4).
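To illustrate what I mean by a possible stack problem: subroutine3 and subroutine4 declare fairly large local (automatic) arrays, and as far as I understand automatic arrays live on the stack, whereas allocatable arrays go on the heap. A toy sketch of the difference (made-up names and sizes, not my actual routines); if n is made large enough, the first version overflows a small stack while the second keeps working:

[fortran]program stack_vs_heap_sketch
! Toy comparison of an automatic (stack) array versus an allocatable (heap) array.
implicit none
integer, parameter :: n = 500

call big_local_stack(n)
call big_local_heap(n)

contains

subroutine big_local_stack(n)
   integer, intent(in) :: n
   real(8) :: work(n, n)       ! automatic array: roughly n*n*8 bytes on the stack
   work = 0.d0
   print *, 'stack version done ', work(1, 1)
end subroutine big_local_stack

subroutine big_local_heap(n)
   integer, intent(in) :: n
   real(8), allocatable :: work(:, :)
   allocate(work(n, n))         ! heap allocation: not limited by the stack size
   work = 0.d0
   print *, 'heap version done  ', work(1, 1)
   deallocate(work)
end subroutine big_local_heap

end program stack_vs_heap_sketch[/fortran]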

I compile with mpif90 against the MPICH2 libraries, with the following flags: -O3 -fpscomp logicals -openmp -threads -m64 -xS. The machine has an EM64T (Intel 64) architecture and runs a Debian Linux distribution. I set ulimit -s hard before running the program.

Any ideas on what is going on? Does it have something to do with the stack size?

Thanks in advance
bellman
Beginner
Let me try to add some information in case it helps. The main program exists only to call subroutine1, which is the only routine that makes MPI/OpenMP calls. It implements a Quasi-Newton (BFGS) minimization that I am probably allowed to share, so here is its code:

[fortran]module Quasi_Newton

implicit none
private
public :: subroutine1

contains

subroutine subroutine1(p,nop,np,move,maxiter,stopcr)

use readdata
use omp_lib
implicit none
include 'mpif.h'

integer, intent(in) :: nop,np,move(nop),maxiter !Number of parameters, number of parameters to be moved, list of the parameters to be moved and max iterations
real(8), intent(inout) :: p(nop) !Parameter vector
real(8), intent(in) :: stopcr !Stopping criterion
integer :: iter, ipar, jpar, impar, jmpar, iproc !Counters
real(8) :: df(np), prevdf(np), ddf(np,np), tempdir(np,1), dir(nop) !Gradient, Gradient at previous iteration, Hessian, temporary allocation of direction vector and direction vector
real(8) :: y(np,1), identity(np,np), outer(np,np) !Gradient at xn1-gradient at xn, identity and outer matrix used to update the hessian
real(8) :: an, an1, adf, addf !Factor that multiplies the direction of movement, updated factor, first and second differentials with respect to an
real(8) :: func0,func1,func2,func3,func4,func5 !Function evaluations
real(8) :: p0(nop),p1(nop),p2(nop),p3(nop),p4(nop),p5(nop) !Vectors of parameters to be evaluated
real(8) :: xn(nop), xn1(nop) !Current and updated guesses of parameters
real(8), parameter :: eps=0.0002, c1=0.0001, c2=0.9 !Epsilon for the differentials and c1 and c2 parameters of the Wolfe conditions
logical :: matrix_ill !Logical matrix used in the inverse to detect wrong situations
real :: currtime !Variable to store the time used

!******* variables related mpi *******

integer :: ierror !return error message from mpi subroutines
integer :: id !identification number of each processor
integer :: nproc !total number of processors
integer:: tag !used for tag of send/recv operations
integer,dimension(mpi_status_size):: mpi_status !used to store status of mpi_recv



!******* initialization *******

!initialization of mpi environment

call mpi_init(ierror)

!obtain id for each node

call mpi_comm_rank(mpi_comm_world, id, ierror)

!returns the number of nodes

call mpi_comm_size(mpi_comm_world, nproc, ierror)

if (id==0) write (3,1000)
if (id==0) print 1100

!! FIRST ITERATION: CALCULATE JACOBIAN AT THE INITIAL GUESS OF PARAMETERS AND GUESS THE HESSIAN EQUAL I

xn=p

!! Compute the gradient
if (id==0) print 2100
impar=1
! !$omp parallel do num_threads(1) default(none) private(p1,p2,func1,func2,impar,currtime,ipar) shared(xn,df,nop,algorithm,globaltime,move)
do ipar=1,nop
if (move(ipar)==0) cycle
if ((sum(move(1:ipar))>=id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)(sum(move(1:ipar))<=(id+1)*floor(1.*np/nproc)+(id+1)*(np-nproc*floor(1.*np/nproc)>=id+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)p1=xn
p2=xn
p1(ipar)=xn(ipar)+eps
call subroutine2(p1,func1,1)
currtime=secnds(globaltime)
print 2110, ipar, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60, func1
p2(ipar)=xn(ipar)-eps
call subroutine2(p2,func2,1)
currtime=secnds(globaltime)
print 2111, ipar, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60, func2
impar=sum(move(1:ipar))
df(impar)=(func1-func2)/(2*eps)
endif
enddo
! !$omp end parallel do
call mpi_barrier(mpi_comm_world,ierror)
if (id==0) then
do iproc=1,nproc-1
tag=iproc+100
call mpi_recv(df(iproc*floor(1.*np/nproc)+1+iproc*(np-nproc*floor(1.*np/nproc)>=iproc)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)min((iproc+1)*floor(1.*np/nproc)+(iproc+1)*(np-nproc*floor(1.*np/nproc)>=iproc+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),np-iproc*floor(1.*np/nproc)+1+iproc*(np-nproc*floor(1.*np/nproc)>=iproc)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),mpi_double_precision,iproc,tag,mpi_comm_world,mpi_status,ierror)
enddo
else
tag=id+100
call mpi_send(df(id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)min((id+1)*floor(1.*np/nproc)+(id+1)*(np-nproc*floor(1.*np/nproc)>=id+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),np-id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),mpi_double_precision,0,tag, mpi_comm_world,ierror)
endif
call mpi_bcast(df(1),np,mpi_double_precision,0,mpi_comm_world,ierror)
prevdf=df
currtime=secnds(globaltime)
if (id==0) print 9000, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60

!! Initial guess of the hessian (identity matrix)

identity=0
impar=1
do ipar=1,nop
if (move(ipar)==0) cycle
identity(impar,impar)=1.
impar=impar+1
enddo
ddf=identity


!! MAIN LOOP: DIRECTION-ALPHA-UPDATE PARAMETERS-UPDATE HESSIAN

iter=1
Main_loop: do iter=1,maxiter

!! Find the direction of minimization

tempdir(:,1)=-1.*matmul(ddf,df)
impar=1
do ipar=1,nop
if (move(ipar)==0) then
dir(ipar)=0
else
dir(ipar)=tempdir(impar,1)
impar=impar+1
endif
enddo

!! Line minimization

an=1
if (id==0) then
p0=p
call subroutine2(p0,func0,1)
endif
if (id==min(nproc-1,1)) then
p1=xn+an*dir
call subroutine2(p1,func1,1)
endif
call mpi_bcast(func0,1,mpi_double_precision,0,mpi_comm_world,ierror)
call mpi_bcast(func1,1,mpi_double_precision,min(nproc-1,1),mpi_comm_world,ierror)
call mpi_barrier(mpi_comm_world,ierror)
if (id==0) then
currtime=secnds(globaltime)
write (3,3000) iter,func0, p0
print 3100, iter, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60, func0, func1
endif
Line_min: do
if (id==0) then
p2=xn+(an+eps)*dir
call subroutine2(p2,func2,1)
endif
if (id==min(nproc-1,1)) then
p3=xn+(an-eps)*dir
call subroutine2(p3,func3,1)
endif
if (id==min(nproc-1-1*(nproc==2),2)) then
p4=xn+(an+2*eps)*dir
call subroutine2(p4,func4,1)
endif
if (id==min(nproc-1,3)) then
p5=xn+(an-2*eps)*dir
call subroutine2(p5,func5,1)
endif
call mpi_bcast(func2,1,mpi_double_precision,0,mpi_comm_world,ierror)
call mpi_bcast(func3,1,mpi_double_precision,min(nproc-1,1),mpi_comm_world,ierror)
call mpi_bcast(func4,1,mpi_double_precision,min(nproc-1-1*(nproc==2),2),mpi_comm_world,ierror)
call mpi_bcast(func5,1,mpi_double_precision,min(nproc-1,3),mpi_comm_world,ierror)
call mpi_barrier(mpi_comm_world,ierror)
adf=(func2-func3)/(2*eps)
addf=(func4-2*func1+func5)/(4*(eps**2))
an1=an-adf/addf
an=an1
if (id==0) then
p1=xn+an*dir
call subroutine2(p1,func1,1)
endif
call mpi_bcast(func1,1,mpi_double_precision,0,mpi_comm_world,ierror)
call mpi_barrier(mpi_comm_world,ierror)
if (func1>func0+c1*an*dot_product(tempdir(:,1),prevdf)) cycle Line_min
if (id==0) then
currtime=secnds(globaltime)
print 4100, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60, func1
endif
impar=1
! !$omp parallel do num_threads(1) default(none) private(p3,p4,func3,func4,impar,currtime,ipar) shared(xn,df,an,dir,y,nop,algorithm,globaltime,prevdf,move)
do ipar=1,nop
if (move(ipar)==0) cycle
if ((sum(move(1:ipar))>=id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)(sum(move(1:ipar))<=(id+1)*floor(1.*np/nproc)+(id+1)*(np-nproc*floor(1.*np/nproc)>=id+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)p3=xn+an*dir
p4=xn+an*dir
p3(ipar)=xn(ipar)+an*dir(ipar)+eps
call subroutine2(p3,func3,1)
currtime=secnds(globaltime)
print 2110, ipar, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60, func3
p4(ipar)=xn(ipar)+an*dir(ipar)-eps
call subroutine2(p4,func4,1)
currtime=secnds(globaltime)
print 2111, ipar, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60, func4
impar=sum(move(1:ipar))
y(impar,1)=(func3-func4)/(2*eps)-prevdf(impar)
df(impar)=(func3-func4)/(2*eps)
endif
enddo
! !$omp end parallel do

call mpi_barrier(mpi_comm_world,ierror)
if (id==0) then
do iproc=1,nproc-1
tag=iproc+500
call mpi_recv(df(iproc*floor(1.*np/nproc)+1+iproc*(np-nproc*floor(1.*np/nproc)>=iproc)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)min((iproc+1)*floor(1.*np/nproc)+(iproc+1)*(np-nproc*floor(1.*np/nproc)>=iproc+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),np-iproc*floor(1.*np/nproc)+1+iproc*(np-nproc*floor(1.*np/nproc)>=iproc)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),mpi_double_precision,iproc,tag,mpi_comm_world,mpi_status,ierror)
tag=iproc+5000
call mpi_recv(y(iproc*floor(1.*np/nproc)+1+iproc*(np-nproc*floor(1.*np/nproc)>=iproc)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)min((iproc+1)*floor(1.*np/nproc)+(iproc+1)*(np-nproc*floor(1.*np/nproc)>=iproc+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),np-iproc*floor(1.*np/nproc)+1+iproc*(np-nproc*floor(1.*np/nproc)>=iproc)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),mpi_double_precision,iproc,tag,mpi_comm_world,mpi_status,ierror)
enddo
else
tag=id+500
call mpi_send(df(id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)min((id+1)*floor(1.*np/nproc)+(id+1)*(np-nproc*floor(1.*np/nproc)>=id+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),np-id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),mpi_double_precision,0,tag, mpi_comm_world,ierror)
tag=id+5000
call mpi_send(y(id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc)min((id+1)*floor(1.*np/nproc)+(id+1)*(np-nproc*floor(1.*np/nproc)>=id+1)&
+(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),np-id*floor(1.*np/nproc)+1+id*(np-nproc*floor(1.*np/nproc)>=id)+&
(np-nproc*floor(1.*np/nproc))*(np-nproc*floor(1.*np/nproc),mpi_double_precision,0,tag, mpi_comm_world,ierror)
endif
call mpi_bcast(df(1),np,mpi_double_precision,0,mpi_comm_world,ierror)
call mpi_bcast(y(1,1),np,mpi_double_precision,0,mpi_comm_world,ierror)
call mpi_barrier(mpi_comm_world,ierror)
if (abs(dot_product(tempdir(:,1),df))<=c2*abs(dot_product(tempdir(:,1),prevdf))) exit Line_min
enddo Line_min
prevdf=df
if (id==0) then
currtime=secnds(globaltime)
print 9000, floor(currtime/3600) ,floor((currtime-floor(currtime/3600)*3600)/60),currtime-floor(currtime/60)*60
endif

!! Update parameters

xn1=xn+an1*dir
p=xn1
if (maxval(abs(xn1-xn))<=stopcr) exit Main_loop
xn=xn1

!! Update the hessian

outer=identity-(1/dot_product(y(:,1),an1*tempdir(:,1)))*matmul(y,transpose(an1*tempdir))
ddf=matmul(matmul(transpose(outer),ddf),outer)+(1/dot_product(y(:,1),an1*tempdir(:,1)))*matmul(an1*tempdir,transpose(an1*tempdir))


enddo Main_loop


!******* finalize mpi environment *******

call mpi_finalize(ierror)

1000 FORMAT (' BROYDEN_FLETCHER_GOLDFARB_SHANNO...' /&
' Iteration Func.value. Parameter values')
3000 FORMAT (' ', i6, ' ', f17.8, ' ', 105f20.10)

1100 FORMAT (' BROYDEN_FLETCHER_GOLDFARB_SHANNO...')
2100 FORMAT (' Computing the gradient at x0...')
2110 FORMAT (' First evaluation of parameter ', i3, ' at ',i4,':',i3,':',f6.2,' with value function ', f17.8)
2111 FORMAT (' Second evaluation of parameter ', i3, ' at ',i4,':',i3,':',f6.2,' with value function ', f17.8)
3100 FORMAT (' Iteration ', i6, ' started at ',i4,':',i3,':',f6.2,' with value functions ', f17.8, f17.8)
4100 FORMAT (' First Wolfe criterion met at ',i4,':',i3,':',f6.2, ' with value function', f17.8)
9000 FORMAT (' ... done at ',i4,':',i3,':',f6.2)


end subroutine

end module[/fortran]
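
In case it clarifies the intent of the mpi_send/mpi_recv blocks above: each process computes a contiguous block of the gradient entries, the pieces are collected on process 0 and then broadcast to everyone. As far as I can tell, the same exchange could be written as a single mpi_allgatherv over the block-distributed slices; here is a toy, standalone sketch of that pattern (sizes and the dummy fill are made up, it is not my actual code, and it assumes nproc <= np):

[fortran]program gradient_gather_sketch
! Toy illustration of gathering block-distributed gradient slices with mpi_allgatherv.
use mpi
implicit none
integer, parameter :: np = 10          ! number of moved parameters (made up)
integer :: id, nproc, ierror, i, lo, hi
integer, allocatable :: counts(:), displs(:)
real(8) :: df(np), local(np)

call mpi_init(ierror)
call mpi_comm_rank(mpi_comm_world, id, ierror)
call mpi_comm_size(mpi_comm_world, nproc, ierror)

! Block distribution of the np gradient entries over the processes.
allocate(counts(0:nproc-1), displs(0:nproc-1))
do i = 0, nproc-1
   counts(i) = np/nproc + merge(1, 0, i < mod(np, nproc))
end do
displs(0) = 0
do i = 1, nproc-1
   displs(i) = displs(i-1) + counts(i-1)
end do

lo = displs(id) + 1
hi = displs(id) + counts(id)

! Each process fills only its own slice (stands in for the finite differences).
local = 0.d0
do i = lo, hi
   local(i) = real(i, 8)
end do

! One collective replaces the explicit send/recv loop plus the broadcast.
call mpi_allgatherv(local(lo), counts(id), mpi_double_precision, &
                    df, counts, displs, mpi_double_precision, &
                    mpi_comm_world, ierror)

if (id == 0) print *, df
call mpi_finalize(ierror)
end program gradient_gather_sketch[/fortran]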



Also, note the following (strange) things:

  • I switch between OpenMP and MPI by uncommenting the OpenMP lines and commenting out the MPI calls (and some IFs).
  • The first (input) argument of subroutine2 is the vector of parameters; the second (output) is the value of the function to be minimized (a scalar); the third is a flag that determines the size of some arrays inside subroutine2.
  • The code works properly with OpenMP (with 1 and 8 threads on an 8-core machine with 16 GB of RAM).
  • It doesn't work with MPI even if I comment out all MPI calls and give every process id=0 (so that they all do the same thing). It doesn't even work with mpiexec -np 1 (i.e. a single process), with the MPI calls either commented or uncommented.
  • Subroutine1 works perfectly (with either OpenMP or MPI) to minimize the following test function:
[fortran]
subroutine subroutine2(p,h,n1)
use readdata
implicit none
integer, parameter :: npar=105
real(8), intent(in) :: p(npar)
real(8), intent(out) :: h
integer, intent(in) :: n1



h=sum(p**2)+sum(p**4)

end subroutine[/fortran]


  • I am not able to reproduce the error with simpler versions of the program
I hope this helps. Let me know if any further details would be useful.
Grant_H_Intel
Employee

I suspect you may get better advice if you move this thread to the "Intel Clusters and HPC Technology" forum here:

http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/

The problems you describe seem to be related to MPI, not OpenMP, if I understand correctly how you are running this program. MPI questions are usually addressed (and monitored) in the above forum rather than this one.

Hope this helps,

bellman
Beginner
Thank you so much. I will do it!
sarajohnson
Beginner
Wow, great. Thanks for this.