Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Issues running on Ubuntu 10.04.2 LTS

AGG1
Beginner
We have an application that runs as a hybrid MPI-OpenMP code. We have tested it on RHEL 4 and 5 and on SUSE 10 and 11. A client is trying to run it on Ubuntu 10.04.2 LTS and it fails there. The application is written in Fortran, compiled with Intel Fortran 11.1 and Intel MPI 4; we compile with mpiifort and use the -openmp flag. On this Ubuntu system, however, the only way we can get it running is to drop the -openmp flag and instead add -liomp5 at the link stage.
What is the explanation for this?
Thanks!
5 Replies
James_T_Intel
Moderator

Hi,

I am currently attempting to reproduce this behavior on our systems. Could you please provide a sample that reproduces it? Thank you.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

AGG1
Beginner
Hi Jim,
We have a Laplace solver (hybrid OpenMP-MPI) that crashes on the system. On the Ubuntu system it runs in pure MPI mode if we don't use the -openmp compiler flag. In hybrid mode, however, it runs only if the arrays u and du are allocated dynamically. If they are allocated statically, it runs only when imax and jmax are each less than 700. We checked 'ulimit -a' and found that max locked memory was set to 64. We changed it to unlimited, but we still get crashes.
We've tried Intel MPI 4 as well as MPICH2 1.4; the behavior is the same with both in hybrid mode. We compile with:
mpiifort/mpif90 -openmp laplace.f
Thanks,
Anup
Here is the code: laplace.f
[fxfortran]      program lpmlp
      include 'mpif.h'
      include "omp_lib.h"

      integer imax,jmax,im1,im2,jm1,jm2,it,itmax
!      parameter (imax=2001,jmax=2001)
      parameter (imax=10,jmax=10)
      parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)
      parameter (itmax=100)
      real*8 u(imax,jmax),du(imax,jmax),umax,dumax,tol,pi
      parameter (umax=10.0,tol=1.0e-6,pi=3.14159)
! Additional MPI parameters
      integer istart,iend,jstart,jend
      integer size,rank,ierr,istat(MPI_STATUS_SIZE),mpigrid,length
      integer grdrnk,dims(1),gloc(1),up,down,isize,jsize
      integer ureq,dreq
      integer ustat(MPI_STATUS_SIZE),dstat(MPI_STATUS_SIZE)
      real*8 tstart,tend,gdumax
      logical cyclic(1)
      real*8 uibuf(imax),uobuf(imax),dibuf(imax),dobuf(imax)
! OpenMP parameters
      integer nthrds

! Initialize
      call MPI_INIT_THREAD(MPI_THREAD_FUNNELED,IMPI_prov,ierr)
!      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

! 1D linear topology
      dims(1)=size
      cyclic(1)=.FALSE.
      call MPI_CART_CREATE(MPI_COMM_WORLD,1,dims,cyclic,.true.,mpigrid
     +     ,ierr)
      call MPI_COMM_RANK(mpigrid,grdrnk,ierr)
      call MPI_CART_COORDS(mpigrid,grdrnk,1,gloc,ierr)
      call MPI_CART_SHIFT(mpigrid,0,1,down,up,ierr)
      istart=2
      iend=imax-1
      jsize=jmax/size
      jstart=gloc(1)*jsize+1
      if (jstart.LE.1) jstart=2
      jend=(gloc(1)+1)*jsize
      if (jend.GE.jmax) jend=jmax-1
      nthrds=OMP_GET_NUM_PROCS()
      print*,"Rank=",rank,"Threads=",nthrds
      call omp_set_num_threads(nthrds)

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
! Initialize -- done in parallel to force "first-touch" distribution
! on ccNUMA machines (i.e. O2k)
!$OMP DO
      do j=jstart-1,jend+1
         do i=istart-1,iend+1
            u(i,j)=0.0
            du(i,j)=0.0
         enddo
         u(imax,j)=umax*sin(pi*float(j-1)/float(jmax-1))
      enddo
!$OMP END DO
!$OMP END PARALLEL

! Main computation loop
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tstart=MPI_WTIME()
      do it=1,itmax
! We have to keep the OpenMP and MPI calls segregated...
         call omp_set_num_threads(nthrds)

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
!$OMP MASTER
         dumax=0.0
!$OMP END MASTER
!$OMP DO REDUCTION(max:dumax)
         do j=jstart,jend
            do i=istart,iend
               du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)
               dumax=max(dumax,abs(du(i,j)))
            enddo
         enddo
!$OMP END DO
!$OMP DO
         do j=jstart,jend
            do i=istart,iend
               u(i,j)=u(i,j)+du(i,j)
            enddo
         enddo
!$OMP END DO
!$OMP END PARALLEL
! Compute the overall residual
         call MPI_REDUCE(dumax,gdumax,1,MPI_REAL8,MPI_MAX,0
     +        ,MPI_COMM_WORLD,ierr)

! Send phase
         if (down.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               dobuf(j)=u(i,jstart)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(dobuf,length,MPI_REAL8,down,it,mpigrid,
     +           dreq,ierr)
         endif
         if (up.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               uobuf(j)=u(i,jend)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(uobuf,length,MPI_REAL8,up,it,mpigrid,
     +           ureq,ierr)
         endif
! Receive phase
         if (down.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(dibuf,length,MPI_REAL8,down,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(dreq,dstat,ierr)
            j=1
            do i=istart,iend
               u(i,jstart-1)=dibuf(j)
               j=j+1
            enddo
         endif
         if (up.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(uibuf,length,MPI_REAL8,up,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(ureq,ustat,ierr)
            j=1
            do i=istart,iend
               u(i,jend+1)=uibuf(j)
               j=j+1
            enddo
         endif
         write (rank+10,*) rank,it,dumax,gdumax
         if (rank.eq.0) write (1,*) it,gdumax
      enddo
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tend=MPI_WTIME()
      if (rank.EQ.0) then
         write(*,*) 'Calculation took ',tend-tstart,'s. on ',size,
     +        ' MPI processes'
     +        ,' with ',nthrds,' OpenMP threads per process'
      endif
      call MPI_FINALIZE(ierr)
      stop
      end
[/fxfortran]
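For reference, the dynamic-allocation variant mentioned above only changes the declaration of u and du and adds an explicit allocation; a rough sketch of that change (not our exact code):

[fxfortran]      integer imax,jmax,ierr
      parameter (imax=10,jmax=10)
! u and du now live on the heap instead of in fixed-size storage
      real*8, allocatable :: u(:,:),du(:,:)

      allocate(u(imax,jmax),du(imax,jmax),stat=ierr)
      if (ierr.NE.0) then
         print*,'allocation of u/du failed'
         stop
      endif
! ... rest of the solver unchanged ...
      deallocate(u,du)
[/fxfortran]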
TimP
Honored Contributor III
I think you're telling us that mpiifort isn't using the -openmp flag to imply -liomp5. You could verify this by adding -# to the mpiifort options at the failing step and posting the resulting diagnostics. That might also show whether you have mixed up 32-bit and 64-bit modes and paths, which is easy to do on Ubuntu since it doesn't follow the usual layout for 64-bit library paths.
Needless to say, if you specify any of the system library directories in your Makefile, you must adjust those to the different locations used by Ubuntu.
jackdgalloway
Beginner
An update regarding this question (I am the user who is having the problems). We have upgraded to Ubuntu 11.10, as we needed to anyway, and hoped the problem would be resolved in the process, but it wasn't.
Current status: I'm trying to run a hybrid MPI (using MPICH2)/OpenMP job, and I have a test F77 program that triggers the same problem.
(1) If I run the job from a specific node (say "node1") and include only "node1" in the machines file, the code runs fine.
(2) If I run without "-openmp", the code runs fine across nodes (with "node1", "node2", etc. in the machines file), but it obviously isn't threaded.
(3) If I run across nodes ("node1", "node2", etc.) after compiling with "-openmp", I get segmentation faults and crashes.
In digging further I found a problem with the statically allocated arrays. If the arrays are too big, the program crashes with the seg fault; if they are small enough, it runs to completion fine. As a hunch I converted the program to dynamic memory allocation, which solved the problem no matter how big the arrays were. However, I am not able to change the parent, much larger program, so this is not a long-term solution, but hopefully it helps pin down the problem. Attached is the test.f file that I use. If imax=jmax>=721 the code crashes, but at 720 or smaller it works fine (a size of 10,000 worked fine in the dynamic-allocation test).
A couple of other items worth mentioning: the stack size is set to unlimited, verified through "ulimit -a". I have changed KMP_STACKSIZE to upwards of 4G and 8G to make sure this was not an issue, and that worked fine. One final thing: I tried linking "-liomp5" instead of compiling with "-openmp". At first I thought this worked, since no segmentation faults occurred and the code ran to completion, but that was only partly true: no threading was taking place. CPU utilization never went to 800%, and even though "nthrds=OMP_GET_NUM_PROCS()" set nthrds to 8 (verified in a number of places), "call omp_set_num_threads(nthrds)" with nthrds equal to 8 always left the number of threads at 1, verified by printing out:
call omp_set_num_threads(nthrds)
nthreads = OMP_GET_NUM_THREADS()
print*,"Jack",rank,nthreads,nthrds
This showed nthrds = 8 but nthreads = 1. I gave up on the -liomp5 linking at that point, as I believe the issue is more closely related to whatever is behind the static vs. dynamic memory allocation behavior, but I'm somewhat stumped.
[fxfortran]      program lpmlp
      include 'mpif.h'
      include "omp_lib.h" 

      integer imax,jmax,im1,im2,jm1,jm2,it,itmax
      parameter (imax=10000,jmax=10000)
      parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)
      parameter (itmax=100)
      real*8 u(imax,jmax),du(imax,jmax),umax,dumax,tol,pi
      parameter (umax=10.0,tol=1.0e-6,pi=3.14159)
! Additional MPI parameters
      integer istart,iend,jstart,jend
      integer size,rank,ierr,mpigrid,length
      integer grdrnk,dims(1),gloc(1),up,down,isize,jsize
      integer ureq,dreq
      integer ustat(MPI_STATUS_SIZE),dstat(MPI_STATUS_SIZE)
      integer istat(MPI_STATUS_SIZE)
      real*8 tstart,tend,gdumax
      logical cyclic(1)
      real*8 uibuf(imax),uobuf(imax),dibuf(imax),dobuf(imax)
! OpenMP parameters
      integer nthrds,nthreads      

! Initialize
      call MPI_INIT_THREAD(MPI_THREAD_FUNNELED,IMPI_prov,ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

      print*, "hello from", rank
      !call sleep(180) 

! 1D linear topology
      dims(1)=size
      cyclic(1)=.FALSE.
      call MPI_CART_CREATE(MPI_COMM_WORLD,1,dims,cyclic,.true.,mpigrid
     +     ,ierr)
      call MPI_COMM_RANK(mpigrid,grdrnk,ierr)
      call MPI_CART_COORDS(mpigrid,grdrnk,1,gloc,ierr)
      call MPI_CART_SHIFT(mpigrid,0,1,down,up,ierr)
      istart=2
      iend=imax-1
      jsize=jmax/size
      jstart=gloc(1)*jsize+1
      if (jstart.LE.1) jstart=2
      jend=(gloc(1)+1)*jsize
      if (jend.GE.jmax) jend=jmax-1
      nthrds=OMP_GET_NUM_PROCS()
      print*,"Rank=",rank,"Threads=",nthrds
      call omp_set_num_threads(nthrds)
                    
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
! Initialize -- done in parallel to force "first-touch" distribution
! on ccNUMA machines (i.e. O2k)
!$OMP DO
      do j=jstart-1,jend+1
         do i=istart-1,iend+1
            u(i,j)=0.0
            du(i,j)=0.0
         enddo
         u(imax,j)=umax*sin(pi*float(j-1)/float(jmax-1))
      enddo
!$OMP END DO
!$OMP END PARALLEL

! Main computation loop
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tstart=MPI_WTIME()
      do it=1,itmax
! We have to keep the OpenMP and MPI calls segregated...
        call omp_set_num_threads(nthrds)
        !nthreads = OMP_GET_NUM_THREADS()
        !print*,"Jack",rank,nthreads,nthrds
               
        
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
!$OMP MASTER
        dumax=0.0
!$OMP END MASTER
!$OMP DO REDUCTION(max:dumax)
         do j=jstart,jend
            do i=istart,iend
               !nthreads = OMP_GET_NUM_THREADS()
               !print*,"Jack",rank,nthreads,nthrds
               du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)
               dumax=max(dumax,abs(du(i,j)))
            enddo
         enddo
!$OMP END DO
!$OMP DO
         do j=jstart,jend
            do i=istart,iend
               u(i,j)=u(i,j)+du(i,j)
            enddo
         enddo
!$OMP END DO
!$OMP END PARALLEL
! Compute the overall residual
         call MPI_REDUCE(dumax,gdumax,1,MPI_REAL8,MPI_MAX,0
     +        ,MPI_COMM_WORLD,ierr)

! Send phase
         if (down.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               dobuf(j)=u(i,jstart)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(dobuf,length,MPI_REAL8,down,it,mpigrid,
     +           dreq,ierr)
         endif
         if (up.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               uobuf(j)=u(i,jend)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(uobuf,length,MPI_REAL8,up,it,mpigrid,
     +           ureq,ierr)
         endif
! Receive phase
         if (down.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(dibuf,length,MPI_REAL8,down,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(dreq,dstat,ierr)
            j=1
            do i=istart,iend
               u(i,jstart-1)=dibuf(j)
               j=j+1
            enddo
         endif
         if (up.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(uibuf,length,MPI_REAL8,up,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(ureq,ustat,ierr)
            j=1
            do i=istart,iend
               u(i,jend+1)=uibuf(j)
               j=j+1
            enddo
         endif
         write (rank+10,*) rank,it,dumax,gdumax
         if (rank.eq.0) write (1,*) it,gdumax
      enddo
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tend=MPI_WTIME()
      if (rank.EQ.0) then
         write(*,*) 'Calculation took ',tend-tstart,'s. on ',size,
     +        ' MPI processes'
     +        ,' with ',nthrds,' OpenMP threads per process'
      endif
      call MPI_FINALIZE(ierr)
      stop
      end
[/fxfortran]
James_T_Intel
Moderator
Hi Jack,

As you have found, simply linking against the OpenMP* runtime library (-liomp5) is insufficient. It makes the OpenMP* functions available, but the compiler directives are ignored. Fully enabling OpenMP* with -openmp is what makes the compiler honor the directives.
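Also note that omp_get_num_threads() reports the size of the team executing the region it is called from, so outside of a parallel region it returns 1 even when OpenMP is fully enabled. A thread-count check is therefore only meaningful inside a parallel region. A minimal sketch of such a check (standalone test code, not taken from your solver):

[fxfortran]      program thrdchk
      include "omp_lib.h"
      integer nthrds,nteam
! outside any parallel region the current team is always one thread
      nteam=OMP_GET_NUM_THREADS()
      print*,'team size outside parallel region: ',nteam
      nthrds=OMP_GET_NUM_PROCS()
      call omp_set_num_threads(nthrds)
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP MASTER
! this reports the real team size, but only when the directives are
! compiled in with -openmp; with -liomp5 alone they are comments
      nteam=OMP_GET_NUM_THREADS()
      print*,'team size inside parallel region:  ',nteam
!$OMP END MASTER
!$OMP END PARALLEL
      end
[/fxfortran]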

It appears that your primary issue is memory usage. You have stated that you have set your stack to unlimited. Is this a 32-bit or 64-bit system? If it is 64-bit, are you using the 32-bit or 64-bit Intel Fortran Compiler?

I have compiled the code you provided. On our Ubuntu 10.04 system I am able to run it with a sufficiently large stack; too small a stack size limit prevents the program from running. The environment variables KMP_STACKSIZE and OMP_STACKSIZE were not set. I would recommend that you verify that an unlimited stack size really is unlimited, and check whether other system constraints (including other programs running at the same time) could be preventing your program from running.

The Intel Software Network has two knowledge base articles that may help you with this issue.

http://software.intel.com/en-us/articles/openmp-option-no-pragmas-causes-segmentation-fault/
http://software.intel.com/en-us/articles/intel-fortran-compiler-increased-stack-usage-of-80-or-higher-compilers-causes-segmentation-fault/
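If converting the parent program to dynamic allocation is not possible, another thing you could try (a general suggestion, not something we have tested against your code) is to give the large work arrays the SAVE attribute. That forces them into static storage, so they are not placed on the stack the compiler uses for local variables when OpenMP is enabled:

[fxfortran]      integer imax,jmax
      parameter (imax=10000,jmax=10000)
! SAVE puts u and du in static storage rather than on the stack,
! independent of the stack-related effects of compiling with -openmp
      real*8, save :: u(imax,jmax),du(imax,jmax)
[/fxfortran]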

Please let me know if any of this information helps address your issue.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools