Hi All,
I have a hanging problem with Intel MPI 2019.9.304, as described below.
System Information
uname -r :: 4.18.0-240.10.1.el8_3.x86_64
ifort -v :: 19.1.3.304 or 2021.2.0
mpirun -V :: 2019 Update 9 Build 20200923 or 2021.2 Build 20210302
lsf :: 10.1.0.0
ofed_info :: MLNX_OFED_LINUX-5.3-1.0.0.1
Code Used (example.f90)
program example
  implicit none
  include "mpif.h"
  integer :: rank, size
  real :: sum, n
  integer :: i, j, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

  ! Each rank takes its share of the outer loop; the inner loop is OpenMP-parallel.
  n = 0.0
  sum = 0.0
  do i = rank+1, 100000, size
  !$omp parallel do private(j) reduction(+:n)
     do j = 1, 1000000
        n = n + real(i) + real(j)
     end do
  !$omp end parallel do
  end do

  print *, 'MY Rank:', rank, 'MY Part:', n

  ! Combine the per-rank partial sums on rank 0 (reals are 8 bytes because of -r8).
  call MPI_Reduce(n, sum, 1, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'PE:', rank, 'total is :', sum

  call MPI_Finalize(ierr)
  print *, 'End of Code: MyRank: ', rank
end program example
Compile and Run Commands
$ mpiifort -r8 -qopenmp -o example.exe ./example.f90
$ bsub -J Test -n 2 -R "span[ptile=1] affinity[core(2)]" \
mpirun -n 2 ./example.exe
Simulation
I repeated each of the following 4 cases 10 times.
Case 1 : ifort 19.1.3.304 / intel mpi 2019.9.304
Case 2 : ifort 2021.2.0 / intel mpi 2019.9.304
Case 3 : ifort 19.1.3.304 / intel mpi 2021.2.0
Case 4 : ifort 2021.2.0 / intel mpi 2021.2.0
Cases 3 and 4 ran normally every time. However, Cases 1 and 2 sometimes hang at MPI_Finalize (judging from the line printed after MPI_Finalize).
Cases 3 and 4 produced good results in all runs, while Case 1 produced good results in only 3 out of 10 runs and Case 2 in 4 out of 10.
Good Result Example
MY Rank: 0 MY Part: 2.750002500000000E+016
MY Rank: 1 MY Part: 2.750007500000000E+016
PE: 0 total is : 5.500010000000000E+016
End of Code: MyRank: 1
End of Code: MyRank: 0
Bad Result Example (the job then ran until the wall-time limit)
MY Rank: 1 MY Part: 2.750007500000000E+016
MY Rank: 0 MY Part: 2.750002500000000E+016
PE: 0 total is : 5.500010000000000E+016
End of Code: MyRank: 0
User defined signal 2
Question
Could I get some advice on how to solve this problem?
I also tried some of the suggestions from the thread "MPI program hangs in MPI_Finalize", but the environment variables mentioned there (I_MPI_HYDRA_BRANCH_COUNT, I_MPI_LSF_USE_COLLECTIVE_LAUNCH) did not help.
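For reference, this is roughly how I tried setting those variables at submission time (a sketch only; the values are illustrative, and LSF propagates the submission-shell environment to the job by default):
$ export I_MPI_HYDRA_BRANCH_COUNT=2        # illustrative value
$ export I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1
$ bsub -J Test -n 2 -R "span[ptile=1] affinity[core(2)]" \
    mpirun -n 2 ./example.exe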
Thanks in advance.
Hi,
Thanks for providing the reproducible code with all specifications and expected output.
However, we have checked all 4 cases you mentioned multiple times, but we are unable to reproduce your issue.
Could you please provide the debug information/error log for the "Bad Result Example" so that we can understand your issue better?
Use the command below to collect the debug information:
I_MPI_DEBUG=10 mpirun -n 2 ./example.exe
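If the run has to go through LSF, the debug level can also be passed inside your existing bsub line, for example (a sketch based on the submission command you posted):
$ bsub -J Test -n 2 -R "span[ptile=1] affinity[core(2)]" \
    mpirun -genv I_MPI_DEBUG 10 -n 2 ./example.exe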
Thanks & Regards
Varsha
Hi,
Thanks for your reply, and sorry for the late reply.
I found something interesting while adjusting the I_MPI_DEBUG option, so I have been experimenting over the past few days.
As a result, the problem (hanging at MPI_Finalize) does not occur with I_MPI_DEBUG >= 3.
What does the I_MPI_DEBUG level change, apart from the amount of debug information printed?
I am attaching the details (some logs) below to share what I found.
- I_MPI_DEBUG_2.out : stdout with I_MPI_DEBUG=2
- I_MPI_DEBUG_2.err : stderr with I_MPI_DEBUG=2, when the problem occurred (about 4 out of 10 runs on my system)
- I_MPI_DEBUG_3.out : stdout with I_MPI_DEBUG=3; stderr was empty (I only concealed the node name of rank 0)
- I_MPI_DEBUG_10.out : stdout with I_MPI_DEBUG=10; stderr was empty (I concealed the node name of rank 0 and the path in I_MPI_ROOT)
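For reference, each log was collected roughly like this (a sketch; the bsub -o/-e file names correspond to the attachments, and the run was repeated with I_MPI_DEBUG set to 2, 3, and 10):
$ export I_MPI_DEBUG=2   # also repeated with 3 and 10
$ bsub -J Test -n 2 -R "span[ptile=1] affinity[core(2)]" \
    -o I_MPI_DEBUG_2.out -e I_MPI_DEBUG_2.err \
    mpirun -n 2 ./example.exe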
Thanks and Regards
HG Choe
Hi,
Thanks for providing the required information.
Could you please run your code using the below command:
mpiexec.hydra -n 2 -ppn 2 ./example.exe
Also, could you please let us know whether you get the expected outcome?
Thanks & Regards
Varsha
Hi, Varsha.
I think you intended to use only one node via the -ppn option, is that right?
In short, it worked well with the -ppn option (after changing ptile in the LSF resource request):
bsub -J Test -n 2 -R "span[ptile=2] affinity[core(2)]" \
mpiexec.hydra -n 2 -ppn 2 ./example.exe
But I need to run with multiple nodes.
My hybrid application (MPI+OpenMP) targets n >= 10 ranks with OMP_NUM_THREADS = 76.
For example, the run command below hangs:
bsub -J Test -n 10 -R "span[ptile=1] affinity[core(76)]" \
OMP_NUM_THREADS=76 mpiexec.hydra -n 10 ./example.exe
Thanks and Regards
HG Choe
Hi Choe,
Thanks for providing the information.
Could you please try the points below and let us know the outcome:
1. From the debug output ("[0] MPI startup(): library kind: release_mt") we can see that you are using the "release_mt" library kind. Could you please let us know whether you face the same issue with the "release" library kind? (A sketch of switching the library kind is given after this list.)
2. Could you please let us know whether this issue is specific to this application or occurs with other applications as well? Could you please try running the IMB-MPI1 benchmark and share the output?
mpiexec.hydra -np 2 -ppn 1 IMB-MPI1 allreduce
3. If you are still facing the issue, could you please provide a debug trace (obtained with GDB) of the MPI process that hangs?
4. Could you also try running from an interactive shell and let us know whether you get the expected results?
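For point 1, the library kind is normally selected when sourcing the Intel MPI environment script. A rough sketch (the script name and path depend on your installation, e.g. mpivars.sh for Intel MPI 2019 or vars.sh for oneAPI):
$ source $I_MPI_ROOT/intel64/bin/mpivars.sh release   # select the 'release' configuration instead of 'release_mt'
$ bsub -J Test -n 2 -R "span[ptile=1] affinity[core(2)]" \
    mpirun -n 2 ./example.exe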
Thanks & Regards
Varsha
Hi,
We haven't heard back from you. Could you please provide an update on your issue?
Thanks & Regards
Varsha
Hi,
First of all, I'm sorry for the late reply again.
Fortunately, we found some solutions in the meantime.
1. Update libfabric to the latest release (OpenFabrics 1.13.1, Release v1.13.1 · ofiwg/libfabric · GitHub)
(cf. Intel® MPI Library 2019 Over Libfabric*)
2. Add an MPI_Barrier call before MPI_Finalize:
call MPI_Reduce(n, sum, 1, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
if (rank == 0) print *, 'PE:', rank, 'total is :', sum
call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize all ranks before finalizing
call MPI_Finalize(ierr)
So we have applied solution 1 (the latest libfabric) to our system, and the hang problem is gone.
(We felt that solution 2 (MPI_Barrier) is more of a naive workaround.)
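For reference, here is a rough sketch of how an external libfabric build can be selected instead of the bundled one (the install prefix /opt/libfabric-1.13.1 is only an example path; I_MPI_OFI_LIBRARY_INTERNAL=0 tells Intel MPI not to use its internal libfabric):
$ export I_MPI_OFI_LIBRARY_INTERNAL=0                                # do not use the bundled libfabric
$ export LD_LIBRARY_PATH=/opt/libfabric-1.13.1/lib:$LD_LIBRARY_PATH  # example install prefix of libfabric 1.13.1
$ bsub -J Test -n 2 -R "span[ptile=1] affinity[core(2)]" \
    mpirun -n 2 ./example.exe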
In my personal opinion, I suspect a compatibility issue between CentOS 8.3, MLNX_OFED 5.3, and libfabric 1.10.1.
I think we can close this issue.
Additionally, I would appreciate it if you could share any information about this compatibility (if you have a report or known issue).
* Some follow-up on the points from your previous reply:
1. We tried 'release', 'release_mt', 'debug', and 'debug_mt', and all showed the same problem. All of them work fine with the latest libfabric (1.13.1).
2. The "IMB-MPI1 allreduce" case behaved the same way (it also works with the latest libfabric).
3 & 4. I could not try the GDB trace or the interactive shell because of our site policy (I cannot access the compute nodes except through LSF).
Thanks for taking the time to review my issue, Varsha.
Hi,
>> (if you have a report or known issue)
Could you please refer to the link below for Intel MPI updates, known issues, and system requirements.
Glad to know that your issue is resolved. Thanks for sharing the solution with us. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards
Varsha