Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI program hangs when running on multiple cores over IPoIB (Windows)

jackyjngwn
Beginner

Hi,

I have an MPI program that runs fine on a Windows cluster over Ethernet. When I run it over IPoIB with one process per node, there is also no problem. However, when I try to start multiple processes on each node, it hangs. Can anyone tell me what is wrong?

This is the script I use to run the program (there are 4 hosts in total). I am using the Intel MPI Library runtime (RT) 4.0.3.009 for Windows, and the operating system is Windows Server 2008 R2.

set I_MPI_AUTH_METHOD=delegate
set I_MPI_NETMASK=ib
set I_MPI_DEBUG=5
mpiexec.exe -machinefile hosts -n 8 myprogram.exe
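
In case the rank placement matters: a variant of the same launch that spells out a two-ranks-per-node layout and the fabric selection explicitly would look roughly like the following. The hostname:count machinefile entries and the I_MPI_FABRICS=shm:tcp line are just a sketch of the explicit form, not my current setup.

rem hosts file: one "hostname:count" line per node, for example
rem   H001:2
rem   H003:2
rem   H004:2
rem   H007:2
set I_MPI_AUTH_METHOD=delegate
set I_MPI_FABRICS=shm:tcp
set I_MPI_NETMASK=ib
set I_MPI_DEBUG=5
mpiexec.exe -machinefile hosts -n 8 myprogram.exe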

Below is the output I get before the program hangs:

[-1] MPI startup(): Rank    Pid      Node name  Pin cpu
[-1] MPI startup(): 0       9652                {0,1,2,3,4,5,6,7,8,9,10,11}
[-1] MPI startup(): I_MPI_DEBUG=5
[-1] MPI startup(): I_MPI_PIN_MAPPING=1:0 0
[4] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H001
[4] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H001
[0] MPI startup(): shm and tcp data transfer modes
[7] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H007
[2] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H004
[2] MPI startup(): shm and tcp data transfer modes
[7] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H007
[3] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H003
[6] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H004
[6] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[5] MPI startup(): The real interface being used for tcp is 'Mellanox IPoIB Adapter' and interface hostname is H003
[5] MPI startup(): shm and tcp data transfer modes
[6] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Internal info: pinning initialization was done
[5] MPI startup(): Internal info: pinning initialization was done
[7] MPI startup(): Internal info: pinning initialization was done

Thanks,

Ling Zhuo

James_T_Intel
Moderator

Hi Ling,

Can you attach a debugger to one of the hung processes and send a stack trace?
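
For example, assuming the Microsoft Debugging Tools for Windows (which include cdb) are installed on the node, attaching to one of the hung myprogram.exe processes would look roughly like this:

cdb -p <pid>
~* kb
qd

Here -p <pid> attaches cdb to the running process (the PID is shown in Task Manager and in the I_MPI_DEBUG startup output), ~* kb prints the call stack of every thread, and qd detaches without terminating the process. The output of ~* kb is what we need.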

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
