Software Archive

MPI_Init freezes when launching programs on two MICs in the same host

JS1
Beginner

Hi,

I am writing some toy code to test MPI programming on the MICs. My code works on one Xeon Phi card, and also works between the host and one Xeon Phi card, but it freezes when running on two Xeon Phi cards in the same host. The command is: mpirun -host mic0 -n 4 /hellompi.MIC : -host mic1 -n 4 /hellompi.MIC

After digging in, I found that the issue is in the MPI_Init(&argc, &argv) call. If I disable this call, along with the other MPI functions, the program launches on both Xeon Phi cards simultaneously.

Does anyone know why? Thanks!
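
For reference, here is a minimal sketch of the kind of toy program described above (this is not the poster's actual hellompi code, and the build line assumes the Intel mpiicc wrapper with the -mmic coprocessor flag):

/* hellompi.c -- build for the coprocessor with: mpiicc -mmic hellompi.c -o hellompi.MIC */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);   /* the poster reports the two-card run freezes here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}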

Gregg_S_Intel
Employee

Does this run on each card, individually?  That is, do both of these complete?

mpirun -host mic0 -n 4 a.out

mpirun -host mic1 -n 4 a.out

JS1
Beginner

Yes, running on an individual card works well. I even tried the Monte Carlo code from your article (http://software.intel.com/en-us/articles/using-the-intel-mpi-library-on-intel-xeon-phi-coprocessor-systems#viewSource); it also only worked on individual cards, not on two simultaneously. After waiting for a while, it returns the error messages below (I used mpiexec.hydra in this case):

[prompt]# /opt/intel/impi/4.1.1/bin64/mpiexec.hydra -host mic0 -n 1 /micfs/mc.MIC : -host mic1 -n 1 /micfs/mc.MIC
rank = 1, revents = 29, state = 1
Assertion failed in file ../../socksm.c at line 2963: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
[mpiexec@hostname] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:98): one of the processes terminated badly; aborting
[mpiexec@hostname] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@hostname] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:440): launcher returned error waiting for completion
[mpiexec@hostname] main (./ui/mpich/mpiexec.c:847): process manager error waiting for completion

Gregg Skinner (Intel) wrote:

Does this run on each card, individually?  That is, do both of these complete?

mpirun -host mic0 -n 4 a.out

mpirun -host mic1 -n 4 a.out

Gregg_S_Intel
Employee

Have the cards been configured for networking?  You should be able to ssh from one card to the other. 

Set up a static or DHCP bridge, as described in the Cluster Setup Guide, which is found in the Intel(R) MPSS docs directory.
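
A quick way to verify this (assuming the default mic0/mic1 hostnames) is to check that ssh works from the host to each card and, crucially, from one card to the other, e.g.:

ssh mic0 hostname
ssh mic1 hostname
ssh mic0 ssh mic1 hostname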

 

TimP
Honored Contributor III

Gregg Skinner (Intel) wrote:

Have the cards been configured for networking?  You should be able to ssh from one card to the other. 

Set up a static or DHCP bridge, as described in the Cluster Setup Guide, which is found in the Intel(R) MPSS docs directory.

 

In particular, there is a useful sshconnectivity script in the unpacked installation directory for Intel MPI. Once the coprocessor is running, this script should be run for each user and for root, followed by sudo service mpss stop; sudo micctrl --resetconfig; sudo service mpss start.

I spent an extra hour this week on an MPSS upgrade; the short of it is that none of the convoluted steps in the readme can be skipped, including micctrl --initconfig, followed eventually by the sshconnectivity and resetconfig steps.
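
As a rough sketch of that sequence (the script name and location vary by Intel MPI/MPSS version, and hosts.txt is just an example name for the node-list file):

# run once for root and once for each MPI user
./sshconnectivity.exp hosts.txt

# then regenerate the configuration and restart MPSS
sudo service mpss stop
sudo micctrl --resetconfig
sudo service mpss start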

Look at the online docs, such as:

http://software.intel.com/sites/default/files/forum/393956/intelr-mpss-for-linux-troubleshoot-flow-chart.pdf

JS1
Beginner

Great! That solved the issue. Thanks!

Gregg Skinner (Intel) wrote:

Have the cards been configured for networking?  You should be able to ssh from one card to the other. 

Set up a static or DHCP bridge, as described in the Cluster Setup Guide, which is found in the Intel(R) MPSS docs directory.

 

JS1
Beginner

Thanks! Very useful information. By the way, can you elaborate a bit on how to use the sshconnectivity script? For example, what filename does the script expect?

TimP (Intel) wrote:

Gregg Skinner (Intel) wrote:

Have the cards been configured for networking?  You should be able to ssh from one card to the other. 

Set up a static or DHCP bridge, as described in the Cluster Setup Guide, which is found in the Intel(R) MPSS docs directory.

 

In particular, there is a useful sshconnectivity script in the unpacked installation directory for Intel MPI. Once the coprocessor is running, this script should be run for each user and for root, followed by sudo service mpss stop; sudo micctrl --resetconfig; sudo service mpss start.

I spent an extra hour this week on an MPSS upgrade; the short of it is that none of the convoluted steps in the readme can be skipped, including micctrl --initconfig, followed eventually by the sshconnectivity and resetconfig steps.

Look at the online docs, such as:

http://software.intel.com/sites/default/files/forum/393956/intelr-mpss-f...

TimP
Honored Contributor III

sshconnectivity reads a file which you create containing a list of the names of the nodes for setting up ssh, such as the names from /etc/hosts on all the nodes (omitting IP addresses).  I believe it's mentioned briefly in the MPI setup doc.
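
In other words, the file is just one node name per line, something like the following (the hostnames here are only placeholders):

myhost
myhost-mic0
myhost-mic1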

I have a platform on which I have to run a network restart on the host first; apparently due to some out-of-sequence initialization on power-up, it doesn't set the correct host IP address, but a network restart (or a hot reboot) corrects it.
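
(On the RHEL/CentOS-style hosts typical of MPSS installs of that era, that restart is simply something like sudo service network restart on the host.)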

JS1
Beginner

Thank you!

TimP (Intel) wrote:

sshconnectivity reads a file which you create containing a list of the names of the nodes for setting up ssh, such as the names from /etc/hosts on all the nodes (omitting IP addresses).  I believe it's mentioned briefly in the MPI setup doc.

I have a platform on which I have to run a network restart on the host first; apparently due to some out-of-sequence initialization on power-up, it doesn't set the correct host IP address, but a network restart (or a hot reboot) corrects it.
