I thought I was done with communication problems, but I guess not. I just installed another three Phi cards in a workstation, and now I have the following problem: I can ssh from the host to any Phi card and back, but I can't ssh from one card to another, for example from mic0 to mic1. I can't even ping one card from another. I assumed that all cards on the same host should be able to communicate with each other. The default IP addresses are being used (172.31.1.254 for the host, 172.31.1.1 for mic0, 172.31.2.1 for mic1, etc.).
Any ideas? Thanks!
All of this is explained in the Intel® Manycore Platform Software Stack (Intel® MPSS) Boot Configuration Guide, in Section 4.4.4.4 on internal bridges. (I am so glad the folks in the MPSS group put that manual together; I am always looking things up in it myself.) To get the coprocessors within a single host to talk to each other, you need to set up an internal bridge on the host. After you read that section, go down to the section on micctrl and look at the --network option. You might find that option easier than editing the micN.conf files.
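As a rough illustration of the micctrl route, here is a dry-run sketch that only prints the commands it would suggest. The flag spellings (--addbridge, --type=internal, --network=static, --bridge, --ip) are from my recollection of the MPSS 3.x documentation and should be checked against the guide for your release. The bridge name br0 and the per-card addresses on 172.31.1.x are assumptions chosen for the example.

```shell
#!/bin/sh
# Dry run: prints candidate micctrl commands, does not execute them.
# ASSUMPTIONS: bridge name, addresses, and flag spellings are illustrative;
# verify them against the MPSS guide before running anything as root.
BRIDGE=br0
HOST_IP=172.31.1.254

print_bridge_plan() {
    # One internal bridge on the host ...
    echo "micctrl --addbridge=$BRIDGE --type=internal --ip=$HOST_IP"
    # ... then attach each card to it with an address on the same subnet.
    i=0
    for card in "$@"; do
        i=$((i + 1))
        echo "micctrl --network=static --bridge=$BRIDGE --ip=172.31.1.$i $card"
    done
}

print_bridge_plan mic0 mic1
```

The point of the design is that every card ends up on the same subnet as the host end of the bridge, instead of each card having its own private subnet.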
Thanks again, Frances! I guess I should RTFM. I will try the micctrl option, but first I must take a cocktail break...
I have a problem using two cards together with the host for an MPI job. With the host and one card, it works.
$ mpirun -host localhost -n 1 ./impact : -host mic0 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib -n 4 /home/wangx147/impact_run/xmain.MIC
# of lines before beam line elements: 11 13
onblem = 1258
Start simulation:
nblem: 1258 1258
But if I use both cards, it hangs for a while and then prints an error message.
$ mpirun -host localhost -n 1 ./impact : -host mic0 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib -n 2 /home/wangx147/impact_run/xmain.MIC : -host mic1 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib -n 2 /home/wangx147/impact_run/xmain.MIC
rank = 3, revents = 29, state = 1
rank = 3, revents = 29, state = 1
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2973: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2973: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
xmain.MIC 00000000008DEBD1 Unknown Unknown Unknown
xmain.MIC 00000000008DCF67 Unknown
......
Any suggestions? Thanks in advance!
You don't say whether you are using tcp or ofed, but let's assume tcp. You can ssh from the host to either coprocessor, right? Have you tried to ssh into one of the coprocessors and then ssh from that coprocessor into the other? If that does not work, make sure you have a bridge on the host connecting the host and both coprocessors, and that the /etc/hosts file on each coprocessor has the correct address for the other one. The instructions for setting up the bridge are now in the MPSS Users Guide; look for "micctrl Based Internal Bridge Configuration Implementation".
Additionally, I'd recommend simplifying your MPI command line by replacing the MPI application with the 'hostname' utility, just to check that there are no connectivity issues and that the MPI process launcher works.
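A minimal sketch of that simplified command line, built from the same -host and -n options used in the command posted above. The helper name hostname_check_cmd is made up for this example; it only prints the command so that you can inspect it before running it.

```shell
#!/bin/sh
# Build an mpirun command that runs 'hostname' once on each listed host.
# hostname_check_cmd is a hypothetical helper; it prints, it does not execute.
hostname_check_cmd() {
    cmd="mpirun"
    sep=""
    for h in "$@"; do
        cmd="$cmd$sep -host $h -n 1 hostname"
        sep=" :"
    done
    echo "$cmd"
}

hostname_check_cmd localhost mic0 mic1
```

If the printed command runs and each rank reports its own hostname, the launcher can reach every host. If it hangs or fails in the same way as the real job, the problem is in connectivity rather than in the application.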
As far as I remember, this failure can happen on some network configurations (on the static pair topology, if I'm not mistaken) when IP forwarding is disabled:
# cat /proc/sys/net/ipv4/ip_forward
0
You can enable it with the following command (a host reboot will reset this setting to its default value):
# echo 1 > /proc/sys/net/ipv4/ip_forward
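Since the setting does not survive a reboot, it can help to check it from a script and to know how to persist it. The sketch below reads the same /proc file as above. The function name ip_forward_state and its optional file argument are made up for the example. The sysctl lines in the comments are the standard Linux mechanism, but confirm where your distribution expects the persistent setting to live.

```shell
#!/bin/sh
# Report whether IPv4 forwarding is on. The optional argument lets the
# function be pointed at another file (hypothetical convenience for testing).
ip_forward_state() {
    f=${1:-/proc/sys/net/ipv4/ip_forward}
    if [ "$(cat "$f" 2>/dev/null)" = "1" ]; then
        echo enabled
    else
        echo disabled
    fi
}

echo "IPv4 forwarding is $(ip_forward_state)"

# To enable until the next reboot (as root):
#   sysctl -w net.ipv4.ip_forward=1
# To persist across reboots (as root), add this line to /etc/sysctl.conf
# and reload it with 'sysctl -p':
#   net.ipv4.ip_forward = 1
```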
Thank you, Frances.
As far as I know, tcp is used. I can ssh from the host to either card, but I cannot ssh from one card to the other. I will read the MPSS Users Guide. By the way, how do I check whether the bridge is correctly configured? Do I need root privileges?
Frances Roth (Intel) wrote:
You don't say whether you are using tcp or ofed, but let's assume tcp. You can ssh from the host to either coprocessor, right? Have you tried to ssh into one of the coprocessors and then ssh from that coprocessor into the other? If that does not work, make sure you have a bridge on the host connecting the host and both coprocessors, and that the /etc/hosts file on each coprocessor has the correct address for the other one. The instructions for setting up the bridge are now in the MPSS Users Guide; look for "micctrl Based Internal Bridge Configuration Implementation".
You need root privileges to set up a bridge but not to check on it.
Use the command 'brctl show' and look for a bridge containing the mic0 and mic1 interfaces. If no bridge shows up, then someone with root privileges will need to set one up. Use the command 'ifconfig <bridge_name>' where <bridge_name> is the name of your bridge and look for the word 'UP'. If the output says the bridge is down, then someone with root privileges will need to bring the bridge up. If the bridge is up, check the /etc/hosts file on each coprocessor. The /etc/hosts file on each coprocessor must contain the name and address for the host and for the other coprocessor. If it does not, then someone with root privileges will need to fix the configuration.
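The /etc/hosts part of those checks can be scripted without root privileges. The sketch below is meant to be run on each coprocessor. The helper name hosts_has and the names being looked up (host, mic0, mic1) are assumptions for the example, so substitute the names your site uses.

```shell
#!/bin/sh
# On the host, inspect the bridge first (no root needed):
#   brctl show
#   ifconfig <bridge_name>     # look for the word UP
#
# hosts_has NAME [FILE]: succeed if NAME appears as a hostname or alias
# on a non-comment line of FILE (default /etc/hosts). Hypothetical helper.
hosts_has() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            for (i = 2; i <= NF; i++) {      # field 1 is the address
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit !found }
    ' "${2:-/etc/hosts}"
}

# ASSUMPTION: these are the names the host and the cards are known by.
for n in host mic0 mic1; do
    if hosts_has "$n"; then
        echo "$n: present"
    else
        echo "$n: MISSING"
    fi
done
```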
Hope that helps.
Thank you very much for your reply, Frances.
The connection has now been reconfigured, and I can ssh between the cards and the host. But here comes another problem, and I don't know what the source is. I have attached some output for your information. Is it an mpirun problem that could be fixed by using the proper options?
[wangx147@csp-015 ~]$
[wangx147@csp-015 ~]$ mpirun -host mic0 -n 5 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib /home/wangx147/impact_run/xmain.MIC
[proxy:0:0@csp-015-mic0] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "csp-015-mic0" to "192.168.6.79" (Network is unreachable)
[proxy:0:0@csp-015-mic0] main (../../pm/pmiserv/pmip.c:372): unable to connect to server 192.168.6.79 at port 36020 (check for firewalls!)
[wangx147@csp-015 ~]$
[wangx147@csp-015 ~]$ echo $SSH_CONNECTION
192.168.6.64 59197 192.168.6.79 22
[wangx147@csp-015 ~]$
[wangx147@csp-015 ~]$ ssh mic0
[wangx147@csp-015-mic0 ~]$ echo $SSH_CONNECTION
10.1.7.79 60220 10.1.7.100 22
[wangx147@csp-015-mic0 ~]$
[wangx147@csp-015-mic0 ~]$ exit
logout
Connection to mic0 closed.
[wangx147@csp-015 ~]$ micctrl --config
mic0:
=============================================================
Config Version: 1.1
Linux Kernel: /usr/share/mpss/boot/bzImage-knightscorner
BootOnStart: Enabled
Shutdowntimeout: 300 seconds
ExtraCommandLine: highres=off
PowerManagment: cpufreq_on;corec6_off;pc3_on;pc6_off
Root Device: Dynamic Ram Filesystem /var/mpss/mic0.image.gz from:
Base: CPIO /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
Overlay Filelist /opt/intel/mic/amplxe /opt/intel/mic/amplxe/amplxe.filelist on
Overlay Filelist /opt/intel/mic/vtsspp /opt/intel/mic/vtsspp/vtsspp.filelist on
Overlay Filelist /var/mpss/itt_lib /var/mpss/itt_lib/itt.filelist on
CommonDir: Directory /var/mpss/common
Micdir: Directory /var/mpss/mic0
Network: Static bridge br0
MIC IP: 10.1.7.100
Host IP: 10.1.7.100
Net Bits: 24
NetMask: 255.255.255.0
MtuSize: 1500
Hostname: csp-015-mic0
MIC MAC: 4c:79:ba:32:04:fc
Host MAC: 4c:79:ba:32:04:fd
Cgroup:
Memory: Disabled
Console: hvc0
VerboseLogging: Disabled
CrashDump: /var/crash/mic 16GB
mic1:
=============================================================
Config Version: 1.1
Linux Kernel: /usr/share/mpss/boot/bzImage-knightscorner
BootOnStart: Enabled
Shutdowntimeout: 300 seconds
ExtraCommandLine: highres=off
PowerManagment: cpufreq_on;corec6_off;pc3_on;pc6_off
Root Device: Dynamic Ram Filesystem /var/mpss/mic1.image.gz from:
Base: CPIO /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
Overlay Filelist /opt/intel/mic/amplxe /opt/intel/mic/amplxe/amplxe.filelist on
Overlay Filelist /opt/intel/mic/vtsspp /opt/intel/mic/vtsspp/vtsspp.filelist on
Overlay Filelist /var/mpss/itt_lib /var/mpss/itt_lib/itt.filelist on
CommonDir: Directory /var/mpss/common
Micdir: Directory /var/mpss/mic1
Network: Static bridge br0
MIC IP: 10.1.7.101
Host IP: 10.1.7.100
Net Bits: 24
NetMask: 255.255.255.0
MtuSize: 1500
Hostname: csp-015-mic1
MIC MAC: 4c:79:ba:32:04:e6
Host MAC: 4c:79:ba:32:04:e7
Cgroup:
Memory: Disabled
Console: hvc0
VerboseLogging: Disabled
CrashDump: /var/crash/mic 16GB
[wangx147@csp-015 ~]$
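One way to read the output above: the proxy on mic0 is told to connect back to 192.168.6.79, which the first SSH_CONNECTION line shows to be the host's external address, while mic0 sits on the 10.1.7.0/24 bridge network, where the host appears as 10.1.7.79. The sketch below simply checks whether two addresses fall in the same subnet, which is consistent with the "Network is unreachable" message. The helper names are made up for the example. As an untested assumption, pointing the launcher at the bridge interface may help (Hydra-based launchers generally have an interface option; for Intel MPI I recall -iface or I_MPI_HYDRA_IFACE, but check the reference manual for your version).

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the address as a single integer.
# The body runs in a subshell so that changing IFS does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# same_subnet ADDR1 ADDR2 BITS: succeed if both addresses share the prefix.
same_subnet() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# Addresses taken from the output above; /24 is the "Net Bits" value.
for target in 10.1.7.79 192.168.6.79; do
    if same_subnet 10.1.7.100 "$target" 24; then
        echo "mic0 -> $target: same subnet"
    else
        echo "mic0 -> $target: different subnet, needs a route"
    fi
done
```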