TCP communication between multiple phi cards on the same host

j0e
New Contributor I
1,053 Views

I thought I was done with communication problems, but I guess not. I just installed another three Phi cards in a workstation, and now I have the following problem: I can ssh from the host to any Phi card and back, but I can't ssh from one card to another, such as from mic0 to mic1. I can't even ping one card from another. I assume that all cards on the same host should be able to communicate with each other. The default IP addresses are being used (172.31.1.254 for the host, 172.31.1.1 for mic0, 172.31.2.1 for mic1, etc).

Any ideas? Thanks!

8 Replies
Frances_R_Intel
Employee

All is explained in the Intel® Manycore Platform Software Stack (Intel® MPSS) Boot Configuration Guide, in Section 4.4.4.4 on internal bridges. (I am so glad the folks in the MPSS group put that manual together - I am always looking stuff up in it myself.) To get the coprocessors within a single host to talk to each other, you need to set up an internal bridge on the host. After you look at that section, go down to the section on micctrl and look at the --network option. You might find it easier to use that option than to change the micN.conf files.
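As a rough illustration of the micctrl route, here is a dry-run sketch that only prints the commands rather than running them. The flag spellings (--addbridge, --type=internal, --network=static, --bridge, --ip) are as I recall them from the MPSS guide and may differ between MPSS releases, so treat them as assumptions and check them against your installed documentation. The bridge name br0, the 172.31.1.x addressing, and the helper name bridge_plan are made up for this example.

```shell
#!/bin/sh
# Dry run: print the micctrl commands for an internal bridge; nothing is executed.
# Assumptions: flag names as recalled from the MPSS guide, bridge br0,
# host at 172.31.1.254, cards numbered upward from .1 on the same subnet.
bridge_plan() {
    bridge=$1; host_ip=$2; shift 2
    echo "micctrl --addbridge=$bridge --type=internal --ip=$host_ip"
    n=1
    for card in "$@"; do
        # Reuse the first three octets of the host address for each card.
        echo "micctrl --network=static --bridge=$bridge --ip=${host_ip%.*}.$n $card"
        n=$((n + 1))
    done
}

bridge_plan br0 172.31.1.254 mic0 mic1
```

Review the printed lines against the guide before running any of them as root; the MPSS service normally has to be restarted for network changes to take effect.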

 

j0e
New Contributor I

Thanks again Frances!  I guess I should RTFM.  I will try the micctrl option, but first I must take a cocktail break...

Xiaoge_W_
Beginner

I have a problem running an MPI job on the host plus two cards. With the host and one card, it works.

$ mpirun -host localhost -n 1 ./impact : -host mic0 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib -n 4 /home/wangx147/impact_run/xmain.MIC 

 # of lines before beam line elements:           11          13

 onblem =         1258

 Start simulation:

 nblem:         1258        1258

 

But if I use both cards, it hangs for a while and then outputs an error message.

$ mpirun -host localhost -n 1 ./impact : -host mic0 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib -n 2 /home/wangx147/impact_run/xmain.MIC : -host mic1 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib -n 2 /home/wangx147/impact_run/xmain.MIC 

rank = 3, revents = 29, state = 1

rank = 3, revents = 29, state = 1

Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2973: (it_plfd->revents & POLLERR) == 0

internal ABORT - process 0

Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2973: (it_plfd->revents & POLLERR) == 0

internal ABORT - process 0

forrtl: error (69): process interrupted (SIGINT)

Image              PC                Routine            Line        Source             

xmain.MIC          00000000008DEBD1  Unknown               Unknown  Unknown

xmain.MIC          00000000008DCF67  Unknown             

......

Any suggestions? Thanks in advance!

 

Frances_R_Intel
Employee

You don't say if you are using tcp or ofed, but let's assume tcp. You can ssh from the host to either coprocessor, right? Have you tried to ssh into one of the coprocessors and then ssh from that coprocessor into the other? If that does not work, make sure you have a bridge on the host connecting the host and both coprocessors, and that the /etc/hosts file on each coprocessor has the correct address for the other one. The instructions for setting up the bridge are now in the MPSS Users Guide - look for micctrl Based Internal Bridge Configuration Implementation.

Artem_R_Intel1
Employee

Additionally, I'd recommend simplifying your MPI command line by replacing the MPI application with the 'hostname' utility, just to check that there aren't any connectivity issues and that the MPI process launcher works fine.
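A minimal sketch of that hostname check: the helper below only builds and prints the mpirun line so it can be reviewed before being run. The function name mpi_hostname_cmd is made up for this example, and the host names localhost, mic0 and mic1 are taken from the command lines earlier in the thread.

```shell
#!/bin/sh
# Print (do not run) an mpirun line that starts `hostname` once on each host.
# mpi_hostname_cmd is a hypothetical helper name, not part of any MPI distribution.
mpi_hostname_cmd() {
    cmd="mpirun"
    sep=""
    for h in "$@"; do
        # Argument sets for different hosts are separated by " :".
        cmd="$cmd$sep -host $h -n 1 hostname"
        sep=" :"
    done
    echo "$cmd"
}

mpi_hostname_cmd localhost mic0 mic1
```

If the printed command runs and each rank reports its own host name, the launcher and basic connectivity are probably fine and the problem is more likely in the application setup.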

As far as I remember, such a failure can happen on some network configurations (on the static pair topology, if I'm not mistaken) when IP forwarding is disabled:

# cat /proc/sys/net/ipv4/ip_forward

0

You can enable it with the following command (a host reboot will reset this setting to its default value):

# echo 1 > /proc/sys/net/ipv4/ip_forward
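The check above can be wrapped in a small helper, sketched here under the assumption of a standard Linux host. The function name forwarding_state and its optional file argument are made up for this example; the argument exists only so the helper can be tried against a scratch file.

```shell
#!/bin/sh
# Report whether IPv4 forwarding is enabled.
# The optional argument overrides the file to read (hypothetical, for experimenting).
forwarding_state() {
    file=${1:-/proc/sys/net/ipv4/ip_forward}
    if [ "$(cat "$file" 2>/dev/null)" = "1" ]; then
        echo enabled
    else
        # Also reported when the file is missing or unreadable.
        echo disabled
    fi
}

forwarding_state
```

On most Linux distributions, adding net.ipv4.ip_forward = 1 to /etc/sysctl.conf keeps the setting across reboots, though whether that suits your site is something to confirm with whoever administers the host.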

Xiaoge_W_
Beginner

Thank you, Frances.

As far as I know, tcp is used. I can ssh from the host to either card, but I cannot ssh from one card to the other. I will read the MPSS Users Guide. By the way, how do I check whether the bridge is correctly configured? Do I need root privileges?

Frances_R_Intel
Employee

You need root privileges to set up a bridge but not to check on it. 

Use the command 'brctl show' and look for a bridge containing the mic0 and mic1 interfaces. If no bridge shows up, then someone with root privileges will need to set one up. Use the command 'ifconfig <bridge_name>' where <bridge_name> is the name of your bridge and look for the word 'UP'. If the output says the bridge is down, then someone with root privileges will need to bring the bridge up. If the bridge is up, check the /etc/hosts file on each coprocessor. The /etc/hosts file on each coprocessor must contain the name and address for the host and for the other coprocessor. If it does not, then someone with root privileges will need to fix the configuration. 

Hope that helps.
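The checks above can be sketched as two small shell helpers. This is only an illustration: the function names bridge_has_ifaces and hosts_has are made up, and the parsing assumes the usual `brctl show` layout (a header row, one row per bridge, and continuation rows holding just an interface name).

```shell
#!/bin/sh
# Succeed if every named interface is attached to the given bridge.
# Reads `brctl show` output on stdin; assumes the usual column layout.
bridge_has_ifaces() {
    bridge=$1; shift
    awk -v br="$bridge" -v want="$*" '
        NR == 1 { next }                  # skip the header row
        NF >= 3 { cur = $1 }              # a multi-field row starts a new bridge
        cur == br && (NF == 1 || NF == 4) { seen[$NF] = 1 }  # last field is an interface
        END {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++) if (!(w[i] in seen)) exit 1
        }'
}

# Succeed if the hosts file ($1) has a non-comment entry naming host $2.
hosts_has() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }' "$1"
}

# Example use on the host (no root needed):
#   brctl show | bridge_has_ifaces br0 mic0 mic1 && echo "bridge looks complete"
# Example use on a coprocessor:
#   hosts_has /etc/hosts mic1 || echo "mic1 missing from /etc/hosts"
```

Neither helper changes anything, so both can be run without root privileges; fixing whatever they report still needs root.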

Xiaoge_W_
Beginner

Thank you very much for your reply, Frances.

The connection is now reconfigured, and I can ssh between the cards and the host. But here comes another problem, and I don't know its source. I have attached some output for your information. Is it an mpirun problem that could be fixed by using the proper options?

[wangx147@csp-015 ~]$ 

[wangx147@csp-015 ~]$ mpirun -host mic0 -n 5 -wdir /home/wangx147/impact_run/ -env LD_LIBRARY_PATH /home/wangx147/mic/lib /home/wangx147/impact_run/xmain.MIC

[proxy:0:0@csp-015-mic0] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "csp-015-mic0" to "192.168.6.79" (Network is unreachable)

[proxy:0:0@csp-015-mic0] main (../../pm/pmiserv/pmip.c:372): unable to connect to server 192.168.6.79 at port 36020 (check for firewalls!)

 

[wangx147@csp-015 ~]$ 

[wangx147@csp-015 ~]$ echo $SSH_CONNECTION

192.168.6.64 59197 192.168.6.79 22

[wangx147@csp-015 ~]$ 

[wangx147@csp-015 ~]$ ssh mic0

[wangx147@csp-015-mic0 ~]$ echo $SSH_CONNECTION

10.1.7.79 60220 10.1.7.100 22

[wangx147@csp-015-mic0 ~]$ 

 

[wangx147@csp-015-mic0 ~]$ exit

logout

Connection to mic0 closed.

[wangx147@csp-015 ~]$ micctrl --config

 

mic0:

=============================================================

    Config Version: 1.1

 

    Linux Kernel:   /usr/share/mpss/boot/bzImage-knightscorner

    BootOnStart:    Enabled

    Shutdowntimeout: 300 seconds

 

    ExtraCommandLine: highres=off

    PowerManagment: cpufreq_on;corec6_off;pc3_on;pc6_off

 

    Root Device:   Dynamic Ram Filesystem /var/mpss/mic0.image.gz from:

    Base:      CPIO /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz

    Overlay    Filelist /opt/intel/mic/amplxe /opt/intel/mic/amplxe/amplxe.filelist on

    Overlay    Filelist /opt/intel/mic/vtsspp /opt/intel/mic/vtsspp/vtsspp.filelist on

    Overlay    Filelist /var/mpss/itt_lib /var/mpss/itt_lib/itt.filelist on

    CommonDir: Directory /var/mpss/common

    Micdir:    Directory /var/mpss/mic0

 

 

    Network:       Static bridge br0

        MIC IP:    10.1.7.100

        Host IP:   10.1.7.100

        Net Bits:  24

        NetMask:   255.255.255.0

        MtuSize:   1500

        Hostname:  csp-015-mic0

        MIC MAC:   4c:79:ba:32:04:fc

        Host MAC:  4c:79:ba:32:04:fd

 

    Cgroup:

        Memory:    Disabled

 

    Console:        hvc0

    VerboseLogging: Disabled

    CrashDump:      /var/crash/mic 16GB

 

mic1:

=============================================================

    Config Version: 1.1

 

    Linux Kernel:   /usr/share/mpss/boot/bzImage-knightscorner

    BootOnStart:    Enabled

    Shutdowntimeout: 300 seconds

 

    ExtraCommandLine: highres=off

    PowerManagment: cpufreq_on;corec6_off;pc3_on;pc6_off

 

    Root Device:   Dynamic Ram Filesystem /var/mpss/mic1.image.gz from:

    Base:      CPIO /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz

    Overlay    Filelist /opt/intel/mic/amplxe /opt/intel/mic/amplxe/amplxe.filelist on

    Overlay    Filelist /opt/intel/mic/vtsspp /opt/intel/mic/vtsspp/vtsspp.filelist on

    Overlay    Filelist /var/mpss/itt_lib /var/mpss/itt_lib/itt.filelist on

    CommonDir: Directory /var/mpss/common

    Micdir:    Directory /var/mpss/mic1

 

    Network:       Static bridge br0

        MIC IP:    10.1.7.101

        Host IP:   10.1.7.100

        Net Bits:  24

        NetMask:   255.255.255.0

        MtuSize:   1500

        Hostname:  csp-015-mic1

        MIC MAC:   4c:79:ba:32:04:e6

        Host MAC:  4c:79:ba:32:04:e7

 

    Cgroup:

 

        Memory:    Disabled

 

    Console:        hvc0

    VerboseLogging: Disabled

    CrashDump:      /var/crash/mic 16GB

 

[wangx147@csp-015 ~]$ 
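Reading the output above: the proxy on csp-015-mic0 is told to connect back to 192.168.6.79, while the only network configured for the card is the /24 on br0 (10.1.7.x). One possible explanation, not confirmed anywhere in this thread, is that the card has no route to the host's external address. The sketch below shows the subnet comparison behind that guess; the helper names ip_to_int and same_subnet are made up for illustration.

```shell
#!/bin/sh
# Pack a dotted-quad IPv4 address into a single integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1              # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if two addresses share the same network prefix of the given length.
same_subnet() {
    bits=$3
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

# Addresses taken from the output above; 24 is the "Net Bits" value.
for peer in 10.1.7.79 192.168.6.79; do
    if same_subnet 10.1.7.100 "$peer" 24; then
        echo "$peer is on the card's bridge subnet"
    else
        echo "$peer is outside the card's bridge subnet"
    fi
done
```

If that is the cause, the likely directions are to have mpirun advertise the host's bridge address instead of its external one, or to give the cards a route to the external network. Which of these applies here is an assumption to verify against the Intel MPI and MPSS documentation.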

 
