Beginner

Symmetric mode does not run and hangs mic

I have set up a system with Intel Xeon processors and Xeon Phi coprocessors under CentOS 7, kernel 3.10.0-327, with Intel Composer XE 2015.2.164. Installation of MPSS 3.6.1 went smoothly, and setting up the MIC environment over NFS from the host works fine. I am able to compile MPI applications (e.g., https://software.intel.com/en-us/articles/using-the-intel-mpi-library-on-intel-xeon-phi-coprocessor...) for both the host and the coprocessors (with -mmic).

Execution on the host alone or on the coprocessors alone goes well, but execution on both at the same time (symmetric mode) just hangs. If I wait long enough, the MIC coprocessor goes dead; "mpss restart" and "micctrl -Rw" do not help, and only a full system reboot restores access to the coprocessors.

It seems that, during application execution, there is no MPI communication between the host and the coprocessors (yes, I_MPI_MIC=1 is set). Turning the firewall off makes no difference. Please note that I can ssh (passwordless) to the coprocessors and launch the application from there with no problems. What could I possibly be missing? Detailed info attached.

12 Replies
Employee

Hi Alex,

Could you please check that mic0 -> 200.18.41.27 SSH connection works fine?

Also please check the following test scenarios:
setenv I_MPI_MIC 1
mpirun -ppn 1 -n 2 -hosts thor,mic0 hostname

setenv I_MPI_MIC 1
setenv I_MPI_MIC_PREFIX $I_MPI_ROOT/mic/bin/
mpirun -ppn 1 -n 2 -hosts thor,mic0 IMB-MPI1 pingpong

Beginner

Hello Artem,

Setting "I_MPI_FABRICS=shm:tcp" seems to solve all issues.

The following text has been extracted from our private messages.

Thanks for the help.

Alex

===========================================================================================

Hello Artem,

Thanks for your reply.

ssh from the coprocessors (mic0 and mic1) to the host (thor) works just fine.

However

"mpirun -ppn 1 -n 2 -hosts thor,mic0 hostname"

does not work. I get:

[proxy:0:1@thor-mic0] HYDU_create_process (../../utils/launch/launch.c:588): execvp error on file hostname (No such file or directory)

On the other hand, executing hostname as "ssh -x mic0 hostname" works fine. I get:

thor-mic0

As for

"setenv I_MPI_MIC_PREFIX $I_MPI_ROOT/mic/bin/ ; mpirun -ppn 1 -n 2 -hosts thor,mic0 IMB-MPI1 pingpong"

it hangs as originally described. Verbose output is attached.

Alex

===========================================================================================

Hi Alex,

Could you please try the following scenarios and provide the output:

setenv I_MPI_MIC 1
setenv I_MPI_MIC_PREFIX $I_MPI_ROOT/mic/bin/
setenv I_MPI_FABRICS shm:tcp
setenv I_MPI_DEBUG 100
mpirun -v -ppn 1 -n 2 -hosts thor,mic0 IMB-MPI1 pingpong

and

setenv I_MPI_MIC 1
setenv I_MPI_MIC_PREFIX $I_MPI_ROOT/mic/bin/
setenv I_MPI_DEBUG 100
mpirun -ppn 1 -n 2 -hosts thor,mic0 IMB-MPI1 pingpong

BTW, do you have any InfiniBand devices on the host?

===========================================================================================

Hello Artem,

Setting "I_MPI_FABRICS=shm:tcp" seems to solve all issues. "IMB-MPI1 pingpong" works as expected when this environment variable is set, and hangs when it is not. My original application (montecarlo.c) now also works in symmetric mode as expected.
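
For readers who use sh or bash rather than csh, the workaround could be scripted roughly as follows. This is only a sketch: the host names thor and mic0 and the IMB-MPI1 pingpong test come from this thread, while the RUN switch is a hypothetical convenience added here so that the script prints the command unless explicitly asked to launch it.

```shell
#!/bin/sh
# Symmetric-mode workaround from this thread, in Bourne-shell syntax.
# Assumption: Intel MPI's environment (I_MPI_ROOT, mpirun on PATH) is already set up.

export I_MPI_MIC=1
export I_MPI_MIC_PREFIX="${I_MPI_ROOT:-}/mic/bin/"
# shm:tcp selects shared memory inside a node and TCP between nodes.
export I_MPI_FABRICS=shm:tcp

CMD="mpirun -ppn 1 -n 2 -hosts thor,mic0 IMB-MPI1 pingpong"

# RUN is a hypothetical switch: print the command by default, launch with RUN=1.
if [ "${RUN:-0}" = "1" ]; then
    $CMD
else
    echo "would run: $CMD"
fi
```

Running it as `RUN=1 sh script.sh` launches the benchmark; without RUN it only shows what would be executed.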

Is there any comprehensive Intel documentation about these environment variables? I mean, there are no InfiniBand devices on the host, no OFED drivers have been installed, and I had no idea I needed to customize the fabric settings.

Thanks for your help.

Alex

============================================================================================

Hi Alex,

You can find detailed information about I_MPI_FABRICS (and other Intel MPI variables) in the Intel® MPI Library for Linux* OS Reference Manual. Could you please share this workaround in the topic (it may help somebody else in the future)?

For some reason the default fabric doesn't work on your host. Could you please run the following scenario and provide the output:
setenv I_MPI_MIC 1
setenv I_MPI_MIC_PREFIX $I_MPI_ROOT/mic/bin/
setenv I_MPI_DEBUG 100
mpirun -ppn 1 -n 2 -hosts thor,mic0 IMB-MPI1 pingpong 

============================================================================================

Hello Artem,

I am attaching the outputs of "IMB-MPI1 pingpong" with and without I_MPI_FABRICS=shm:tcp.

Alex

 

Employee

Hi Alex,

Could you please install OFED according to the Intel® Manycore Platform Software Stack (Intel® MPSS) User's Guide (the chapter "Installing OFED with Intel® MPSS Support (optional)") and try the default MPI scenario (without I_MPI_FABRICS=shm:tcp)?

OFED is required for the Intel® MPI Library to work optimally (even if there aren't any Intel® True Scale or Mellanox* InfiniBand* devices on the system). As far as I can see, this requirement isn't obvious from the related documentation; I'll submit an internal ticket to fix this.

Beginner

Hello Artem,

I agree it is not clear (from the MPSS User's Guide) whether OFED should be installed when no InfiniBand devices are present. If it does improve performance, it should always be installed, right?

Now, Section 3.6 of the MPSS User's Guide, regarding OFED installation, is quite confusing. My OS is CentOS 7, kernel 3.10.0-327, but the guide recommends rebuilding the OFED rpm only for RHEL* 6.x and SLES* 11. That does not seem right...

Anyway, having rebuilt mpss-modules successfully (as recommended), I tried to rebuild ofed-driver (in mpss-3.6.1/src), but it fails. I get several messages like

 #include <asm/types.h>
                       ^
compilation terminated.
In file included from include/linux/compiler.h:182:0,
                 from include/linux/linkage.h:4,
                 from include/linux/kernel.h:6,
                 from include/linux/interrupt.h:5,
                 from /root/rpmbuild/BUILD/ofed-driver/ofa_kernel-1.5.4.1/drivers/infiniband/ulp/sdp/sdp_rx.c:32:
include/uapi/linux/types.h:4:23: fatal error: asm/types.h: No such file or directory

This looks like the problem described in https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/518265, which seems to be an unresolved issue from a previous version of mpss.

Additionally, I am not sure if dapl is necessary or not but I also cannot rebuild it. I get several messages like

 #include <rdma/rdma_cma.h>
                           ^
compilation terminated.
In file included from ./dapl/openib_cma/dapl_ib_util.h:36:0,
                 from ./dapl/include/dapl.h:263,
                 from dapl/udapl/dapl_evd_query.c:39:
./dapl/openib_cma/linux/openib_osd.h:4:27: fatal error: rdma/rdma_cma.h: No such file or directory

On the other hand, libibscif rebuilds just fine.

Finally, in case I do get ofed-driver rebuilt for my OS and kernel, which installation instructions should I follow? I mean, which rpm should then be installed? The MPSS User's Guide gives details about that only for OFED+ (section 3.6.3)...

Alex

Employee

Hi Alex,

OFED is required for the Intel® MPI Library to work with its default settings (that is, without additional variables like I_MPI_FABRICS). The default settings are recommended for most cases.

Regarding the OFED installation, just follow the instructions for the appropriate OFED version in the MPSS User's Guide. As far as I can see, you tried OFED 1.5.4.1, so the chapter "3.6.4 Install OFED 1.5.4.1" should contain the instructions.

I'd also recommend trying a newer OFED (for example, 3.18-1); in my opinion its installation instructions are easier. In any case, check the release notes of the OFED version you install, because there may be problems if your OS isn't officially supported.

Beginner

Hello Artem,

Installing OFED 3.18.2 on my CentOS 7 fails on the package compat-rdma (log attached). All other packages install OK.

It looks like a silly warning message being taken too seriously, but I can't figure out how to turn it off...

Alex

Employee

Hi Alex,

compat-rdma package is not needed and can be ignored.

Beginner

Hello Artem,

I managed to install OFED 3.18.2 successfully without the optional packages compat-rdma & compat-rdma-devel. I rebooted the system and... nothing! I am not sure which services should be running since I don't have any InfiniBand hardware installed. In /etc/rc.d/rc5.d there are now the services "rdma-ndd", "srpd", "opensmd" and "ibacm". Of those, only opensmd is mentioned in the MPSS User's Guide (section 3.6.9 Starting OFED). "openibd" and "ofed-mic" are not present, and "mpxyd" is in /etc/rc.d/init.d but is not linked to any startup sequence. The "opensmd" service does not start. "systemctl status opensmd" shows

● opensmd.service - LSB: Activates/Deactivates InfiniBand Subnet Manager
   Loaded: loaded (/etc/rc.d/init.d/opensmd)
   Active: failed (Result: exit-code) since Fri 2016-01-29 16:53:10 BRST; 15min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 4272 ExecStart=/etc/rc.d/init.d/opensmd start (code=exited, status=1/FAILURE)

Jan 29 16:53:04 thor systemd[1]: Starting LSB: Activates/Deactivates InfiniBand Subnet Manager...
Jan 29 16:53:04 thor OpenSM[4278]: /var/log/opensm.log log file opened
Jan 29 16:53:04 thor OpenSM[4278]: OpenSM 3.3.19
Jan 29 16:53:10 thor opensmd[4272]: Starting IB Subnet Manager......[FAILED]
Jan 29 16:53:10 thor systemd[1]: opensmd.service: control process exited, code=exited status=1
Jan 29 16:53:10 thor systemd[1]: Failed to start LSB: Activates/Deactivates InfiniBand Subnet Manager.
Jan 29 16:53:10 thor systemd[1]: Unit opensmd.service entered failed state.
Jan 29 16:53:10 thor systemd[1]: opensmd.service failed.

 /var/log/opensm.log shows


Jan 29 12:49:21 374338 [E1644740] 0x03 -> OpenSM 3.3.19
OpenSM 3.3.19

Jan 29 12:49:21 374409 [E1644740] 0x80 -> OpenSM 3.3.19
ibwarn: [1362] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Jan 29 12:49:21 375199 [E1644740] 0x01 -> osm_vendor_init: ERR 5415: Error opening UMAD

Error from osm_opensm_init: IB_INSUFFICIENT_RESOURCES.
Jan 29 12:49:21 375219 [E1644740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jan 29 12:50:57 686088 [CC1B2740] 0x03 -> OpenSM 3.3.19
OpenSM 3.3.19

Jan 29 12:50:57 686192 [CC1B2740] 0x80 -> OpenSM 3.3.19
ibwarn: [1764] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Jan 29 12:50:57 687120 [CC1B2740] 0x01 -> osm_vendor_init: ERR 5415: Error opening UMAD

Error from osm_opensm_init: IB_INSUFFICIENT_RESOURCES.
Jan 29 12:50:57 687143 [CC1B2740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jan 29 12:51:58 797082 [90F04740] 0x03 -> OpenSM 3.3.19
OpenSM 3.3.19

Jan 29 12:51:58 797189 [90F04740] 0x80 -> OpenSM 3.3.19
ibwarn: [2093] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Jan 29 12:51:58 797960 [90F04740] 0x01 -> osm_vendor_init: ERR 5415: Error opening UMAD

Error from osm_opensm_init: IB_INSUFFICIENT_RESOURCES.
Jan 29 12:51:58 797982 [90F04740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jan 29 12:52:23 327067 [1FEB5740] 0x03 -> OpenSM 3.3.19
OpenSM 3.3.19

Jan 29 12:52:23 327391 [1FEB5740] 0x80 -> OpenSM 3.3.19
ibwarn: [2244] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Jan 29 12:52:23 328038 [1FEB5740] 0x01 -> osm_vendor_init: ERR 5415: Error opening UMAD

Error from osm_opensm_init: IB_INSUFFICIENT_RESOURCES.
Jan 29 12:52:23 328058 [1FEB5740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jan 29 14:57:49 827030 [3EBB0740] 0x03 -> OpenSM 3.3.19
OpenSM 3.3.19

Jan 29 14:57:49 827119 [3EBB0740] 0x80 -> OpenSM 3.3.19
ibwarn: [10448] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Jan 29 14:57:49 827870 [3EBB0740] 0x01 -> osm_vendor_init: ERR 5415: Error opening UMAD

Error from osm_opensm_init: IB_INSUFFICIENT_RESOURCES.
Jan 29 14:57:49 827905 [3EBB0740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jan 29 16:53:04 054840 [281A3740] 0x03 -> OpenSM 3.3.19
OpenSM 3.3.19

Jan 29 16:53:04 055102 [281A3740] 0x80 -> OpenSM 3.3.19
ibwarn: [4278] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Jan 29 16:53:04 056341 [281A3740] 0x01 -> osm_vendor_init: ERR 5415: Error opening UMAD

Error from osm_opensm_init: IB_INSUFFICIENT_RESOURCES.
Jan 29 16:53:04 056360 [281A3740] 0x02 -> osm_vendor_init: 1000 pending umads specified
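
The question repeated in the log ("is ib_umad module loaded?") can be checked directly. The sketch below only tests for the sysfs file named in the warning; the helper name umad_status is made up for illustration, and the thread does not establish whether loading ib_umad on a machine without InfiniBand hardware is enough for opensmd to start.

```shell
#!/bin/sh
# Check for the file that OpenSM complains about in /var/log/opensm.log.
ABI_FILE=/sys/class/infiniband_mad/abi_version

# umad_status is a hypothetical helper; $1 is the path to the abi_version file.
umad_status() {
    if [ -r "$1" ]; then
        echo "ib_umad loaded (ABI version $(cat "$1"))"
    else
        echo "ib_umad not loaded: $1 is missing"
    fi
}

umad_status "$ABI_FILE"
# If the file is missing, "modprobe ib_umad" (as root) is the usual way to load
# the module, assuming it was built for the running kernel.
```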

 

Beginner

Hello Artem,

It seems I need a package like "ofed-driver" from mpss-3.6.1/ofed/modules to get the services "openibd" and "ofed-mic". But that brings me back to where we started, since I could not rpmbuild "ofed-driver" on my CentOS 7.2 system...

Alex

Employee

Hi Alex,

In your case (1 node without IB devices) OFED is needed mainly for ibscif.

From Intel® Manycore Platform Software Stack (Intel® MPSS) User's Guide:

2.2.2.3 Intel® MPSS Modules and Daemons:
Intel® MPSS also includes an optional InfiniBand* over SCIF (ibscif) driver which emulates an
InfiniBand* HCA to the higher levels of the OFED stack. This driver uses SCIF to provide high
BW, low latency communication between multiple Intel® Xeon Phi™ coprocessors in an Intel®
Xeon™ host platform, for example between MPI ranks on separate coprocessors.

3.6 Installing OFED with Intel® MPSS Support (optional):
The ibscif virtual adapter will provide the best host-to-coprocessor and
coprocessor-to-coprocessor transfer performance on systems without an InfiniBand* adapter.

So you need openibd, ofed-mic and (potentially) mpxyd; see the chapter "3.6.9 Starting OFED". I am not sure whether the OFED installation configures these services to start automatically, so you may need to start them, or set up their startup, manually.
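
A quick way to see which of these services are actually installed is to look for their init scripts. The following is a rough sketch: the service names come from this thread, /etc/rc.d/init.d is the directory mentioned earlier for mpxyd, and check_service and INIT_DIR are names invented here for illustration.

```shell
#!/bin/sh
# Report which of the OFED-related init scripts exist on this system.
# INIT_DIR is a hypothetical override; the default is the CentOS 7 location.
INIT_DIR="${INIT_DIR:-/etc/rc.d/init.d}"

# check_service is a hypothetical helper: $1 = service name, $2 = init script directory.
check_service() {
    if [ -x "$2/$1" ]; then
        echo "$1: init script found"
    else
        echo "$1: init script missing"
    fi
}

for svc in openibd ofed-mic mpxyd; do
    check_service "$svc" "$INIT_DIR"
done
```

For any script that is found, "service <name> start" (as root) would be the usual way to start it by hand; a missing script suggests the package providing it was not installed.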

 

 

Beginner

Chances are, you need to enable ip forwarding

/sbin/sysctl -w net.ipv4.ip_forward=1
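
Before changing the setting, it may be worth checking its current value. The sketch below reads the standard Linux proc entry; the function name ip_forward_state is made up for illustration.

```shell
#!/bin/sh
# Report the current IPv4 forwarding state without changing it.

# ip_forward_state is a hypothetical helper; $1 optionally overrides the proc path.
ip_forward_state() {
    f="${1:-/proc/sys/net/ipv4/ip_forward}"
    if [ -r "$f" ]; then
        case "$(cat "$f")" in
            1) echo "enabled" ;;
            0) echo "disabled" ;;
            *) echo "unknown" ;;
        esac
    else
        echo "unknown"
    fi
}

echo "net.ipv4.ip_forward is $(ip_forward_state)"
```

Note that "sysctl -w" only changes the running kernel; to keep the value across reboots, the line "net.ipv4.ip_forward = 1" is normally added to /etc/sysctl.conf.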

 

Beginner

I don't know where this setting comes from but net.ipv4.ip_forward=1 is already on...

Would I need ipv6 enabled as well?

Alex
