I have a system with two Xeon Phi cards installed, running Red Hat 7.0. I am able to run code on the cards as pure offload, and I can ssh into the cards. I am trying to get symmetric mode to work.
1) Does symmetric mode require OFED, or is OFED only required when there is a physical InfiniBand card?
2) What are the proper steps to verify that the SCIF driver is properly loaded? mic shows up as a driver, but there is nothing named SCIF in the list.
[root@infinity ~]# lsmod
Module Size Used by
mic 666166 16
vtsspp 372813 0
sep3_15 527535 0
pax 13181 0
bridge 115385 0
stp 12976 1 bridge
llc 14552 2 stp,bridge
ipt_REJECT 12541 2
xt_comment 12504 2
nf_conntrack_ipv4 14862 2
nf_defrag_ipv4 12729 1 nf_conntrack_ipv4
xt_conntrack 12760 2
nf_conntrack 105702 2 xt_conntrack,nf_conntrack_ipv4
iptable_filter 12810 1
ip_tables 27239 1 iptable_filter
intel_powerclamp 18764 0
coretemp 13435 0
intel_rapl 18773 0
kvm 461126 0
iTCO_wdt 13480 0
crct10dif_pclmul 14289 0
crc32_pclmul 13113 0
crc32c_intel 22079 0
ghash_clmulni_intel 13259 0
iTCO_vendor_support 13718 1 iTCO_wdt
cryptd 20359 1 ghash_clmulni_intel
mei_me 18646 0
sb_edac 26819 0
pcspkr 12718 0
nfsd 290215 13
mei 82723 1 mei_me
edac_core 57650 1 sb_edac
lpc_ich 21073 0
mfd_core 13435 1 lpc_ich
i2c_i801 18135 0
auth_rpcgss 59343 1 nfsd
nfs_acl 12837 1 nfsd
lockd 93977 1 nfsd
ipmi_si 53353 0
ipmi_msghandler 45603 1 ipmi_si
sunrpc 295293 15 nfsd,auth_rpcgss,lockd,nfs_acl
shpchp 37032 0
ioatdma 67762 0
acpi_power_meter 18087 0
acpi_pad 116305 0
ext4 562391 7
mbcache 14958 1 ext4
jbd2 102940 1 ext4
raid10 48128 2
sd_mod 45499 12
crc_t10dif 12714 1 sd_mod
crct10dif_common 12595 2 crct10dif_pclmul,crc_t10dif
ast 56119 1
syscopyarea 12529 1 ast
sysfillrect 12701 1 ast
sysimgblt 12640 1 ast
nvidia 8374856 0
drm_kms_helper 98226 1 ast
ttm 93488 1 ast
drm 311588 5 ast,ttm,drm_kms_helper,nvidia
igb 192078 0
ahci 29870 8
libahci 32009 1 ahci
ptp 18933 1 igb
libata 218854 2 ahci,libahci
pps_core 19106 1 ptp
dca 15130 2 igb,ioatdma
i2c_algo_bit 13413 2 ast,igb
i2c_core 40325 7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
wmi 19070 0
dm_mirror 22135 0
dm_region_hash 20862 1 dm_mirror
dm_log 18411 2 dm_region_hash,dm_mirror
dm_mod 104038 25 dm_log,dm_mirror
Hello Dan,
You can run an MPI application in symmetric mode over TCP; OFED isn't required in that case. If you use the Intel MPI Library, just set the environment variable I_MPI_FABRICS=shm:tcp.
For better performance, ibscif is recommended. It requires OFED, but an InfiniBand* device isn't needed. See the Intel® Manycore Platform Software Stack (Intel® MPSS) User's Guide, chapter "Installing OFED with Intel® MPSS Support (optional)". After the installation you will need to start the mpss, openibd, and ofed-mic services (see the instructions in the User's Guide). You can check the status of the scif device with the ibv_devices and ibv_devinfo utilities (these utilities may have some limitations in some OFED versions), and 'lsmod' should show the corresponding scif modules. For this configuration, set the following Intel MPI Library variables:
I_MPI_FABRICS=shm:dapl
I_MPI_DAPL_PROVIDER=ofa-v2-scif0
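To check the 'lsmod' side of this without scanning the whole listing by eye, a small filter can help. This is only a sketch: the function name scif_modules is made up for illustration, and it simply matches any module whose name contains "scif" (such as ibscif or micscif), which is an assumption about naming rather than something the User's Guide specifies.

```shell
#!/bin/bash
# Hypothetical helper: read `lsmod`-style text on stdin and print the
# names of modules whose name contains "scif" (case-insensitive).
# Returns 0 if at least one such module was found, 1 otherwise.
scif_modules() {
    awk 'tolower($1) ~ /scif/ { print $1; found = 1 }
         END { exit (found ? 0 : 1) }'
}
```

On a real system it would be fed from lsmod, for example `lsmod | scif_modules`, once on the host and once on each card over ssh.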
Artem,
I am still having issues. I was unable to compile the OFED drivers supplied by MPSS because the kernel headers are now split into uapi directories, so the compiler can't see them. I tried linking the uapi header files, but I still got errors such as "error: 'struct inet_sock' has no member named 'dport'", so I tried installing the standard OFED libraries instead. I was able to get a clean install, but I can't seem to get anything to run in symmetric mode.
I installed OFED-3.18-1-20150803-0846 with the following options:
./install.pl --with-xeon-phi --all --without-libfabric --without-libfabric-devel --without-fabtests --without-fabtests-debuginfo
It has to be compiled without libfabric because of this bug: http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2544
[root@infinity OFED-3.18-1-20150803-0846]# ibv_devices
device node GUID
------ ----------------
scif0 4c79bafffe5a0099
[root@infinity OFED-3.18-1-20150803-0846]# ibv_devinfo
hca_id: scif0
transport: invalid transport (-1)
fw_ver: 0.0.1
node_guid: 4c79:baff:fe5a:0099
sys_image_guid: 4c79:baff:fe5a:0099
vendor_id: 0x8086
vendor_part_id: 0
hw_ver: 0x1
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1000
port_lmc: 0x00
link_layer: Unknown
[root@infinity-mic0 ~]# lsmod | grep -i scif
ibscif 68368 0
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif
pm_scif 4518 0
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras
dma_module 32560 2 micscif,intel_micveth
ringbuffer 2265 2 micscif,michvc
service openibd start
- starts fine
service ofed-mic start
- starts fine
service opensmd start
- Fails due to no local ports
[root@infinity OFED-3.18-1-20150803-0846]# cat /var/log/opensm.log
Aug 28 09:57:56 165541 [5ABD0740] 0x03 -> OpenSM 3.3.19
OpenSM 3.3.19
Aug 28 09:57:56 165598 [5ABD0740] 0x80 -> OpenSM 3.3.19
No local ports detected!
Aug 28 09:57:56 172566 [5ABD0740] 0x02 -> osm_vendor_init: 1000 pending umads specified
service mpxyd start
- starts fine
$ cat helloworldtest.sh
#!/bin/bash
echo "loading environment variables"
source /opt/intel/composer_xe_2015.2.164/bin/compilervars.sh intel64
source /opt/intel/impi/5.0.3.048/bin64/mpivars.sh
echo "Running the jobs"
export I_MPI_DEVICE=rdssm
export I_MPI_MIC=1
export I_MPI_FABRICS_LIST=shm:dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto
# export I_MPI_FALLBACK_DEVICE=0
mpirun -n 6 -host mic0 /data/mpirun/dan/helloworld.mic : -n 6 -host mic1 /data/mpirun/dan/helloworld.mic
mpirun -n 6 -host infinity /data/mpirun/dan/helloworld.host
[ddholstad@infinity dan]$ ./helloworldtest.sh
loading environment variables
Running the jobs
infinity-mic0.stat.uiowa.edu:SCM:1351:c3510b40: 247 us(247 us): open_hca: ibv_get_device_list() failed
[1] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
infinity-mic1.stat.uiowa.edu:SCM:1352:71e64b40: 250 us(250 us): open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1355:713e4b40: 245 us(245 us): open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1354:a1d67b40: 240 us(240 us): open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1350:e9c2eb40: 354 us(354 us): open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1355:14046b40: 237 us(237 us): open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1352:8093b40: 625 us(625 us): open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1353:36a04b40: 611 us(611 us): open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1356:2c656b40: 244 us(244 us): open_hca: ibv_get_device_list() failed
[7] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
infinity-mic1.stat.uiowa.edu:SCM:1351:c8024b40: 566 us(566 us): open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1353:518e6b40: 582 us(582 us): open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1354:ffdb40: 599 us(599 us): open_hca: ibv_get_device_list() failed
[10] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[4] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[11] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[5] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[6] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[9] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[8] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
infinity.stat.uiowa.edu:SCM:8363:893edb40: 51 us(51 us): open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8365:b4bbeb40: 49 us(49 us): open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8364:fc131b40: 51 us(51 us): open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8367:cd7aeb40: 69 us(69 us): infinity.stat.uiowa.edu:SCM:8368:7231db40: 59 us(59 us): open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8366:9e91db40: 70 us(70 us): open_hca: ibv_get_device_list() failed
open_hca: ibv_get_device_list() failed
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[2] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[1] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[3] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[4] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[5] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
$ cat helloworld.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> /* gethostname() */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int tid, nthreads;
    char *cpu_name;

    /* 1st: initialize MPI in each process started by mpirun */
    MPI_Init(&argc, &argv);

    /* 2nd: get this process's id, usually called its "rank";
     * the first process has rank (tid) == 0 */
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);

    /* 3rd: this is often useful: get the total number of
     * processes launched by MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

    /* on EVERY process, allocate zeroed space for the machine name */
    cpu_name = (char *)calloc(80, sizeof(char));
    if (cpu_name == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* get the machine name of this particular host ... well,
     * at least the first 79 characters of it, so the zeroed
     * buffer always stays NUL-terminated */
    gethostname(cpu_name, 79);

    printf("hello MPI user: from process = %i on machine=%s, of NCPU=%i processes\n",
           tid, cpu_name, nthreads);

    free(cpu_name);
    MPI_Finalize();
    return 0;
}
Attempts to run the SCIF tutorial example code result in a failure to connect:
Host:
./scif_connect_host -l 2048 -n mic0 -r 2048 -s 1024 -b block
scif_bind to port 2048 success
cannot bind multiple epd to a port : error 22
cannot bind epd to multiple ports : error 22
connection to node 0 failed : trial 20
connection to node 0 failed : trial 19
connection to node 0 failed : trial 18
connection to node 0 failed : trial 17
connection to node 0 failed : trial 16
connection to node 0 failed : trial 15
connection to node 0 failed : trial 14
connection to node 0 failed : trial 13
connection to node 0 failed : trial 12
connection to node 0 failed : trial 11
connection to node 0 failed : trial 10
connection to node 0 failed : trial 9
connection to node 0 failed : trial 8
connection to node 0 failed : trial 7
connection to node 0 failed : trial 6
connection to node 0 failed : trial 5
connection to node 0 failed : trial 4
connection to node 0 failed : trial 3
connection to node 0 failed : trial 2
connection to node 0 failed : trial 1
scif_connect failed with error 111
Mic0:
./scif_accept_mic -l 2048 -s 1024 -b block
scif_bind to port 2048 success
Any idea what I'm missing?
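For reference, the numeric codes in the output above look like standard Linux errno values (an assumption; the tutorial source is not shown here). A minimal sketch of decoding the two that appear, with errno_name being a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: map the two error numbers seen in the SCIF
# tutorial output to their Linux errno names. Assumes the tutorial
# prints raw Linux errno values; other numbers are reported as unknown.
errno_name() {
    case "$1" in
        22)  echo "EINVAL (Invalid argument)" ;;
        111) echo "ECONNREFUSED (Connection refused)" ;;
        *)   echo "unknown errno $1"; return 1 ;;
    esac
}
```

Error 111 (connection refused) normally means nothing was listening on the target port of the target node. Note that the host log reports attempts against node 0, while the kernel messages later in this thread appear to pair mic0 with node 1, so it may be worth checking whether the -n option expects a numeric SCIF node ID rather than a name like mic0.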
Hi Dan,
Error messages like:
infinity-mic0.stat.uiowa.edu:SCM:1351:c3510b40: 247 us(247 us): open_hca: ibv_get_device_list() failed
usually mean that the ofed-mic service isn't running. Please double-check that the service is running and/or try restarting it.
Regarding the Intel MPI environment variables, the following should be enough for experiments:
export I_MPI_MIC=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
Also, please make sure that the specified DAPL provider "ofa-v2-scif0" is present in /etc/dat.conf.
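One way to script that check is sketched below. has_dapl_provider is a hypothetical helper name; it assumes the usual dat.conf layout in which the provider name is the first whitespace-separated field of each entry.

```shell
#!/bin/bash
# Hypothetical helper: succeed if provider $1 is the first field of
# some line in the dat.conf file $2 (defaults to /etc/dat.conf).
# Comment lines never match because their first field starts with '#'.
has_dapl_provider() {
    local provider=$1
    local conf=${2:-/etc/dat.conf}
    awk -v p="$provider" '$1 == p { found = 1 }
                          END { exit (found ? 0 : 1) }' "$conf"
}
```

For example, `has_dapl_provider ofa-v2-scif0 && export I_MPI_DAPL_PROVIDER=ofa-v2-scif0` sets the variable only when the entry exists. Matching the whole first field matters because a plain grep for scif0 would also match entries such as ofa-v2-scif0-u.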
ofed-mic starts OK, but something goes wrong when I kick off the test script: the ssh sessions to the MIC cards drop, and afterwards the ibmodules status command on the cards shows no output. uptime on the MIC shows that it rebooted as a result of the code being run.
Before running the script:
/etc/init.d/ibmodules status
# /etc/init.d/ibmodules status
ibscif 68368 0 - Live 0xffffffffa00d2000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ib_uverbs 31639 1 rdma_ucm, Live 0xffffffffa00ec000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ibp_mlx5 11376 0 - Live 0xffffffffa00ff000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
ibp_mlx4 28939 0 - Live 0xffffffffa0108000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
ibp_sa_client 16807 2 rdma_ucm,rdma_cm, Live 0xffffffffa0117000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ibp_cm_client 20523 1 rdma_cm, Live 0xffffffffa0122000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ib_qib 22867 0 - Live 0xffffffffa012f000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
rdma_ucm 10243 0 - Live 0xffffffffa0153000
rdma_cm 23962 1 rdma_ucm, Live 0xffffffffa0146000
ibp_sa_client 16807 2 rdma_ucm,rdma_cm, Live 0xffffffffa0117000
ib_uverbs 31639 1 rdma_ucm, Live 0xffffffffa00ec000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
After running the script:
/etc/init.d/ibmodules status
[root@infinity-mic1 ~]#
<returns nothing>
[root@infinity ~]# cat /etc/dat.conf | grep scif0
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""
[root@infinity ~]# service ofed-mic status
Status of OFED Stack:
host [ OK ]
mic0 [ OK ]
mic1 [ OK ]
When I kick off a script with just the environment variables you listed, it looks like ofed-mic fails on the cards:
[root@infinity log]# service ofed-mic status
Status of OFED Stack:
host [ OK ]
mic0 [FAILED]
mic1 [FAILED]
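When several cards are installed, it can be handy to pull just the failed entries out of that status output. A minimal sketch, assuming the output format shown above (target name first, then a bracketed state); failed_ofed_targets is an illustrative name, not part of OFED or MPSS:

```shell
#!/bin/bash
# Hypothetical helper: read `service ofed-mic status` output on stdin
# and print the name of every target whose state is [FAILED].
failed_ofed_targets() {
    awk '/\[FAILED\]/ { print $1 }'
}
```

Typical use would be `service ofed-mic status | failed_ofed_targets`, which prints nothing when every target reports OK.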
$ ./helloworldtest.sh
loading environment variables
Running the jobs
[mpiexec@infinity.stat.uiowa.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:773): connection to proxy 0 at host mic0 failed
[mpiexec@infinity.stat.uiowa.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@infinity.stat.uiowa.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@infinity.stat.uiowa.edu] main (../../ui/mpich/mpiexec.c:1059): process manager error waiting for completion
hello MPI user: from process = 0 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 1 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 2 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 3 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 4 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 5 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
Messages from /var/log/messages on the host indicate that the MIC card OS crashed:
Aug 31 10:11:20 infinity.stat.uiowa.edu ntpd[5280]: Listen normally on 9 mic0:ib 192.0.2.100 UDP 123
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1445 node 1
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1445 node 2
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1454 node 2 ready for crash dump!
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: mic1: Transition from state online to lost
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1454 node 1 ready for crash dump!
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: mic0: Transition from state online to lost
Aug 31 10:17:30 infinity.stat.uiowa.edu kernel: micvnet_execute_stop: timeout waiting for link down message response
Aug 31 10:17:30 infinity.stat.uiowa.edu kernel: micvnet_execute_stop: timeout waiting for link down message response
Aug 31 10:17:30 infinity.stat.uiowa.edu kernel: br0: port 1(mic1) entered disabled state
Aug 31 10:17:31 infinity.stat.uiowa.edu kernel: br0: port 2(mic0) entered disabled state
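To find when each card dropped out, the "Transition from state online to lost" lines can be filtered from the host log. This is a sketch under the assumption that the log lines keep the syslog layout shown above (month, day, and time first, and the card name immediately before the word "Transition"); lost_cards is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read syslog-style text on stdin and print
# "<month> <day> <time> <card>" for every card that went from
# state online to lost.
lost_cards() {
    awk '/Transition from state online to lost/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "Transition") {
                card = $(i - 1)        # e.g. "mic1:"
                sub(/:$/, "", card)    # drop the trailing colon
                print $1, $2, $3, card
                break
            }
        }
    }'
}
```

On the host this could be run as `lost_cards < /var/log/messages`; comparing the timestamps with the start time of the MPI job helps confirm that the job triggered the crash.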
Can you pull back and use OFED-3.12-1? I could be wrong (I'm sure Artem can correct me if I am), but I do not believe PSM was included in the 3.12-1 version. If you are using OFED only for communication between the coprocessor and the host (in other words, if you do not have an InfiniBand adapter in your system), it shouldn't matter which version you use, and the problem won't exist in the older version.
If you want to keep using the newer version, I think, based on the bug report you referenced, that you do need libfabric: first install without it, then go back, restore the names of the rpm files that were renamed, and install just libfabric. Of course, that could just be my creative reading of the bug report.
