
Questions about SCIF Driver

Holstad__Dan
Beginner
759 Views

I have a system with two Xeon Phi cards installed, running Red Hat 7.0. I can run code on the cards in pure offload mode and I can ssh into the cards. I am now trying to get symmetric mode to work.

1) Does symmetric mode require OFED, or is OFED only required when there is a physical Infiniband card?

2) What are the proper steps to verify that the SCIF driver is loaded? mic shows up as a driver in the lsmod output, but nothing named SCIF is listed.

[root@infinity ~]# lsmod
Module                  Size  Used by
mic                   666166  16
vtsspp                372813  0
sep3_15               527535  0
pax                    13181  0
bridge                115385  0
stp                    12976  1 bridge
llc                    14552  2 stp,bridge
ipt_REJECT             12541  2
xt_comment             12504  2
nf_conntrack_ipv4      14862  2
nf_defrag_ipv4         12729  1 nf_conntrack_ipv4
xt_conntrack           12760  2
nf_conntrack          105702  2 xt_conntrack,nf_conntrack_ipv4
iptable_filter         12810  1
ip_tables              27239  1 iptable_filter
intel_powerclamp       18764  0
coretemp               13435  0
intel_rapl             18773  0
kvm                   461126  0
iTCO_wdt               13480  0
crct10dif_pclmul       14289  0
crc32_pclmul           13113  0
crc32c_intel           22079  0
ghash_clmulni_intel    13259  0
iTCO_vendor_support    13718  1 iTCO_wdt
cryptd                 20359  1 ghash_clmulni_intel
mei_me                 18646  0
sb_edac                26819  0
pcspkr                 12718  0
nfsd                  290215  13
mei                    82723  1 mei_me
edac_core              57650  1 sb_edac
lpc_ich                21073  0
mfd_core               13435  1 lpc_ich
i2c_i801               18135  0
auth_rpcgss            59343  1 nfsd
nfs_acl                12837  1 nfsd
lockd                  93977  1 nfsd
ipmi_si                53353  0
ipmi_msghandler        45603  1 ipmi_si
sunrpc                295293  15 nfsd,auth_rpcgss,lockd,nfs_acl
shpchp                 37032  0
ioatdma                67762  0
acpi_power_meter       18087  0
acpi_pad              116305  0
ext4                  562391  7
mbcache                14958  1 ext4
jbd2                  102940  1 ext4
raid10                 48128  2
sd_mod                 45499  12
crc_t10dif             12714  1 sd_mod
crct10dif_common       12595  2 crct10dif_pclmul,crc_t10dif
ast                    56119  1
syscopyarea            12529  1 ast
sysfillrect            12701  1 ast
sysimgblt              12640  1 ast
nvidia               8374856  0
drm_kms_helper         98226  1 ast
ttm                    93488  1 ast
drm                   311588  5 ast,ttm,drm_kms_helper,nvidia
igb                   192078  0
ahci                   29870  8
libahci                32009  1 ahci
ptp                    18933  1 igb
libata                218854  2 ahci,libahci
pps_core               19106  1 ptp
dca                    15130  2 igb,ioatdma
i2c_algo_bit           13413  2 ast,igb
i2c_core               40325  7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
wmi                    19070  0
dm_mirror              22135  0
dm_region_hash         20862  1 dm_mirror
dm_log                 18411  2 dm_region_hash,dm_mirror
dm_mod                104038  25 dm_log,dm_mirror

 

Artem_R_Intel1
Employee

Hello Dan,

You can run an MPI application in symmetric mode over TCP; OFED isn't required in this case (if you use the Intel MPI Library, just set the I_MPI_FABRICS=shm:tcp environment variable).
For better performance, however, it's recommended to use ibscif. This requires OFED, but an InfiniBand* device isn't needed. See the Intel® Manycore Platform Software Stack (Intel® MPSS) User's Guide (chapter "Installing OFED with Intel® MPSS Support (optional)"). After the installation you will need to start the mpss, openibd and ofed-mic services (see the instructions in the User's Guide). You can check the status of the scif device with the ibv_devices and ibv_devinfo utilities (these utilities may have some limitations in some OFED versions). 'lsmod' should show the corresponding scif modules. For this configuration you need to set the following Intel MPI Library variables:
I_MPI_FABRICS=shm:dapl
I_MPI_DAPL_PROVIDER=ofa-v2-scif0
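
As a rough sketch, the two configurations above could be switched from a launch script as follows. The helper name `impi_fabric_env` is hypothetical and not part of the Intel MPI Library; only the variable names and values come from the text above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Intel MPI): prints the export lines
# for one of the two fabric configurations described above.
#   tcp  -> symmetric mode over TCP, no OFED needed
#   scif -> DAPL over ibscif, needs OFED with Intel MPSS support
impi_fabric_env() {
    case "$1" in
        tcp)
            echo "export I_MPI_FABRICS=shm:tcp"
            ;;
        scif)
            echo "export I_MPI_FABRICS=shm:dapl"
            echo "export I_MPI_DAPL_PROVIDER=ofa-v2-scif0"
            ;;
        *)
            echo "usage: impi_fabric_env tcp|scif" >&2
            return 1
            ;;
    esac
}
```

A script would then run `eval "$(impi_fabric_env tcp)"` (or `scif`) before calling mpirun. Starting with the TCP variant is a cheap way to confirm that symmetric mode works at all before debugging the OFED stack.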
 

Holstad__Dan
Beginner

Artem,

I am still having issues. I was unable to compile the OFED drivers supplied with MPSS because the kernel headers are now split into uapi directories, so the compiler can't find them. I tried linking the uapi header files, but I still got errors such as "error: 'struct inet_sock' has no member named 'dport'", so I installed the standard OFED libraries instead. I was able to get a clean install, but I can't seem to get anything to run in symmetric mode.

 

I installed OFED-3.18-1-20150803-0846 with the following options:
./install.pl --with-xeon-phi --all --without-libfabric --without-libfabric-devel --without-fabtests --without-fabtests-debuginfo

It has to be built without libfabric because of this bug: http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2544

 

[root@infinity OFED-3.18-1-20150803-0846]# ibv_devices
    device                 node GUID
    ------              ----------------
    scif0               4c79bafffe5a0099

[root@infinity OFED-3.18-1-20150803-0846]# ibv_devinfo
hca_id:    scif0
    transport:            invalid transport (-1)
    fw_ver:                0.0.1
    node_guid:            4c79:baff:fe5a:0099
    sys_image_guid:            4c79:baff:fe5a:0099
    vendor_id:            0x8086
    vendor_part_id:            0
    hw_ver:                0x1
    phys_port_cnt:            1
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            1
            port_lid:        1000
            port_lmc:        0x00
            link_layer:        Unknown

[root@infinity-mic0 ~]# lsmod | grep -i scif
ibscif                 68368  0
ib_core                45501  9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif
pm_scif                 4518  0
micscif               283854  24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras
dma_module             32560  2 micscif,intel_micveth
ringbuffer              2265  2 micscif,michvc

service openibd start
  - starts fine
service ofed-mic start    
  - starts fine
service opensmd  start
  - Fails due to no local ports
    [root@infinity OFED-3.18-1-20150803-0846]# cat /var/log/opensm.log
    Aug 28 09:57:56 165541 [5ABD0740] 0x03 -> OpenSM 3.3.19
    OpenSM 3.3.19
    Aug 28 09:57:56 165598 [5ABD0740] 0x80 -> OpenSM 3.3.19
    No local ports detected!
    Aug 28 09:57:56 172566 [5ABD0740] 0x02 -> osm_vendor_init: 1000 pending umads specified

service mpxyd start
  - starts fine


$ cat helloworldtest.sh
#!/bin/bash
echo "loading environment variables"
source /opt/intel/composer_xe_2015.2.164/bin/compilervars.sh intel64
source /opt/intel/impi/5.0.3.048/bin64/mpivars.sh


echo "Running the jobs"
export I_MPI_DEVICE=rdssm
export I_MPI_MIC=1
export I_MPI_FABRICS_LIST=shm:dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto
# export I_MPI_FALLBACK_DEVICE=0
mpirun -n 6 -host mic0 /data/mpirun/dan/helloworld.mic : -n 6 -host mic1 /data/mpirun/dan/helloworld.mic
mpirun -n 6 -host infinity /data/mpirun/dan/helloworld.host

[ddholstad@infinity dan]$ ./helloworldtest.sh
loading environment variables
Running the jobs
infinity-mic0.stat.uiowa.edu:SCM:1351:c3510b40: 247 us(247 us):  open_hca: ibv_get_device_list() failed
[1] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
infinity-mic1.stat.uiowa.edu:SCM:1352:71e64b40: 250 us(250 us):  open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1355:713e4b40: 245 us(245 us):  open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1354:a1d67b40: 240 us(240 us):  open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1350:e9c2eb40: 354 us(354 us):  open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1355:14046b40: 237 us(237 us):  open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1352:8093b40: 625 us(625 us):  open_hca: ibv_get_device_list() failed
infinity-mic0.stat.uiowa.edu:SCM:1353:36a04b40: 611 us(611 us):  open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1356:2c656b40: 244 us(244 us):  open_hca: ibv_get_device_list() failed
[7] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
infinity-mic1.stat.uiowa.edu:SCM:1351:c8024b40: 566 us(566 us):  open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1353:518e6b40: 582 us(582 us):  open_hca: ibv_get_device_list() failed
infinity-mic1.stat.uiowa.edu:SCM:1354:ffdb40: 599 us(599 us):  open_hca: ibv_get_device_list() failed
[10] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[4] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[11] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[5] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[6] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[9] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[8] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
infinity.stat.uiowa.edu:SCM:8363:893edb40: 51 us(51 us):  open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8365:b4bbeb40: 49 us(49 us):  open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8364:fc131b40: 51 us(51 us):  open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8367:cd7aeb40: 69 us(69 us): infinity.stat.uiowa.edu:SCM:8368:7231db40: 59 us(59 us):  open_hca: ibv_get_device_list() failed
infinity.stat.uiowa.edu:SCM:8366:9e91db40: 70 us(70 us):  open_hca: ibv_get_device_list() failed
 open_hca: ibv_get_device_list() failed
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[2] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[1] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[3] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[4] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[5] MPI startup(): dapl fabric is not available and fallback fabric is not enabled

$ cat helloworld.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* gethostname() */
#include <mpi.h>

int main(int argc, char *argv[])
{
  int tid, nthreads;
  char *cpu_name;

  /* add in MPI startup routines */
  /* 1st: launch the MPI processes on each node */
  MPI_Init(&argc, &argv);

  /* 2nd: request a thread id, sometimes called a "rank", from
   * the MPI master process, which has rank or tid == 0 */
  MPI_Comm_rank(MPI_COMM_WORLD, &tid);

  /* 3rd: this is often useful, get the number of threads
   * or processes launched by MPI, this should be NCPUs-1 */
  MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

  /* on EVERY process, allocate zero-filled space for the machine name */
  cpu_name = (char *)calloc(80, sizeof(char));
  if (cpu_name == NULL)
    MPI_Abort(MPI_COMM_WORLD, 1);

  /* get the machine name of this particular host ... well
   * at least the first 79 characters of it; the last byte stays
   * zero so the string is always terminated */
  gethostname(cpu_name, 79);

  printf("hello MPI user: from process = %i on machine=%s, of NCPU=%i processes\n",
         tid, cpu_name, nthreads);
  free(cpu_name);
  MPI_Finalize();
  return 0;
}

Attempts to run the SCIF tutorial example code result in a failure to connect:

Host:
./scif_connect_host -l 2048 -n mic0 -r 2048 -s 1024 -b block

scif_bind to port 2048 success
cannot bind multiple epd to a port : error 22
cannot bind epd to multiple ports : error 22
connection to node 0 failed : trial 20
connection to node 0 failed : trial 19
connection to node 0 failed : trial 18
connection to node 0 failed : trial 17
connection to node 0 failed : trial 16
connection to node 0 failed : trial 15
connection to node 0 failed : trial 14
connection to node 0 failed : trial 13
connection to node 0 failed : trial 12
connection to node 0 failed : trial 11
connection to node 0 failed : trial 10
connection to node 0 failed : trial 9
connection to node 0 failed : trial 8
connection to node 0 failed : trial 7
connection to node 0 failed : trial 6
connection to node 0 failed : trial 5
connection to node 0 failed : trial 4
connection to node 0 failed : trial 3
connection to node 0 failed : trial 2
connection to node 0 failed : trial 1
scif_connect failed with error 111

 

Mic0:

  ./scif_accept_mic -l 2048 -s 1024 -b block

scif_bind to port 2048 success

 

 

Any idea what I'm missing?

 

Artem_R_Intel1
Employee

Hi Dan,

Error messages like this one:

infinity-mic0.stat.uiowa.edu:SCM:1351:c3510b40: 247 us(247 us):  open_hca: ibv_get_device_list() failed

usually mean that the ofed-mic service isn't running. Please double-check that the service is running and/or try restarting it.
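
A minimal sketch of how that check could be scripted. The helper name `ofed_mic_failed` is hypothetical, and the line format it expects (a node name followed by `[  OK  ]` or `[FAILED]`) is an assumption based on the `service ofed-mic status` output quoted elsewhere in this thread; other MPSS/OFED versions may print something different.

```shell
#!/bin/bash
# Hypothetical helper: reads "service ofed-mic status" output on stdin
# and prints the first field (host, mic0, mic1, ...) of every line that
# is marked [FAILED]. Prints nothing when no entry has failed.
ofed_mic_failed() {
    awk '/\[FAILED\]/ { print $1 }'
}
```

Typical use would be `service ofed-mic status | ofed_mic_failed`, run once before and once after the MPI job, to see whether the stack on a card goes down while the job is starting.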

Regarding the Intel MPI environment variables, the following should be enough for experiments:
export I_MPI_MIC=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0

Also, please make sure that the specified DAPL provider "ofa-v2-scif0" is present in /etc/dat.conf.
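
One possible way to script the dat.conf check is sketched below. The helper name `dat_has_provider` is hypothetical; it only assumes that the provider name is the first whitespace-separated field of its line.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) if the given DAPL
# provider name is the first field of some line in a dat.conf-style
# file. Commented-out entries do not match, because their first field
# starts with '#'.
#   $1 = provider name
#   $2 = file to check (defaults to /etc/dat.conf)
dat_has_provider() {
    awk -v p="$1" '$1 == p { found = 1 } END { exit !found }' "${2:-/etc/dat.conf}"
}
```

For example, `dat_has_provider ofa-v2-scif0 || echo "provider missing"`. Matching on the first field rather than using a plain grep avoids false hits on similarly named entries such as ofa-v2-scif0-u.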

Holstad__Dan
Beginner

The ofed-mic service starts OK, but something goes wrong when I kick off the test script: the ssh sessions to the MIC cards drop, and the ibmodules status command on the cards shows no output. uptime on the MIC shows that the card rebooted as a result of running the code.

Before running the script:

/etc/init.d/ibmodules status

# /etc/init.d/ibmodules status
ibscif 68368 0 - Live 0xffffffffa00d2000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ib_uverbs 31639 1 rdma_ucm, Live 0xffffffffa00ec000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ibp_mlx5 11376 0 - Live 0xffffffffa00ff000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
ibp_mlx4 28939 0 - Live 0xffffffffa0108000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
ibp_sa_client 16807 2 rdma_ucm,rdma_cm, Live 0xffffffffa0117000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ibp_cm_client 20523 1 rdma_cm, Live 0xffffffffa0122000
ibp_client 24725 4 ibp_cm_client,ibp_sa_client,ibp_mlx4,ibp_mlx5, Live 0xffffffffa00f6000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
ib_qib 22867 0 - Live 0xffffffffa012f000
micscif 283854 24 ib_qib,ibp_cm_client,ibp_sa_client,ibp_client,ibscif,pm_scif,mpssboot,micras, Live 0xffffffffa0027000
rdma_ucm 10243 0 - Live 0xffffffffa0153000
rdma_cm 23962 1 rdma_ucm, Live 0xffffffffa0146000
ibp_sa_client 16807 2 rdma_ucm,rdma_cm, Live 0xffffffffa0117000
ib_uverbs 31639 1 rdma_ucm, Live 0xffffffffa00ec000
ib_core 45501 9 rdma_ucm,rdma_cm,iw_cm,ibp_sa_client,ibp_mlx4,ibp_mlx5,ibp_client,ib_uverbs,ibscif, Live 0xffffffffa00bc000

After running the script:

/etc/init.d/ibmodules status

[root@infinity-mic1 ~]#

<returns nothing>

 

[root@infinity ~]# cat /etc/dat.conf | grep scif0
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""

[root@infinity ~]# service ofed-mic status
Status of OFED Stack:
host                                                       [  OK  ]
mic0                                                       [  OK  ]
mic1                                                       [  OK  ]

When I kick off a script with just the environment variables you listed, it looks like ofed-mic fails on both cards:
[root@infinity log]# service ofed-mic status
Status of OFED Stack:
host                                                       [  OK  ]
mic0                                                       [FAILED]
mic1                                                       [FAILED]

$ ./helloworldtest.sh
loading environment variables
Running the jobs
[mpiexec@infinity.stat.uiowa.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:773): connection to proxy 0 at host mic0 failed
[mpiexec@infinity.stat.uiowa.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@infinity.stat.uiowa.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@infinity.stat.uiowa.edu] main (../../ui/mpich/mpiexec.c:1059): process manager error waiting for completion
hello MPI user: from process = 0 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 1 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 2 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 3 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 4 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes
hello MPI user: from process = 5 on machine=infinity.stat.uiowa.edu, of NCPU=6 processes

Holstad__Dan
Beginner

Messages from /var/log/messages on the host indicate that the MIC card OS crashed:

Aug 31 10:11:20 infinity.stat.uiowa.edu ntpd[5280]: Listen normally on 9 mic0:ib 192.0.2.100 UDP 123
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1445 node 1
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1445 node 2
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1454 node 2 ready for crash dump!
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: mic1: Transition from state online to lost
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: micscif_handle_lostnode 1454 node 1 ready for crash dump!
Aug 31 10:17:26 infinity.stat.uiowa.edu kernel: mic0: Transition from state online to lost
Aug 31 10:17:30 infinity.stat.uiowa.edu kernel: micvnet_execute_stop: timeout waiting for link down message response
Aug 31 10:17:30 infinity.stat.uiowa.edu kernel: micvnet_execute_stop: timeout waiting for link down message response
Aug 31 10:17:30 infinity.stat.uiowa.edu kernel: br0: port 1(mic1) entered disabled state
Aug 31 10:17:31 infinity.stat.uiowa.edu kernel: br0: port 2(mic0) entered disabled state
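
A small sketch for pulling the affected cards out of a log like this one. The helper name `mic_lost_cards` is hypothetical, and the message text it matches is taken from the excerpt above, so it may not match what other MPSS versions log.

```shell
#!/bin/bash
# Hypothetical helper: reads host syslog lines on stdin and prints the
# name of each coprocessor that the kernel reported as going from
# state "online" to "lost" (one name per matching line, in log order).
mic_lost_cards() {
    sed -n 's/.*kernel: \(mic[0-9][0-9]*\): Transition from state online to lost.*/\1/p'
}
```

Running `mic_lost_cards < /var/log/messages` right after a failed job shows whether one card or all of them went down. In the excerpt above both cards are lost within the same second.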

 

Frances_R_Intel
Employee

Can you fall back to OFED-3.12-1? I could be wrong (I'm sure Artem can correct me if I am), but I do not believe PSM was included in the 3.12-1 version. If you are using OFED only for communication between the coprocessor and the host (in other words, if you do not have an InfiniBand adapter in your system), it shouldn't matter which version you use, and the problem won't exist in that older version. If you want to keep using the newer version, I think, based on the bug report you referenced, that you do need to install libfabric: first install without it, then back up, restore the names of the rpm files that were renamed, and install just libfabric. Of course, that could just be my creative reading of the bug report.
