David_M_17
Beginner
176 Views

MPI generates numerous SCIF/scif_connect failure warnings

 

I am running a heterogeneous job across the host and a Xeon Phi coprocessor. If I run the MPI job on just the host or just the card, everything runs smoothly. When I split the job between the host and the Xeon Phi card, the MPI run completes successfully, but it generates numerous warning messages and is quite noisy. If the messages were meaningless, I would expect them not to be printed. They are not fatal: the MPI messages all complete, and so does the job. So what are the messages warning me to do to improve the MPI environment? I am running Intel MPI 5. The messages look like this (my system is named delphi, the MIC card is named mic0):

delphi-mic0:SCM:3a20:70677b80: 228 us(228 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 231 us(231 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 220 us(220 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 232 us(232 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:CMA:3a20:70677b80: 570 us(570 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
delphi-mic0:CMA:3a21:1ba1eb80: 808 us(808 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
delphi-mic0:CMA:3a20:70677b80: 554 us(554 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
delphi-mic0:CMA:3a21:1ba1eb80: 583 us(583 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
delphi-mic0:SCM:3a20:70677b80: 221 us(221 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 475 us(475 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 219 us(219 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 459 us(459 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 221 us(221 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 430 us(430 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 222 us(222 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 403 us(403 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 218 us(218 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 216 us(216 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:CMA:3a20:70677b80: 559 us(559 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
delphi-mic0:CMA:3a21:1ba1eb80: 729 us(729 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
delphi-mic0:UCM:3a20:70677b80: 207 us(207 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 193 us(193 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 200 us(200 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 200 us(200 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 218 us(218 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 303 us(303 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 198 us(198 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 232 us(232 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:CMA:3a21:1ba1eb80: 571 us(571 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
delphi-mic0:CMA:3a20:70677b80: 681 us(681 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
delphi-mic0:CMA:3a20:70677b80: 598 us(598 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
delphi-mic0:CMA:3a21:1ba1eb80: 818 us(818 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
delphi-mic0:SCM:3a20:70677b80: 224 us(224 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 223 us(223 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 226 us(226 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 231 us(231 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 225 us(225 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 229 us(229 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 195 us(195 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 200 us(200 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:CMA:3a20:70677b80: 564 us(564 us):  open_hca: getaddr_netdev ERROR:Cannot assign requested address. Is mic0:ib configured?
delphi-mic0:CMA:3a21:1ba1eb80: 621 us(621 us):  open_hca: getaddr_netdev ERROR:Cannot assign requested address. Is mic0:ib configured?
delphi-mic0:SCM:3a20:70677b80: 227 us(227 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 221 us(221 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 261 us(261 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 249 us(249 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 256 us(256 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 262 us(262 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 306 us(306 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 327 us(327 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 211 us(211 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 213 us(213 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 199 us(199 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 193 us(193 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 226 us(226 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 243 us(243 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 245 us(245 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 265 us(265 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 219 us(219 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 264 us(264 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 222 us(222 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 258 us(258 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 201 us(201 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 236 us(236 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 239 us(239 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 257 us(257 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 221 us(221 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a20:70677b80: 211 us(211 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 238 us(238 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:UCM:3a21:1ba1eb80: 296 us(296 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 225 us(225 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 227 us(227 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 223 us(223 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 279 us(279 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a20:70677b80: 232 us(232 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:SCM:3a21:1ba1eb80: 273 us(273 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:MCM:3a20:70677b80: 638 us(638 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a20:70677b80: 731 us(93 us):  open_hca: SCIF init ERR on qib0
delphi-mic0:SCM:3a21:1ba1eb80: 267 us(267 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:MCM:3a20:70677b80: 667 us(667 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a20:70677b80: 755 us(88 us):  open_hca: SCIF init ERR on qib0
delphi-mic0:SCM:3a21:1ba1eb80: 231 us(231 us):  open_hca: ibv_get_device_list() failed
delphi-mic0:MCM:3a20:70677b80: 845 us(845 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a20:70677b80: 932 us(87 us):  open_hca: SCIF init ERR on qib1
delphi-mic0:MCM:3a21:1ba1eb80: 829 us(829 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a21:1ba1eb80: 955 us(126 us):  open_hca: SCIF init ERR on qib0
delphi-mic0:MCM:3a20:70677b80: 555 us(555 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a20:70677b80: 632 us(77 us):  open_hca: SCIF init ERR on qib1
delphi-mic0:MCM:3a21:1ba1eb80: 566 us(566 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a21:1ba1eb80: 672 us(106 us):  open_hca: SCIF init ERR on qib0
delphi-mic0:MCM:3a21:1ba1eb80: 862 us(862 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a21:1ba1eb80: 963 us(101 us):  open_hca: SCIF init ERR on qib1
delphi-mic0:MCM:3a21:1ba1eb80: 842 us(842 us): scif_connect() to port 68, failed with error Connection refused
delphi-mic0:MCM:3a21:1ba1eb80: 950 us(108 us):  open_hca: SCIF init ERR on qib1
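
Output like the above is long but highly repetitive, so it can help to collapse it into its distinct messages before trying to interpret it. The following is a minimal sketch, assuming every warning follows the host:TAG:pid:tid: time: message layout seen in the log; summarize_mpi_warnings is a hypothetical helper name, not part of Intel MPI.

```shell
# Hypothetical helper: count distinct (tag, message) pairs in the warnings.
# Reads the files given as arguments, or standard input if none are given.
summarize_mpi_warnings() {
  awk '
    # Only consider lines shaped like: host:TAG:pid:tid: N us(N us): message
    match($0, /^[^:]+:[A-Z]+:[0-9a-f]+:[0-9a-f]+: *[0-9]+ us\([0-9]+ us\): */) {
      split($0, head, ":")             # head[2] is the tag (SCM, CMA, UCM, MCM)
      msg = substr($0, RLENGTH + 1)    # everything after the timing prefix
      count[head[2] " " msg]++
    }
    END {
      for (key in count) printf "%d %s\n", count[key], key
    }
  ' "$@" | sort -k1,1nr -k2            # most frequent message first
}
```

Saving the job's output to a file (the name mpi_run.log is only an example) and running `summarize_mpi_warnings mpi_run.log` prints one line per distinct message with its count, which makes it easier to see that only a few underlying failures are being reported.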
 
6 Replies
James_T_Intel
Moderator

David,

Is the ofed-mic service running on Delphi-mic0?

James.

David_M_17
Beginner

Hello James, I am not sure. "service --status-all" doesn't work on mic0 because service is not in /sbin in the mic0 Linux distribution. It may be that the ofed-mic service has not been started. How should I check, and if it isn't running, how should I start it? Thank you. -David

James_T_Intel
Moderator

Use one of the following to determine the status:

ibv_devinfo
service ofed-mic status

If it isn't running, you'll need to start multiple services in order:

service openibd start
service opensmd start
service ofed-mic start
service mpxyd start

The last is only necessary if you are using a Mellanox* InfiniBand* adapter.
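
The check-then-start sequence above can be wrapped in a small script. This is only a sketch, intended to be run as root on the host. It assumes each service supports a "status" action that exits non-zero when the service is stopped (only "service ofed-mic status" is confirmed in this thread). start_ofed_stack, SERVICE_CMD, and WITH_MPXYD are hypothetical names introduced for illustration.

```shell
# Hypothetical wrapper around the service commands listed above.
# SERVICE_CMD lets the loop be dry-run (e.g. SERVICE_CMD=true);
# it defaults to the real `service` command.
# Set WITH_MPXYD=1 to also start mpxyd (Mellanox InfiniBand adapters only).
start_ofed_stack() {
  cmd="${SERVICE_CMD:-service}"
  # Order matters: openibd, then opensmd, then ofed-mic, then optionally mpxyd.
  for svc in openibd opensmd ofed-mic ${WITH_MPXYD:+mpxyd}; do
    # Assumption: "status" returns 0 only when the service is running.
    if "$cmd" "$svc" status >/dev/null 2>&1; then
      echo "$svc: already running"
    else
      echo "$svc: starting"
      "$cmd" "$svc" start || { echo "$svc: failed to start" >&2; return 1; }
    fi
  done
}
```

Calling `start_ofed_stack` walks the services in order and stops at the first one that fails to start, since the later services depend on the earlier ones. With a Mellanox adapter, `WITH_MPXYD=1 start_ofed_stack` includes the last step.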

Artem_R_Intel1
Employee

Hi David, James,

A small addition to James's recommendation: the service status should be checked on the HOST side.

BTW, you can find a similar example in the Intel® MPI Library for Linux* OS Troubleshooting Guide.

David_M_17
Beginner

Thanks all. I wanted to let you know I read the comments. This requires root access, so I passed the comments along to the system administrator. He reported that OFED had not been installed, and it took some time for that task to rise to the top of his queue. It rose. Now he reports that he was not successful in installing any version of OFED on Delphi. The most insightful error he received was "Kernel configuration is invalid". In his words: "Unfortunately, the recommended fix for that error also failed, so I am pretty much stuck." I know he recently upgraded MPSS, so that is a recent version. What forum should OFED problems be posted in? I will direct him to it. Thank you for your patience.

James_T_Intel
Moderator

I'd start with the OpenFabrics Software User Community Portal.  That is at https://www.openfabrics.org/index.php/ofs-user-community.html.
