
SCIF connection refused

oplehto

For some reason the SCIF interface on my compute nodes is refusing connections. Any ideas on what's wrong or where to start investigating?

The node has a Mellanox ConnectX-3 HCA and the latest MPSS (Gold Update 2), and everything else is set up "by the book". All the IB services and modules load cleanly and seem to work, and I can ssh into the MIC and run natively.

However, if I try to run an offload (LEO or OpenCL) application, it hangs. Running it under strace reveals the following:

[plain]

mmap(NULL, 10489856, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f737396e000
mprotect(0x7f737396e000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f737436dfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f737436e9d0, tls=0x7f737436e700, child_tidptr=0x7f737436e9d0) = 26801
open("/dev/mic/scif", O_RDWR)           = 5
fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
ioctl(5, 0xc0087303, 0x7fffa02d2710)    = 0
futex(0x7f737436e9d0, FUTEX_WAIT, 26801, NULL) = 0
close(4)                                = 0
ioctl(3, 0xc0087303, 0x7fffa02d27d0)    = -1 ECONNREFUSED (Connection refused)
nanosleep({0, 10000000}, NULL)          = 0
ioctl(3, 0xc0087303, 0x7fffa02d27d0)    = -1 ECONNREFUSED (Connection refused)
nanosleep({0, 20000000}, NULL)          = 0
ioctl(3, 0xc0087303, 0x7fffa02d27d0)    = -1 ECONNREFUSED (Connection refused)
nanosleep({0, 40000000}, NULL)          = 0
[/plain]
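For anyone reading the trace: the failing request number 0xc0087303 can be split into its fields using the generic Linux ioctl layout (as used on x86). Below is a minimal sketch of that arithmetic. The function name `decode_ioctl` is made up for illustration. The result (read/write, 8-byte argument, type 's', command 3) would be consistent with a SCIF connect request, but that mapping is an assumption based on the retry pattern and is not confirmed anywhere in this thread.

```python
def decode_ioctl(request):
    """Split a Linux ioctl request number into its generic fields.

    Assumes the common layout used on x86: bits 0-7 command number,
    bits 8-15 type character, bits 16-29 argument size, bits 30-31 direction.
    """
    directions = {0: "none", 1: "write", 2: "read", 3: "read|write"}
    return {
        "direction": directions[(request >> 30) & 0x3],
        "size": (request >> 16) & 0x3FFF,
        "type": chr((request >> 8) & 0xFF),
        "nr": request & 0xFF,
    }


if __name__ == "__main__":
    # The request that keeps returning ECONNREFUSED in the trace above.
    print(decode_ioctl(0xC0087303))
    # prints {'direction': 'read|write', 'size': 8, 'type': 's', 'nr': 3}
```

The sleeps between the failed calls (10 ms, 20 ms, 40 ms) double each time, so the caller appears to be retrying the same request with a growing delay rather than giving up, which would explain why the application hangs instead of exiting with an error.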

Olli-Pekka_L_

Pinpointed the problem: we use a slightly customized system for user management on the MICs, and because of that the 'micuser' user was missing during mpssd and ofed-mic initialization. I have now added the user and offloading seems to work again. Suggestion: it would be nice to have a sanity check for this.
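A minimal sketch of what such a sanity check could look like, assuming it is enough to confirm that 'micuser' has an entry in a passwd-format file before the services start. The helper names `missing_users` and `exists_in_local_database` are hypothetical and not part of MPSS, and which passwd file to inspect depends on how the card filesystem is built at your site, so the text is passed in rather than read from a fixed path.

```python
import pwd


def missing_users(passwd_text, required=("micuser",)):
    """Return the required account names that have no entry in passwd-format text."""
    present = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        present.add(line.split(":", 1)[0])  # first field is the login name
    return [name for name in required if name not in present]


def exists_in_local_database(name):
    """Check the account database of the machine this script runs on."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


if __name__ == "__main__":
    # Illustrative passwd content only; not taken from a real MIC image.
    sample = "root:x:0:0:root:/root:/bin/sh\n"
    for name in missing_users(sample):
        print("WARNING: user '%s' is missing" % name)
```

Run against the passwd file that ends up on the card, a check like this would have flagged the missing account before the first offload attempt.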

Olli-Pekka
