Hi there,
A user of ours is building a pre-release version of NAMD that includes Phi offloading support, but when we try to run it, it claims it cannot find the Phi cards. I've also replicated the failure with xhpl_offload_intel64.
Reason: FATAL ERROR: MIC error on Pe 0 (barcoo062 device 0): No MIC devices found.
Running with OFFLOAD_REPORT=2 reveals the following:
[SOURCE][0x9377bc80][1834028774450][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2055063906528][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2276654460069][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2497819045011][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
Running it under strace shows:
5672 open("/dev/mic/scif", O_RDWR) = 3
5672 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
5672 fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
5672 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
5672 ioctl(3, 0xc0087301, 0x7fff1c780f38) = 0
[...]
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({0, 10000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({0, 20000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({0, 40000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({0, 80000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({0, 160000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({0, 320000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({0, 640000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({1, 280000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({2, 560000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({5, 120000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({10, 240000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({20, 480000000}, NULL) = 0
5672 ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672 nanosleep({40, 960000000}, NULL) = 0
At that point it writes out one of the Engine_connect errors shown above and starts over.
xhpl_offload_intel64 used to work under a previous install, so I'd be curious whether anyone knows what sort of things may have changed to cause this failure.
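For anyone reading the trace: the nanosleep intervals double after every refused ioctl, from 10 ms up to 40.96 s, which looks like an exponential backoff around the connection attempt. Here is a minimal Python sketch that reproduces the schedule seen in the strace output. The function name and its parameters are hypothetical and are not taken from the COI source; only the delay values come from the trace.

```python
def backoff_delays(initial_ms=10, attempts=13, factor=2):
    """Return the sleep intervals (in ms) of a doubling backoff.

    The defaults mirror the strace output: 13 sleeps starting at 10 ms.
    These names are illustrative, not the real COI implementation.
    """
    return [initial_ms * factor ** i for i in range(attempts)]


delays = backoff_delays()
print(delays)       # sleep interval before each retry, in milliseconds
print(sum(delays))  # total time spent sleeping in one cycle, in milliseconds
```

Under these assumptions one full cycle sleeps for a little under 82 seconds before an Engine_connect error is logged.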
All the best,
Chris
Solved: the xCAT cluster management software was copying the passwd file from our management node onto the Xeon Phi cards, so no "micuser" user was present on the cards, which caused the coi_daemon to (quite legitimately) refuse to start. Once we had figured out what was needed, creating that user on the management node fixed it.
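One way to check for this condition is to look for the account in the passwd file that ends up on the card. A small Python sketch, assuming the standard colon-separated passwd format. The helper name and the sample entries (UIDs, shells, home directories) are made up for illustration; only the user name "micuser" comes from this thread.

```python
def has_user(passwd_text, name):
    """Return True if passwd-format text contains an entry for `name`."""
    for line in passwd_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # The login name is the first colon-separated field.
        if line.split(":", 1)[0] == name:
            return True
    return False


# Made-up sample contents, standing in for the file copied to the card.
without_micuser = "root:x:0:0:root:/root:/bin/sh\n"
with_micuser = without_micuser + "micuser:x:400:400::/home/micuser:/bin/false\n"

print(has_user(without_micuser, "micuser"))
print(has_user(with_micuser, "micuser"))
```

Running the same check against the passwd file on the management node before it is pushed out should catch the problem earlier.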