
Error: Engine_connect for an offloading code - SCIF problems

Chris_Samuel

Hi there,

A user of ours is building a pre-release version of NAMD that includes Xeon Phi offload support, but when we try to run it, it claims it cannot find the Phi cards. I've also replicated the failure with xhpl_offload_intel64.

Reason: FATAL ERROR: MIC error on Pe 0 (barcoo062 device 0): No MIC devices found.

Running with OFFLOAD_REPORT=2 reveals the following:

[SOURCE][0x9377bc80][1834028774450][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2055063906528][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2276654460069][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2497819045011][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect

Running it under strace shows:

5672  open("/dev/mic/scif", O_RDWR)     = 3
5672  fcntl(3, F_SETFD, FD_CLOEXEC)     = 0
5672  fcntl(3, F_GETFD)                 = 0x1 (flags FD_CLOEXEC)
5672  fcntl(3, F_SETFD, FD_CLOEXEC)     = 0
5672  ioctl(3, 0xc0087301, 0x7fff1c780f38) = 0
[...]
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 10000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 20000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 40000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 80000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 160000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 320000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 640000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({1, 280000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({2, 560000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({5, 120000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({10, 240000000}, NULL)  = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({20, 480000000}, NULL)  = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({40, 960000000}, NULL)  = 0

At that point it writes out one of the Engine_connect errors shown earlier and starts the retry cycle again.
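For anyone reading the trace, the retry pattern can be checked with a few lines of Python. This is only a sketch built from the nanosleep() values shown in the strace output; the helper names are my own and nothing here comes from the COI sources.

```python
# Sleep intervals copied from the nanosleep() calls in the strace output,
# as (seconds, nanoseconds) pairs.
SLEEPS = [
    (0, 10_000_000), (0, 20_000_000), (0, 40_000_000), (0, 80_000_000),
    (0, 160_000_000), (0, 320_000_000), (0, 640_000_000),
    (1, 280_000_000), (2, 560_000_000), (5, 120_000_000),
    (10, 240_000_000), (20, 480_000_000), (40, 960_000_000),
]


def to_ns(timespec):
    """Convert a (seconds, nanoseconds) pair to integer nanoseconds."""
    sec, nsec = timespec
    return sec * 1_000_000_000 + nsec


def is_doubling(delays_ns):
    """Return True if every delay is exactly twice the previous one."""
    return all(b == 2 * a for a, b in zip(delays_ns, delays_ns[1:]))


delays_ns = [to_ns(ts) for ts in SLEEPS]
total_ns = sum(delays_ns)

print("doubling backoff:", is_doubling(delays_ns))
print("total sleep per cycle (s):", total_ns / 1_000_000_000)
```

So the client starts at 10 ms, doubles the wait after each refused connection, and spends about 82 seconds sleeping before each error line is logged.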

The same xhpl_offload_intel64 binary used to work under a previous install, so I'd be curious whether anyone knows what might have changed to cause this failure.
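As an aid to narrowing this down, the two ioctl request numbers in the strace output can be decoded using the generic Linux ioctl bit layout (direction, argument size, type character, command number), which is what x86_64 uses. The sketch below does only that decoding; the function name is my own. Mapping command numbers 1 and 3 to specific SCIF operations would need checking against the scif_ioctl.h header shipped with the installed MPSS, so I have not asserted that here.

```python
# Generic Linux ioctl request layout (asm-generic/ioctl.h, as used on x86_64):
#   bits 30-31: direction, bits 16-29: argument size,
#   bits  8-15: type character, bits 0-7: command number.
DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read|write"}


def decode_ioctl(request):
    """Split an ioctl request number into its four encoded fields."""
    return {
        "dir": DIRECTIONS[(request >> 30) & 0x3],
        "size": (request >> 16) & 0x3FFF,
        "type": chr((request >> 8) & 0xFF),
        "nr": request & 0xFF,
    }


# The two request numbers that appear in the strace output.
for req in (0xC0087301, 0xC0087303):
    print(hex(req), decode_ioctl(req))
```

Both requests carry an 8-byte argument and use type 's'. The first, command 1, succeeds; the second, command 3, is the one that keeps returning ECONNREFUSED. That fits a connect-style call where nothing on the card side is accepting connections.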

All the best,
Chris

Chris_Samuel

Solved: the xCAT cluster management software was copying the passwd file from our management node onto the Xeon Phi cards, so there was no "micuser" account on the cards, which caused coi_daemon to (quite legitimately) refuse to start. Once we had worked out what was needed, creating that user on the management node fixed it.
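A quick way to catch this in future is to check a passwd-format file for the account before anything is pushed to the cards. The following is a minimal sketch: the function name and the sample entries are made up for illustration, and how you get hold of the card's passwd contents (for example over ssh) is left open.

```python
def has_user(passwd_text, username):
    """Return True if a passwd-format text contains an entry for username."""
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The login name is the first colon-separated field.
        if line.split(":", 1)[0] == username:
            return True
    return False


# Hypothetical sample contents, not taken from a real system.
SAMPLE_WITHOUT = "root:x:0:0:root:/root:/bin/sh\nsshd:x:74:74::/var/empty:/bin/false\n"
SAMPLE_WITH = SAMPLE_WITHOUT + "micuser:x:400:400::/home/micuser:/bin/false\n"

for name, text in (("without", SAMPLE_WITHOUT), ("with", SAMPLE_WITH)):
    print(name, "micuser present:", has_user(text, "micuser"))
```

Running a check like this against the management node's passwd file before xCAT syncs it would have flagged the missing account.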
