
Problem in offload to Intel MIC with Intel 15 Compiler

aketh_t_
Beginner

Hi,

I recently updated my Intel compiler from version 14 to version 15 (trial version).

I ran a cluster job on 8 nodes.

The program has an offload section that prints "hi this is the offload section" (the message is printed multiple times per node).

It seems that some nodes printed the offload message while others threw an error.

Here is the output/error I got.

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image

offload error: cannot load library to the device 0 (error code 24)
/storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: symbol lookup error: /storage/home/aketh/cesm/cases/B_intel15/exe/cesm.exe: undefined symbol: __offload_unregister_image
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
 hi this is the offload section
[122:node1] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
[98:node2] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 122
internal ABORT - process 98
[104:node2] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 104
[120:node1] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 120
[121:node1] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 121
[113:node1] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 113
[100:node2] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 100
[108:node2] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 108
[111:node2] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 111
[123:node1] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 123
 hi this is the offload section
[126:node1] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 126
 hi this is the offload section
[106:node2] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 106
 hi this is the offload section
[124:node1] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 124
[101:node2] unexpected disconnect completion event from [2:node8]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 101
 hi this is the offload section
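
For anyone triaging a log like the one above, a small tally can make the pattern easier to see. The sketch below is not from the original thread: it assumes the job output was saved to a file, and the function name `summarize_offload_log` is made up for illustration. It counts the offload load failures and the successful offload prints, and groups the DAPL disconnect messages by the peer they name.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Intel tool): summarize a saved job log.
summarize_offload_log() {
    log="$1"
    # Count lines reporting that the offload library could not be loaded.
    printf 'offload load failures: %s\n' \
        "$(grep -c 'offload error: cannot load library' "$log")"
    # Count lines printed by offload sections that did run.
    printf 'successful offload prints: %s\n' \
        "$(grep -c 'hi this is the offload section' "$log")"
    # Extract the "rank:node" peer named in each disconnect message and tally it.
    printf 'disconnect sources:\n'
    sed -n 's/.*unexpected disconnect completion event from \[\([0-9]*:[^]]*\)\].*/\1/p' "$log" \
        | sort | uniq -c | awk '{ print "  " $2 " x" $1 }'
}
```

Calling `summarize_offload_log job.out` (with `job.out` as a placeholder file name) on the output above would list every disconnect under `2:node8`, which hints that the rank on node8 went away first. Whether that rank is the one that hit the symbol lookup error is not something the log alone confirms.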

Loc_N_Intel
Employee

Hi Aketh,

I am trying to figure out the issue you are seeing, but I need more information:

- Did your application work with Intel 14 before you upgraded to Intel 15?

- Is your application an MPI program? If so, which MPI version are you using?

- Which MPSS version are you using?

- Which OS are you using?

Thanks 

aketh_t_
Beginner

OS: Linux

MPI: 5.0

Compiler: upgraded from 14 to 15. Yes, the application worked well with 14.

MPSS version: 3.2.1

 

Loc_N_Intel
Employee

Hi Aketh,

I notice that the MPSS version you are using is quite old; you may want to upgrade to a more recent version (e.g., MPSS 3.4). Could you be more specific about the Linux distribution (e.g., RHEL xxx)? What happens when you run the "miccheck" utility from the host?

aketh_t_
Beginner

I am using CentOS. Here is the miccheck output:

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
Executing default tests for device: 1
  Test 7 (mic1): Check device is in online state and its postcode is FF ... pass
  Test 8 (mic1): Check ras daemon is available in device ... pass
  Test 9 (mic1): Check running flash version is correct ... pass
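
Since the job spans 8 nodes and only some of them fail, it may be worth running miccheck on every node rather than on a single host. Below is a minimal sketch, assuming the output format shown above (one `Test N ... result` line per check). The helper name `miccheck_failures` is hypothetical; it only filters text read from standard input and does not call miccheck itself.

```shell
#!/bin/sh
# Hypothetical filter: read miccheck output on stdin, print every "Test N"
# line whose final word is not "pass", and return non-zero if any was found.
miccheck_failures() {
    awk '
        /^[[:space:]]*Test [0-9]+/ && $NF != "pass" { print; bad = 1 }
        END { exit bad }
    '
}
```

For example, `ssh node8 miccheck | miccheck_failures` would print nothing and return 0 if every test on that node passed. The node name is taken from the log earlier in the thread, and passwordless ssh to the node is an assumption.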

 

Loc_N_Intel
Employee

Could you please set the environment variable I_MPI_DEBUG

# export  I_MPI_DEBUG=5

and run your program again? This will display more debug information. It would also help to see the full command line that launches your application.
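
One way to capture both the debug output and the exact command line in a single file is sketched below. The wrapper name `run_with_mpi_debug` and the log file name are made up for illustration; only the `I_MPI_DEBUG=5` setting comes from the suggestion above.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with I_MPI_DEBUG=5 and log everything.
# Usage: run_with_mpi_debug LOGFILE COMMAND [ARGS...]
run_with_mpi_debug() {
    logfile="$1"
    shift
    # Record the full command line first so it can be posted with the log.
    printf 'command: %s\n' "$*" > "$logfile"
    # Set the variable for this one command only; capture stdout and stderr.
    I_MPI_DEBUG=5 "$@" >> "$logfile" 2>&1
}
```

A call might look like `run_with_mpi_debug mpi_debug.log mpirun -n 4 ./cesm.exe`, where the launch command is a placeholder for whatever command line the job normally uses.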
