Intel® MPI Library

MPI startup(): dapl fabric is not available and fallback fabric is not enabled

Vijay_Amirtharaj
Beginner
6,622 Views

Hi,

I am getting the following error message. mpdboot has been started on the nodes, and this error appears while running the Intel MPI benchmark test:

[30] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[17] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[16] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[22] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[25] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[26] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[28] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[29] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[46] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[21] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[20] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[19] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[23] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[18] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[41] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[40] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[45] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[44] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[42] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[43] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[35] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[47] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[73] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[76] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[71] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[70] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[77] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[78] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[79] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[13] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[15] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[9] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[4] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[5] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[14] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[1] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[11] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[12] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[6] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[8] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[10] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[3] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[7] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[2] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[143] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[136] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[139] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[140] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[138] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[141] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[142] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[137] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
rank 143 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 143: return code 254
rank 142 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 142: return code 254
rank 141 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 141: return code 254
rank 140 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 140: return code 254
rank 139 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 139: return code 254
rank 138 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 138: return code 254
rank 79 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 79: return code 254
rank 76 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 76: return code 254
rank 73 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 73: killed by signal 9
rank 46 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 46: return code 254
rank 45 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 45: return code 254
rank 44 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 44: return code 254
rank 43 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 43: return code 254
rank 42 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 42: return code 254
rank 41 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 41: return code 254
rank 30 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 30: return code 254
rank 25 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 25: return code 254
rank 22 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 22: killed by signal 9
rank 21 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 21: return code 254
rank 17 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 17: killed by signal 9
rank 16 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 16: return code 254
rank 15 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 15: killed by signal 9
rank 13 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 13: return code 254
rank 12 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 12: return code 254
rank 10 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 10: return code 254
rank 9 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 9: return code 254
rank 8 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 8: return code 254
rank 6 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 6: return code 254
rank 4 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 4: return code 254
rank 3 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 3: return code 254
rank 0 in job 1  taavare.tuecms.com_44209   caused collective abort of all ranks
  exit status of rank 0: return code 254

1 Solution
James_T_Intel
Moderator

This problem was resolved by setting permissions on the InfiniBand* devices on all of the nodes.

[plain]chmod 666 /dev/infiniband/*[/plain]
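
For reference, a minimal sketch of applying the fix across the cluster (it assumes password-less ssh and a hostfile named ./hosts with one node name per line, neither of which appears in this thread):

[plain]# apply the permission change on every node listed in ./hosts
while read node; do
    ssh "$node" 'chmod 666 /dev/infiniband/*'
done < ./hosts[/plain]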

5 Replies
James_T_Intel
Moderator

Hi Vijay,

It seems you are specifying DAPL, but you do not have a DAPL provider.  What is the output from:

[plain]env | grep I_MPI[/plain]

If you have I_MPI_FABRICS set, try running without it.  By default, if I_MPI_FABRICS is set, I_MPI_FALLBACK will be disabled.  You can also set I_MPI_FALLBACK=1, which will enable fallback.  Let me know if this helps.
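
For illustration, here is a minimal sketch of both options (the benchmark name and process count are placeholders, not taken from this thread):

[plain]# Option 1: drop the explicit fabric selection and let Intel MPI choose
unset I_MPI_FABRICS
unset I_MPI_DEVICE
mpirun -n 16 ./IMB-MPI1

# Option 2: keep DAPL preferred but allow fallback to shm/tcp
export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=1
mpirun -n 16 ./IMB-MPI1[/plain]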

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Vijay_Amirtharaj
Beginner

Hi James,

Thanks for the reply. Let me briefly describe our setup before explaining the issue.

We have both an InfiniBand (Mellanox) network and a Gigabit Ethernet network. When we run without setting I_MPI_DEVICE, the job runs fine, but it uses the TCP network. When we try to run with I_MPI_DEVICE=rdma:ofa-v2-ib0, or with combinations such as rdssm or dapl, we hit this issue. In short, we are unable to use the InfiniBand fabric. The InfiniBand network itself works fine, and the Lustre parallel file system runs over the same network.

We have run the InfiniBand tests (ibstat, ibdiagnet, ib_write_bw between two hosts, rdma_rw between two hosts) as well as dapltest, and all of them run fine. For your reference, here are the DEBUG log messages:

DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ehca0-2
node1:c9c:bda7c700: 1216 us(366 us):  open_hca: device mthca0 not found
[56] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2
node1:c9d:20562700: 730 us(730 us):  open_hca: dev open failed for mlx4_0, err=Permission denied
[57] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
node1:c9d:20562700: 853 us(123 us):  open_hca: dev open failed for mlx4_0, err=Permission denied
[57] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
CMA: unable to open RDMA device
[57] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib1
node1:c98:3eabe700: 1776 us(216 us): [51] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1
 open_hca: device ipath0 not found
[52] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ehca0-2
node1:c97:6fb04700: 1865 us(698 us): node1:ca0:bd108700: 815 us(815 us): node1:c96:e98a8700: 2213 us(197 us): node1:c9c:bda7c700: 1298 us(82 us):  open_hca: device mthca0 not found
[56] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-1
node1:c9c:bda7c700: 1406 us(108 us):  open_hca: device ipath0 not found
[56] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-2
node1:c9e:41473700: 825 us(825 us):  open_hca: dev open failed for mlx4_0, err=Permission denied
node1:c98:3eabe700: 1977 us(201 us):  open_hca: device mthca0 not found
[59] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
node1:c9a:ea935700: 746 us(746 us):  open_hca: dev open failed for mlx4_0, err=Permission denied
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
node1:c9a:ea935700: 933 us(187 us):  open_hca: dev open failed for mlx4_0, err=Permission denied
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
CMA: unable to open RDMA device
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib1
CMA: unable to open RDMA device
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1
node1:c9a:ea935700: 1621 us(688 us):  open_hca: device mthca0 not found
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2
 open_hca: device ehca0 not found
[52] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-iwarp
[51] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2
node1:ca0:bd108700: 1053 us(238 us):  open_hca: dev open failed for mlx4_0, err=Permission denied
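
The repeated "err=Permission denied" lines above indicate that the processes cannot open the InfiniBand device nodes. A minimal sketch of checks that expose this on a node (the benchmark name is a placeholder; /etc/dat.conf is the standard DAT provider registry):

[plain]# permissions on the InfiniBand device nodes
ls -l /dev/infiniband/

# DAPL providers registered on this node
cat /etc/dat.conf

# reproduce the startup log with verbose output
export I_MPI_DEBUG=5
mpirun -n 2 ./IMB-MPI1 PingPong[/plain]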


For your information, our setup runs without the OFED stack; all the other components needed for the InfiniBand network to function are in place.

Regards,

Vijay Amirtharaj

James_T_Intel
Moderator

Hi Vijay,

Since you have submitted this issue through Intel® Premier Support, I'll handle it there.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

dingjun_chencmgl_ca

Hi, James,

I have encountered a similar problem on our Windows PC cluster. Could you also tell me the solution to the question above? Thanks, and I look forward to hearing from you.

Dingjun

James_T_Intel
Moderator

This problem was resolved by setting permissions on the InfiniBand* devices on all of the nodes.

[plain]chmod 666 /dev/infiniband/*[/plain]
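
As a side note, a chmod on the device nodes does not persist across reboots. One common way to make it permanent is a udev rule; this is only a sketch under assumptions not stated in the thread (the rule file name is arbitrary, and the kernel device names are the usual uverbs*/umad*/rdma_cm ones):

[plain]# /etc/udev/rules.d/90-ib-permissions.rules  (hypothetical file name)
KERNEL=="uverbs*|umad*|rdma_cm", MODE="0666"[/plain]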
