Intel® MPI Library

bad filename - /etc/dat.conf

Rafał_Błaszczyk
Hello,
We've got an HPCC cluster using InfiniBand (Mellanox ConnectX-2) with OFED 1.5 and Intel MPI 4.0.
While running an MPI binary, we got:
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
[10] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[15] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[9] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[12] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[13] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[11] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[8] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[14] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[8] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 8:wn2
[9] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 9:wn2
[10] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 10:wn2
[11] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 11:wn2
[14] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 14:wn2
[15] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 15:wn2
[12] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 12:wn2
[13] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 13:wn2
[0] dapl fabric is not available and fallback fabric is not enabled
[1] dapl fabric is not available and fallback fabric is not enabled
[2] dapl fabric is not available and fallback fabric is not enabled
[3] dapl fabric is not available and fallback fabric is not enabled
[6] dapl fabric is not available and fallback fabric is not enabled
[7] dapl fabric is not available and fallback fabric is not enabled
rank 7 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 7: return code 254 
[8] MPI startup(): shm and dapl data transfer modes
[15] MPI startup(): shm and dapl data transfer modes
rank 3 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 3: return code 254 
rank 2 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 2: return code 254 
rank 1 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 1: return code 254 
rank 0 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 0: return code 254 


dat.conf is the same on all compute nodes because it lives on a shared network file system (NFS; the servers are diskless), and there was nothing wrong with NFS at the time.

Here is my dat.conf:
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
ofa-v2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""

What could be the reason for this problem?

There is new syntax for I_MPI_DEVICE in Intel MPI 4.0, but I was still using the old syntax (I_MPI_DEVICE="rdssm:ofa-v2-mlx4_0-1u"). Should it work the same as in Intel MPI 3?
Probably we should use the new syntax, but which fabrics would you suggest in this case, and why? How does "ofa" differ in practice from the various DAPL providers? As far as I understand, ofa doesn't use DAPL at all?
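For reference, this is roughly how the job is launched at the moment, still with the old-style device selection (the binary name and rank count are just placeholders):

export I_MPI_DEVICE=rdssm:ofa-v2-mlx4_0-1u
mpirun -n 16 ./a.out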
Dmitry_K_Intel2
Employee
Hi Rafal,

In the Intel MPI Library 4.0 you can still use I_MPI_DEVICE, but only for the rdma and rdssm fabrics. ofa-v2-mlx4_0-1u should work with the ofa fabric.

To use the ofa fabric, set I_MPI_FABRICS=ofa or I_MPI_FABRICS=shm:ofa.

To use DAPL, set I_MPI_FABRICS=shm:dapl.
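For example, on the mpirun command line (passing the variable with -genv is just one way to do it; exporting it in your job script works as well, and the rank count and binary name are placeholders):

mpirun -n 16 -genv I_MPI_FABRICS shm:ofa ./a.out
mpirun -n 16 -genv I_MPI_FABRICS shm:dapl ./a.out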

If your dat.conf file is not located in /etc directory, please use DAT_OVERRIDE env variable.
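For example, you can point DAT_OVERRIDE at a copy of the file on your shared file system and propagate it to all ranks with -genv (the path here is only an illustration):

mpirun -n 16 -genv DAT_OVERRIDE /shared/etc/dat.conf -genv I_MPI_FABRICS shm:dapl ./a.out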

I hope this helps.

Regards!
Dmitry
Rafał_Błaszczyk
Hi Dmitry, thanks for clearing things up. The ofa interface is not well documented, but AFAIK it has multi-rail support, which DAPL doesn't.
I believe that if DAPL works OK, there is no reason to switch to ofa.
Regards
Rafał_Błaszczyk
"If your dat.conf file is not located in /etc directory, please use DAT_OVERRIDE env variable."
My dat.conf was located in /etc on every node, but I was running a binary compiled with Intel MPI 3.2 under mpirun from Intel MPI 4.0.
Could that be the reason for some of the problems? Is there binary compatibility between Intel MPI 3.2 and 4.0?
Dmitry_K_Intel2
Employee
Hi Rafal,

Yes, 4.0 should be binary compatible with the 3.2 library, but I'd recommend recompiling if possible. You can also check which libraries the binary actually links against with the ldd command.
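For example (the binary name is just a placeholder):

ldd ./a.out | grep -i mpi

If the output still points at the 3.2 libmpi, source the 4.0 mpivars.sh (or adjust LD_LIBRARY_PATH) before launching.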

OFA supports multi-rail, and you can use the I_MPI_OFA_NUM_ADAPTERS variable to set the number of interconnects on your nodes.
If you have multi-port cards, you need to set the I_MPI_OFA_NUM_PORTS environment variable as well.
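For example, for nodes with two adapters and dual-port cards (the counts below are only an illustration, adjust them to your hardware):

export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_NUM_ADAPTERS=2
export I_MPI_OFA_NUM_PORTS=2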

Please let me know if the issue still persists.

Regards!
Dmitry