Hello,
We have an HPC cluster using InfiniBand (Mellanox ConnectX-2) with OFED 1.5 and Intel MPI 4.0.
While running an MPI binary, we got:
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
[10] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[15] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[9] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[12] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[13] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[11] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[8] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[14] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[8] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 8:wn2
[9] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 9:wn2
[10] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 10:wn2
[11] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 11:wn2
[14] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 14:wn2
[15] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 15:wn2
[12] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 12:wn2
[13] MPI startup(): DAPL provider on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 13:wn2
[0] dapl fabric is not available and fallback fabric is not enabled
[1] dapl fabric is not available and fallback fabric is not enabled
[2] dapl fabric is not available and fallback fabric is not enabled
[3] dapl fabric is not available and fallback fabric is not enabled
[6] dapl fabric is not available and fallback fabric is not enabled
[7] dapl fabric is not available and fallback fabric is not enabled
rank 7 in job 1 wn1_34032 caused collective abort of all ranks
exit status of rank 7: return code 254
[8] MPI startup(): shm and dapl data transfer modes
[15] MPI startup(): shm and dapl data transfer modes
rank 3 in job 1 wn1_34032 caused collective abort of all ranks
exit status of rank 3: return code 254
rank 2 in job 1 wn1_34032 caused collective abort of all ranks
exit status of rank 2: return code 254
rank 1 in job 1 wn1_34032 caused collective abort of all ranks
exit status of rank 1: return code 254
rank 0 in job 1 wn1_34032 caused collective abort of all ranks
exit status of rank 0: return code 254
dat.conf is the same on all compute nodes because it resides on a shared NFS filesystem (the servers are diskless), and there was nothing wrong with NFS at the time.
Here is my dat.conf:
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
ofa-v2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
What could be the reason for this problem? There is a new syntax replacing I_MPI_DEVICE in Intel MPI 4.0, but I was still using the old syntax (I_MPI_DEVICE="rdssm:ofa-v2-mlx4_0-1u"). Should it work the same as in Intel MPI 3.x?
Probably we should use the new syntax, but which fabrics would you suggest in this case, and why? How does "ofa" differ in practice from the various DAPL providers? As far as I understand, ofa doesn't use DAPL at all?
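For completeness, the job was launched roughly like this (the rank count matches the log above; the binary name is only a placeholder, not our real application):
export I_MPI_DEVICE=rdssm:ofa-v2-mlx4_0-1u   # old Intel MPI 3.x-style device syntax
mpirun -n 16 ./my_app                        # ./my_app stands in for the real binary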
Hi Rafal,
In the Intel MPI Library 4.0 you can still use I_MPI_DEVICE, but only for the rdma and rdssm fabrics. ofa-v2-mlx4_0-1u should work with the ofa fabric.
To use the ofa fabric, set I_MPI_FABRICS=ofa or I_MPI_FABRICS=shm:ofa.
To use DAPL, set I_MPI_FABRICS=shm:dapl.
If your dat.conf file is not located in the /etc directory, please use the DAT_OVERRIDE environment variable.
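For example, something along these lines (./my_app and the dat.conf path are placeholders; if I recall correctly, I_MPI_DAPL_PROVIDER can be used to pick the provider explicitly):
# DAPL over InfiniBand, shared memory inside a node
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
# or OFA (direct verbs, no DAPL layer):
# export I_MPI_FABRICS=shm:ofa
# if dat.conf is not in /etc:
# export DAT_OVERRIDE=/path/to/dat.conf
mpirun -n 16 ./my_app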
I hope this helps.
Regards!
Dmitry
Hi Dmitry, thanks for clearing things up. The ofa interface is not well documented, but AFAIK it has multi-rail support, which DAPL doesn't.
I believe that if DAPL works OK, there is no reason to switch to ofa.
Regards
"If your dat.conf file is not located in /etc directory, please use DAT_OVERRIDE env variable."
My dat.conf was located in /etc everywhere but I was using binary compiled with MPI 3.2 with mpirun from MPI 4.
Could that be a reason of some problems? Is there something like binary compability between MPI 3.2 and MPI 4?
Hi Rafal,
Yes, the 4.0 library should be binary compatible with 3.2, but I'd recommend recompiling if possible. You can also check the linked libraries with the ldd command.
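For example (./my_app is a placeholder for your binary):
ldd ./my_app | grep -i mpi   # shows which libmpi the binary will actually load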
OFA supports multi-rail, and you can use the I_MPI_OFA_NUM_ADAPTERS variable to set the number of interconnect adapters on your nodes.
If you have multi-port cards, you also need to set the I_MPI_OFA_NUM_PORTS environment variable.
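For example, for nodes with two single-port HCAs (the numbers here are only an illustration, adjust them to your hardware):
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_NUM_ADAPTERS=2   # InfiniBand adapters per node
export I_MPI_OFA_NUM_PORTS=1      # ports used on each adapter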
Please let me know if the issue still persists.
Regards!
Dmitry
