Bryan_C_1
Beginner

Intel MPI, DAPL and libdaplomcm

Recently, we upgraded our system and have installed Mellanox OFED 2.2-1 in order to support native MPI calls between Xeon Phis.  Our system is a mixture of non-Phi nodes and Phi nodes.

In the course of the upgrade, something seems to have changed in how Intel MPI (v4.1.3) determines which DAPL provider to use for jobs that don't specify a fabric, fabric list, or provider.  Even when the DAPL fabric is chosen explicitly (I_MPI_FABRICS=shm:dapl), we get a message if a specific provider isn't selected.  We do not set any default fabrics or providers via our modules.

The message is:  DAT: library load failure: libdaplomcm.so.2: cannot open shared object file: No such file or directory

This is occurring on non-Phi nodes.  MPSS is only installed on Phi nodes, and thus, libdaplomcm is only on Phi nodes.

According to the Intel MPI reference manual, IMPI will choose the first DAPL provider it finds, but the providers that involve libdaplomcm all appear lower in /etc/dat.conf than the libdaploucm and libdaploscm providers, which we know work and are available in /usr/lib64.

Why is Intel MPI trying to utilize a provider that is listed below other providers?  What changed to make it attempt to use libdaplomcm and not the other providers that are available?
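For reference, the "first provider it finds" behavior from the manual can be sketched roughly as the loop below. This is an illustration only, not Intel's actual code: it walks a sample dat.conf in file order and takes the first entry whose provider library actually exists (the sample entries and a scratch library directory are made up for the demo).

```shell
# Simulate first-usable-provider selection over a dat.conf-style file.
datconf=$(mktemp)
libdir=$(mktemp -d)
cat > "$datconf" <<'EOF'
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
EOF
# Pretend only the scm provider library is installed on this (non-Phi) node.
touch "$libdir/libdaploscm.so.2"

first_usable=""
while read -r name api ts def lib rest; do
  case "$name" in ''|'#'*) continue ;; esac     # skip blanks and comments
  if [ -e "$libdir/$lib" ]; then                # does the provider library exist?
    first_usable=$name
    break
  fi
done < "$datconf"

echo "first usable provider: $first_usable"
```

Under that model, an mcm entry listed below a usable scm/ucm entry should never be the one selected, which is what makes the libdaplomcm load attempts surprising.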

Anyone else seen something like this?

13 Replies
Gergana_S_Intel
Employee

Hey Bryan,

Thanks for getting in touch.  Since you've now installed MPSS and the associated OFED stack, Intel MPI tries to load a couple of extra providers at startup in order to work with your Phi cards.  Intel MPI 4.1.3 is the first version to introduce the DAPL auto-provider mechanism.  The new libdapl*mcm.so library is specific to communication with the Phi card (either from another Phi card or from the Xeon host).

Can you either copy/paste or send me the output of your application with I_MPI_DEBUG=5 set?  You can do that before running your application or on the mpirun command line (by adding -genv I_MPI_DEBUG 5).

If you've upgraded DAPL, I would also recommend upgrading to the latest Intel MPI 5.0.2, if possible.  You don't have to rebuild your application, just update the runtimes since they're backwards compatible.

Thanks and I'll wait to hear back.

Regards,
~Gergana

 

Bryan_C_1
Beginner

Gergana,

Thanks for the response!  Here is what our output looks like with Intel MPI 4.1.3, I_MPI_DEBUG=5, and no additional IMPI variables set.  As you can see, the job finishes, but the 320 "DAT" messages are disconcerting.  It's easy for someone to mistake that message for the cause of a job failure when, in fact, it might be something else.

Executing where_mpi_test (mpirun ./where_mpi.exe)

[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] MPI startup(): 0       17126    node2     {0,16}
[0] MPI startup(): 10      17136    node2     {10,26}
[0] MPI startup(): 11      17137    node2     {11,27}
[0] MPI startup(): 1       17127    node2     {1,17}
[0] MPI startup(): 12      17138    node2     {12,28}
[0] MPI startup(): 13      17139    node2     {13,29}
[0] MPI startup(): 14      17140    node2     {14,30}
[0] MPI startup(): 15      17141    node2     {15,31}
[0] MPI startup(): 16      31550    node1       {0,16}
[0] MPI startup(): 17      31551    node1       {1,17}
[0] MPI startup(): 18      31552    node1       {2,18}
[0] MPI startup(): 19      31553    node1       {3,19}
[0] MPI startup(): 20      31554    node1       {4,20}
[0] MPI startup(): 21      31555    node1       {5,21}
[0] MPI startup(): 2       17128    node2     {2,18}
[0] MPI startup(): 22      31556    node1       {6,22}
[0] MPI startup(): 23      31557    node1       {7,23}
[0] MPI startup(): 24      31558    node1       {8,24}
[0] MPI startup(): 25      31559    node1       {9,25}
[0] MPI startup(): 26      31560    node1       {10,26}
[0] MPI startup(): 27      31561    node1       {11,27}
[0] MPI startup(): 28      31562    node1       {12,28}
[0] MPI startup(): 29      31563    node1       {13,29}
[0] MPI startup(): 30      31564    node1       {14,30}
[0] MPI startup(): 31      31565    node1       {15,31}
[0] MPI startup(): 3       17129    node2     {3,19}
[0] MPI startup(): 4       17130    node2     {4,20}
[0] MPI startup(): 5       17131    node2     {5,21}
[0] MPI startup(): 6       17132    node2     {6,22}
[0] MPI startup(): 7       17133    node2     {7,23}
[0] MPI startup(): 8       17134    node2     {8,24}
[0] MPI startup(): 9       17135    node2     {9,25}
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,11,11,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx4_0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=16:0 0,1 1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11 11,12 12,13 13,14 14,15 15
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[10] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[10] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[10] MPI startup(): shm and dapl data transfer modes
[11] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[11] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[11] MPI startup(): shm and dapl data transfer modes
[12] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[12] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[12] MPI startup(): shm and dapl data transfer modes
[13] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[13] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[13] MPI startup(): shm and dapl data transfer modes
[14] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[14] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[14] MPI startup(): shm and dapl data transfer modes
[15] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[15] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[15] MPI startup(): shm and dapl data transfer modes
[16] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[16] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[16] MPI startup(): shm and dapl data transfer modes
[17] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[17] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[17] MPI startup(): shm and dapl data transfer modes
[18] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[18] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[18] MPI startup(): shm and dapl data transfer modes
[19] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[19] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[19] MPI startup(): shm and dapl data transfer modes
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] MPI startup(): shm and dapl data transfer modes
[20] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[20] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[20] MPI startup(): shm and dapl data transfer modes
[21] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[21] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[21] MPI startup(): shm and dapl data transfer modes
[22] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[22] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[22] MPI startup(): shm and dapl data transfer modes
[23] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[23] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[23] MPI startup(): shm and dapl data transfer modes
[24] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[24] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[24] MPI startup(): shm and dapl data transfer modes
[25] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[25] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[25] MPI startup(): shm and dapl data transfer modes
[26] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[26] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[26] MPI startup(): shm and dapl data transfer modes
[27] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[27] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[27] MPI startup(): shm and dapl data transfer modes
[28] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[28] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[28] MPI startup(): shm and dapl data transfer modes
[29] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[29] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[29] MPI startup(): shm and dapl data transfer modes
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[2] MPI startup(): shm and dapl data transfer modes
[30] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[30] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[30] MPI startup(): shm and dapl data transfer modes
[31] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[31] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[31] MPI startup(): shm and dapl data transfer modes
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[3] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[3] MPI startup(): shm and dapl data transfer modes
[4] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[4] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[4] MPI startup(): shm and dapl data transfer modes
[5] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[5] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[5] MPI startup(): shm and dapl data transfer modes
[6] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[6] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[6] MPI startup(): shm and dapl data transfer modes
[7] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[7] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[7] MPI startup(): shm and dapl data transfer modes
[8] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[8] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[8] MPI startup(): shm and dapl data transfer modes
[9] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[9] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[9] MPI startup(): shm and dapl data transfer modes

DAT: library load failure: libdaplomcm.so.2: cannot open shared object file: No such file or directory
DAT: library load failure: libdaplomcm.so.2: cannot open shared object file: No such file or directory
DAT: library load failure: libdaplomcm.so.2: cannot open shared object file: No such file or directory

... repeated 10 times per task (320 total lines) ...

Rank 0 on Node node2
Rank 1 on Node node2
Rank 2 on Node node2
Rank 3 on Node node2
Rank 4 on Node node2
Rank 5 on Node node2
Rank 6 on Node node2
Rank 7 on Node node2
Rank 8 on Node node2
Rank 9 on Node node2
Rank 10 on Node node2
Rank 11 on Node node2
Rank 12 on Node node2
Rank 13 on Node node2
Rank 14 on Node node2
Rank 15 on Node node2
Rank 16 on Node node1
Rank 17 on Node node1
Rank 18 on Node node1
Rank 19 on Node node1
Rank 20 on Node node1
Rank 21 on Node node1
Rank 22 on Node node1
Rank 23 on Node node1
Rank 24 on Node node1
Rank 25 on Node node1
Rank 26 on Node node1
Rank 27 on Node node1
Rank 28 on Node node1
Rank 29 on Node node1
Rank 30 on Node node1
Rank 31 on Node node1

 

Thanks!

- Bryan

Gergana_S_Intel
Employee

Thanks for the output, Bryan.  I see you're running on 2 nodes.  Are those the only 2 hosts listed in your hosts file?  Or are you running under a job scheduler and that's how mpirun is getting the hosts?

Can you also do "cat /etc/dat.conf" for me?  Sorry, should have asked for this in my earlier response.

I'm trying to figure out the order in which Intel MPI checks the available providers.  It should go down the dat.conf list (as you mention above).  But if the MPSS service is running on your cluster and you have Phi cards installed, it'll try to open up the MIC-specific providers (like lib*mcm).

We could simply comment out those lines in dat.conf.  But if you plan to run on the Phi cards in the future, that might not be the best option.
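If you do want to try that workaround, one way to sketch it is the sed one-liner below. It is shown on a scratch copy with made-up sample entries so the real /etc/dat.conf stays untouched; adapt the path before using it for real.

```shell
# Comment out every dat.conf entry that references libdaplomcm,
# working on a scratch copy rather than the live /etc/dat.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
EOF
sed -i '/libdaplomcm/s/^/#/' "$conf"   # prefix matching lines with '#'
cat "$conf"
```

Note this assumes GNU sed's in-place `-i` flag; on other systems you may need `sed -i ''` or a temp file.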

Regards,
~Gergana

Bryan_C_1
Beginner

Gergana,

Thanks for the quick response!  I'm only running this quick test job on two nodes, but there are a large number of nodes available.  And yes, it's being run under a scheduler that has a wrapper around mpirun.

We have some nodes with Phi cards, but the vast majority of nodes do not.  MPSS is only installed on the Phi nodes.  Does MPSS need to be on every node, regardless of whether it has a Phi?  The MPSS service shouldn't be running on a non-Phi node where it isn't installed, which is what confused me about why IMPI would attempt to use libdaplomcm ahead of other providers.

I was thinking we could comment out the libdaplomcm entries on the non-Phi nodes, but then we'd have to keep a separate /etc/dat.conf version for the Phi nodes, correct?

Here is our /etc/dat.conf file:

# DAT v2.0, v1.2 configuration file
#  
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
#           <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provider, <ia_params> is one of the following:
#       network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0 
# For uDAPL RoCE provider, <ia_params> is device name and 0 
# 

ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-2 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-mic0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mic0:ib 1" ""
ofa-v2-mlx4_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx5_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 2" ""

 

Thanks,
Bryan

 

Gergana_S_Intel
Employee

Hi Bryan,

You're correct, MPSS doesn't have to be installed on all the nodes of your cluster, only on the ones that have Phi cards.  And you should not have to change individual /etc/dat.conf files; we handle heterogeneous clusters, and it would be bad engineering on our part if we made you keep two copies of the system files :)

Re-reading through the thread above, your real worry is that sometimes you see the "DAT: library load failure: libdaplomcm.so.2: cannot open shared object file: No such file or directory" message displayed and you want to suppress those so as not to confuse your users.

I'm checking with the MPSS team and our internal developers to see if that library really is needed on non-Phi nodes and/or if there's a way to suppress the messages.  I'll let you know.

In the meantime, can you run a quick experiment for me and upgrade to the latest Intel MPI Library 5.0.3 (you can grab it from the Intel Registration Center)?  You can install under $HOME as a user and run a quick job.  Let me know if the outcome is the same.

Regards,
~Gergana

Bryan_C_1
Beginner

Gergana,

Our admins recently installed IMPI 5.0.2 based on your earlier feedback.  I was able to run my simple MPI job with that version of IMPI without recompiling it, but it still gave the "DAT:" messages with default settings.

I'll try to get a copy of 5.0.3 installed and see how that version does.

Thanks,
Bryan

Gergana_S_Intel
Employee

Thanks, Bryan.  I wouldn't worry about installing 5.0.3 unless you really want to.  If you've already tested with 5.0.2, that should be enough.

I'm currently having an email discussion with our developers on this.  There might not be an easy fix to suppress these messages.  I'll update you again once I have more info.

Regards,
~Gergana

Gergana_S_Intel
Employee

Hey Bryan,

Ok, might have figured something out with the developers.  Please set "DAT_DBG_DEST=0" in your run and let me know if you still see the error.  You can do:

$ export DAT_DBG_DEST=0
$ mpirun -n 240 ./test

Thanks,
~Gergana

Bryan_C_1
Beginner

Gergana,

That worked!  Setting DAT_DBG_DEST to 0 removed the 'DAT:' messages from the output.  Is this something we need to incorporate into our IMPI modules, or is there an IMPI configuration file in which we could set it?

Thanks,
Bryan

Gergana_S_Intel
Employee

Success!  You can incorporate it into your modules so it's set whenever Intel MPI is loaded.

The other option is to add it to the mpiexec.conf file housed in <intelmpi_install_dir>/etc64.  Just add "-genv DAT_DBG_DEST 0" on a new line in that file.  To be completely honest, I haven't tested this, so give it a try under your user account first.
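The mpiexec.conf change amounts to appending one line. The sketch below demonstrates it on a scratch stand-in for the file (on a real system you would edit the copy under <intelmpi_install_dir>/etc64, as described above); in a Tcl modulefile the equivalent would be `setenv DAT_DBG_DEST 0`.

```shell
# Append the option to a scratch stand-in for mpiexec.conf.
conf=$(mktemp)
echo "-genv DAT_DBG_DEST 0" >> "$conf"
cat "$conf"
```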

From our side, we're adding this additional check within Intel MPI so you shouldn't have to do that in the future.  Not sure yet in which version it'll be implemented.

Let me know how this works out.

~Gergana

Bryan_C_1
Beginner

Thanks so much Gergana!  I'm going to get this tested out and implemented for all of our different IMPI versions.

Bryan_C_1
Beginner

I've tested putting this into $MPIHOME/etc64/mpiexec.conf as you suggested, and it worked with both IMPI 4.1.3 and 5.0.2.  This is a much cleaner solution and should remove the warning messages.  Thanks so much!

Gergana_S_Intel
Employee

Great to hear!  I've also submitted an internal bug request to our development team.  Post again if you hit any other issues.

All the best,
~Gergana
