Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Unwanted output

L__D__Marks
New Contributor II

I have a cluster with some E5410 nodes and some E5-2660 nodes, all InfiniBand connected, running Intel MPI. Everything is working, but the E5410 nodes are producing a lot of unwanted output of the form (condensed, as there is one entry for every core):

node04.cluster:723a:f24164b0: 1094 us(1094 us): open_hca: device mlx4_0 not found

node04.cluster:723a:f24164b0: 28485 us(28485 us): open_hca: getaddr_netdev ERROR: No such file or directory. Is ib0 configured?

node04.cluster:723a:f24164b0: 52940 us(24455 us): open_hca: getaddr_netdev ERROR: Cannot assign requested address. Is ib1 configured?

The MPI tasks are running fine, so this output is more an annoyance than a problem, and there should be a way to avoid it. Suggestions?

James_T_Intel
Moderator

Hi,

Please check the value of the environment variable I_MPI_DEBUG.
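
For example, from a bash shell on the node where you launch the job (a minimal check; adjust for your shell):

echo $I_MPI_DEBUG
env | grep I_MPI_DEBUG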

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

L__D__Marks
New Contributor II

env | grep -e MPI (bash)

Shows only I_MPI_HYDRA_DEBUG=0

James_T_Intel
Moderator

Hi,

That seems odd.  I_MPI_ROOT should usually be set, and setting I_MPI_HYDRA_DEBUG=0 is simply setting the default value.  Are you setting I_MPI_DEBUG on the command line?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

James_T_Intel
Moderator

Hi,

Actually, try unsetting DAPL_DBG.  Those messages are not coming from the Intel® MPI Library, but from the DAPL provider.
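
For example, in a bash shell (and in any startup file such as ~/.bashrc where it might be exported):

env | grep DAPL_DBG   # confirm whether it is currently set
unset DAPL_DBG        # remove it from the current environment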

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

L__D__Marks
New Contributor II

I_MPI_ROOT is set, but since you did not ask I did not mention that.

I am not setting I_MPI_DEBUG on the command line. Worth remembering: the output is only occurring on the older E5410 machines, not on the newer E5-2660 nodes.

If relevant, uname -a returns (first for the older, then the newer):

Linux node01.cluster 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

Linux node20.cluster 2.6.32-279.9.1.el6.x86_64 #1 SMP Tue Sep 25 21:43:11 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

L__D__Marks
New Contributor II

I set and exported the variables you suggested (in the email you sent on the other thread):

env | grep -e I_MPI
I_MPI_DAPL_UD=enable
I_MPI_HYDRA_DEBUG=0
I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
I_MPI_DAPL_UD_RDMA_MIXED=enable
I_MPI_ROOT=/opt/intel/impi/4.1.0.024

I still get the unwanted output. I did find one thing: if I use just one node I do not get the output; it only appears when I use more than one of the older nodes. The unwanted output is easier to test with a short job (albeit still with the complicated code).
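
(For completeness, a sketch of how the variables were exported before launching; the mpirun line below is only illustrative, not my actual command:)

export I_MPI_DAPL_UD=enable
export I_MPI_DAPL_UD_RDMA_MIXED=enable
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
# equivalently, passed for a single run:
mpirun -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1u -n 16 ./my_app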

James_T_Intel
Moderator

Hi,

Those settings were for the other thread, regarding the slowdown.  Here, try unsetting DAPL_DBG.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

L__D__Marks
New Contributor II

DAPL_DBG is not set. Should it be set/unset in the command line?

I will try the other settings (for the other thread), but the nodes are currently in use, so it will be some time (a day or more) before I can test that. Worse, the test itself takes 24 hours.
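
(For what it's worth, a quick check that it is not set on one of the older compute nodes either, assuming passwordless ssh, would be:)

ssh node04 'env | grep DAPL'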

James_T_Intel
Moderator

Hi,

It should not be set.  Is it possible that the DAPL providers on the older nodes are compiled with debug information?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

L__D__Marks
New Contributor II

The vendor of the cluster compiled OFED (which I think is what would be relevant, but I am not sure). Prior to using Intel MPI I was using mvapich (and also openmpi), and with neither of these did I see anything similar, which suggests that debug information was not part of the compilation.

N.B., the older and newer nodes are on the same network with the same head node, although they are physically connected to different switches, with the newer switch daisy-chained to the older one.

James_T_Intel
Moderator

Hi,

Ok.  I'll check with our DAPL developer to see if he has any ideas regarding why you would be getting these messages.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

L__D__Marks
New Contributor II

The answer can be found at http://permalink.gmane.org/gmane.linux.drivers.rdma/4787; it appears that debug output was compiled in. I guess Intel MPI tries various options and, if they fail, moves on. Setting the environment variable DAPL_DBG_TYPE to 0 removes the output.
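
In case it helps anyone else, this is all that was needed (bash; the mpirun line is only an example of setting it for a single run):

export DAPL_DBG_TYPE=0
# or just for one run:
mpirun -genv DAPL_DBG_TYPE 0 -n 16 ./my_app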

James_T_Intel
Moderator

Understood.  I'm glad everything is working correctly now.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

sara_a_
Beginner

I have an urgent question. I have a cluster of 6 nodes running CentOS 7.

When I set I_MPI_DEBUG to 1, a simple MPI program (matrix multiplication) running on the CPUs only printed the following error:

[15] MPI startup(): cannot open dynamic library libdat2.so.2

 
To get rid of that error, I ran "yum install dapl-static.x86_64" on the master node only. This installed the missing libraries.
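
(As a quick sanity check that the library is now visible on a given node, assuming a standard library path, one can run:)

ldconfig -p | grep libdat2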
But, I got those unwanted output errors (here are a few lines):
marcher5:SCM:345a:44a23f80: 27 us(27 us):  open_hca: ibv_get_device_list() failed
marcher5:SCM:345a:44a23f80: 29 us(29 us):  open_hca: ibv_get_device_list() failed
...
Now, all MPI programs run normally but print those errors first. I can suppress them by setting DAPL_DBG_TYPE=0.
 
The question is, will this error affect the performance of MPI programs? I'm running the SPEC MPI2007 benchmark and not getting a good speedup; could this be the cause? Please note I am not using Intel Xeon Phi at all.
 
James_T_Intel
Moderator

Sara,

In the future, please start a new thread for separate issues.  This allows us to better address and track issues.

These errors are in the fabric selection process.  You should only see minor delays in the initialization step.  If you look at the debug output with I_MPI_DEBUG set to at least 2, you will see the full fabric selection process.  You can select the DAPL provider yourself using I_MPI_DAPL_PROVIDER, which will bypass attempts to use the other providers.  You can also reorder the entries in your /etc/dat.conf file to put the desired provider first.
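
For example, a minimal sketch (the provider name below is only an example; use one that is listed in your /etc/dat.conf, and replace the application with your own):

export I_MPI_DEBUG=2                         # show the fabric/provider selection at startup
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u  # skip probing the other providers
mpirun -n 16 ./my_app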

James.
