I have a cluster with some E5410 nodes and some E5-2660 nodes, all InfiniBand-connected, running Intel MPI (impi). Everything is working, but the E5410 nodes are producing a lot of unwanted output of the form (condensed, as there is one entry for every core):
node04.cluster:723a:f24164b0: 1094 us(1094 us): open_hca: device mlx4_0 not found
node04.cluster:723a:f24164b0: 28485 us(28485 us): open_hca: getaddr_netdev ERROR: No such file or directory. Is ib0 configured?
node04.cluster:723a:f24164b0: 52940 us(24455 us): open_hca: getaddr_netdev ERROR: Cannot assign requested address. Is ib1 configured?
The MPI tasks are running fine, so this output is more of an annoyance than a problem, and there should be a way to suppress it. Suggestions?
Hi,
Please check the value of the environment variable I_MPI_DEBUG.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Running env | grep -e MPI (bash) shows only:
I_MPI_HYDRA_DEBUG=0
Hi,
That seems odd. I_MPI_ROOT should usually be set, and I_MPI_HYDRA_DEBUG=0 is simply the default value. Are you setting I_MPI_DEBUG on the command line?
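For reference, the two usual ways it gets set look like this (illustrative only; the rank count and executable name below are placeholders):
export I_MPI_DEBUG=2            # set in the shell or job script, inherited by mpirun
mpirun -n 16 ./your_app
or, for a single run, without touching the shell environment:
mpirun -genv I_MPI_DEBUG 2 -n 16 ./your_app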
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi,
Actually, try unsetting DAPL_DBG. Those messages are not coming from the Intel® MPI Library, but from the DAPL provider.
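For example (plain shell, nothing Intel-specific):
unset DAPL_DBG
env | grep -i dapl    # confirm no DAPL_* debug variables remain set in the launching shell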
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
I_MPI_ROOT is set, but since you did not ask about it, I did not mention it.
I am not setting I_MPI_DEBUG on the command line. Worth remembering: the output only occurs on the older E5410 machines, not on the newer E5-2660 machines.
If relevant, uname -a returns (for an older node, then a newer one):
Linux node01.cluster 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux node20.cluster 2.6.32-279.9.1.el6.x86_64 #1 SMP Tue Sep 25 21:43:11 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
I set and exported the variables you suggested (in the email you sent on the other thread):
env | grep -e I_MPI
I_MPI_DAPL_UD=enable
I_MPI_HYDRA_DEBUG=0
I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
I_MPI_DAPL_UD_RDMA_MIXED=enable
I_MPI_ROOT=/opt/intel/impi/4.1.0.024
I still get the unwanted output. I did find one thing: if I use just a single node I do not get the output; it appears only when I use more than one of the older nodes. At least the unwanted output is easy to reproduce with a short job (albeit still running the complicated code).
Hi,
Those settings were for the other thread, regarding the slowdown. Here, try unsetting DAPL_DBG.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
DAPL_DBG is not set. Should it be set/unset on the command line?
I will try the other settings for the other thread, but the nodes are currently in use, so it will be some time (a day or more) before I can test that. Worse, the test itself takes 24 hours.
Hi,
It should not be set. Is it possible that the DAPL providers on the older nodes are compiled with debug information?
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
The cluster vendor compiled OFED (which I think is the relevant component, though I am not sure). Prior to using Intel MPI I was using MVAPICH (and also Open MPI), and with neither of those did I see anything similar, which suggests that debug information was not part of the compilation.
N.B., the older and newer nodes are on the same network with the same head node, although they are physically connected to different switches, with the newer switch daisy-chained to the older one.
Hi,
Ok. I'll check with our DAPL developer to see if he has any ideas regarding why you would be getting these messages.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
The answer can be found at http://permalink.gmane.org/gmane.linux.drivers.rdma/4787: it appears that debug output was compiled in. I guess Intel MPI probes the various providers and, if one fails, moves on to the next. Setting the environment variable DAPL_DBG_TYPE to 0 removes the output.
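For anyone hitting the same issue, a minimal sketch of the fix (the rank count and executable name are placeholders, and depending on your launcher settings the variable may need to be propagated explicitly):
export DAPL_DBG_TYPE=0          # silence the DAPL provider's debug output
mpirun -n 64 ./your_app
or, to pass it explicitly to every rank for a single run:
mpirun -genv DAPL_DBG_TYPE 0 -n 64 ./your_app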
Understood. I'm glad everything is working correctly now.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
I have an urgent question. I have 6 nodes running CentOS 7.
When I set I_MPI_DEBUG to 1, a simple MPI program (matrix multiplication) running on CPUs only printed the following errors:
[15] MPI startup(): cannot open dynamic library libdat2.so.2
marcher5:SCM:345a:44a23f80: 29 us(29 us): open_hca: ibv_get_device_list() failed
Sara,
In the future, please start a new thread for separate issues. This allows us to better address and track issues.
These errors are in the fabric selection process. You should only see minor delays in the initialization step. If you look at the debug output with I_MPI_DEBUG set to at least 2, you will see the full fabric selection process. You can select the DAPL provider yourself using I_MPI_DAPL_PROVIDER, which bypasses the attempts to use other providers. You can also reorder the entries in your /etc/dat.conf file to put the desired provider first.
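For example (the provider name here is the one mentioned earlier in this thread; substitute whichever entry appears in your own /etc/dat.conf):
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
grep -v '^#' /etc/dat.conf      # list the providers defined on this node, in the order they are tried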
James.
