Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

mpitune only sees part of my cluster

pmonday
Beginner
I am trying to "tune" a 16-node cluster with mpitune. My application runs fine on this segment of the cluster (see a different thread for a problem where other nodes with non-ib0 cards do not participate in the cluster; that's not the problem here).

The debug output from mpitune shows that several of the machines get "Skiped" [sic] during some sort of DNS resolution step.

At first I figured that the process just didn't recognize my host file (which, incidentally, does not contain head-n2), but this looks more like a problem with resolving the names themselves. It is almost as if cluster-n1 and cluster-n11 are treated as duplicates, so the latter gets skipped. The hostnames definitely resolve to different IP addresses.

Is there another reason these nodes may get skipped?
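To illustrate my hunch, here is a rough Python sketch (purely hypothetical, not the actual mpitune code) of how a prefix-based "already exists" check would skip exactly the nodes flagged in the log below, while an exact-match check would not:

# Hypothetical reconstruction of the suspected bug -- not the real mpitune source.
hosts = ['cluster-n0', 'cluster-n1', 'cluster-n10', 'cluster-n11',
         'cluster-n12', 'cluster-n13', 'cluster-n14', 'cluster-n15',
         'cluster-n2', 'cluster-n3', 'cluster-n4', 'cluster-n5',
         'cluster-n6', 'cluster-n7', 'cluster-n8', 'cluster-n9', 'head-n2']

def dedup_prefix(names):
    # Buggy: 'cluster-n10' starts with the already-kept 'cluster-n1',
    # so it is reported as a duplicate and skipped.
    kept = []
    for name in names:
        if any(name.startswith(k) for k in kept):
            print('Item %s already exists. Skiped.' % name)
        else:
            kept.append(name)
    return kept

def dedup_exact(names):
    # Correct: only an exact hostname match counts as a duplicate.
    kept = []
    for name in names:
        if name not in kept:
            kept.append(name)
    return kept

print(len(dedup_prefix(hosts)))  # 11 hosts -- matches the log below
print(len(dedup_exact(hosts)))   # 17 hosts -- what I expected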

[root@head-n2 mp_linpack]# mpitune -hf nodes -fl rdma -dl -pr 128 -hr 16 --application \"mpiexec -machinefile nodes -np 128 /mnt/shared/apps/mp_linpack/xhpl_intel64\" -of ./linpack.conf -d
20'May'11 09:11:51 INF | Starting. Please wait...

20'May'11 09:11:51 | MPITune started at 20 May'11 (Friday) 15:11:51
20'May'11 09:11:51 | MPITune has been started by: root
20'May'11 09:11:51 | Preparing tuner's components...
20'May'11 09:11:51 DBG | Session's ID is 1305904311
20'May'11 09:11:51 DBG | MPITuner has been executed by follow command: ' /opt/intel/impi/4.0.1.007/intel64/bin/tune/mpitune -hf nodes -fl rdma -dl -pr 128 -hr 16 --application "mpiexec -machinefile nodes -np 128 /mnt/shared/apps/mp_linpack/xhpl_intel64" -of ./linpack.conf -d'
20'May'11 09:11:51 | Initialization of signals handlers...
20'May'11 09:11:51 | Start catching signal with code 15 (SIGTERM) ...
20'May'11 09:11:51 | Success.
20'May'11 09:11:51 | Start catching signal with code 2 (SIGINT) ...
20'May'11 09:11:51 | Success.
20'May'11 09:11:51 | Initialization of signals handlers completed.
20'May'11 09:11:51 DBG | Extracted tuner's executable part of run line: '/opt/intel/impi/4.0.1.007/intel64/bin/tune/mpitune'
20'May'11 09:11:51 DBG | Parsed command line arguments' dictionary:
{
'application' : 'mpiexec -machinefile nodes -np 128 /mnt/shared/apps/mp_linpack/xhpl_intel64'
'dl' : ''
'fl' : 'rdma'
'hf' : 'nodes'
'hr' : '16'
'of' : './linpack.conf'
'pr' : '128'
}
20'May'11 09:11:51 DBG | Initialization of configurator object...
20'May'11 09:11:51 ERR | CFileManager::CreateEmptyFile()
local variable 'dir_path' referenced before assignment
20'May'11 09:11:51 DBG | I_MPI_ROOT variable was found in environment ('/opt/intel/impi/4.0.1.007')
20'May'11 09:11:51 DBG | Checking IMPI root directory...
20'May'11 09:11:51 DBG | Checking IMPI root directory complete.
20'May'11 09:11:51 | Obtained following information about Intel MPI Library:
MPI Root : /opt/intel/impi/4.0.1.007
MPI Bin : /opt/intel/impi/4.0.1.007/bin64
MPI Version: 4.0
MPI Build : 20100910
mpiexec : /opt/intel/impi/4.0.1.007/bin64/mpiexec
20'May'11 09:11:51 DBG | Short version of MPI library is '4'
20'May'11 09:11:51 | No batch system has been detected.
20'May'11 09:11:51 DBG | Clear hostname/IP list (uniq=False)
20'May'11 09:11:51 DBG | Source list:
{
'cluster-n0'
'cluster-n1'
'cluster-n10'
'cluster-n11'
'cluster-n12'
'cluster-n13'
'cluster-n14'
'cluster-n15'
'cluster-n2'
'cluster-n3'
'cluster-n4'
'cluster-n5'
'cluster-n6'
'cluster-n7'
'cluster-n8'
'cluster-n9'
'head-n2'
}
20'May'11 09:11:51 DBG | Work with cluster-n0
20'May'11 09:11:51 DBG | cluster-n0 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n0 passed
20'May'11 09:11:51 DBG | Work with cluster-n1
20'May'11 09:11:51 DBG | cluster-n1 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n1 passed
20'May'11 09:11:51 DBG | Work with cluster-n10
20'May'11 09:11:51 DBG | cluster-n10 is the DNS-name
20'May'11 09:11:51 DBG | Item cluster-n10 already exists. Skiped.
20'May'11 09:11:51 DBG | Work with cluster-n11
20'May'11 09:11:51 DBG | cluster-n11 is the DNS-name
20'May'11 09:11:51 DBG | Item cluster-n11 already exists. Skiped.
20'May'11 09:11:51 DBG | Work with cluster-n12
20'May'11 09:11:51 DBG | cluster-n12 is the DNS-name
20'May'11 09:11:51 DBG | Item cluster-n12 already exists. Skiped.
20'May'11 09:11:51 DBG | Work with cluster-n13
20'May'11 09:11:51 DBG | cluster-n13 is the DNS-name
20'May'11 09:11:51 DBG | Item cluster-n13 already exists. Skiped.
20'May'11 09:11:51 DBG | Work with cluster-n14
20'May'11 09:11:51 DBG | cluster-n14 is the DNS-name
20'May'11 09:11:51 DBG | Item cluster-n14 already exists. Skiped.
20'May'11 09:11:51 DBG | Work with cluster-n15
20'May'11 09:11:51 DBG | cluster-n15 is the DNS-name
20'May'11 09:11:51 DBG | Item cluster-n15 already exists. Skiped.
20'May'11 09:11:51 DBG | Work with cluster-n2
20'May'11 09:11:51 DBG | cluster-n2 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n2 passed
20'May'11 09:11:51 DBG | Work with cluster-n3
20'May'11 09:11:51 DBG | cluster-n3 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n3 passed
20'May'11 09:11:51 DBG | Work with cluster-n4
20'May'11 09:11:51 DBG | cluster-n4 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n4 passed
20'May'11 09:11:51 DBG | Work with cluster-n5
20'May'11 09:11:51 DBG | cluster-n5 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n5 passed
20'May'11 09:11:51 DBG | Work with cluster-n6
20'May'11 09:11:51 DBG | cluster-n6 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n6 passed
20'May'11 09:11:51 DBG | Work with cluster-n7
20'May'11 09:11:51 DBG | cluster-n7 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n7 passed
20'May'11 09:11:51 DBG | Work with cluster-n8
20'May'11 09:11:51 DBG | cluster-n8 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n8 passed
20'May'11 09:11:51 DBG | Work with cluster-n9
20'May'11 09:11:51 DBG | cluster-n9 is the DNS-name
20'May'11 09:11:51 DBG | Check of cluster-n9 passed
20'May'11 09:11:51 DBG | Work with head-n2
20'May'11 09:11:51 DBG | head-n2 is the DNS-name
20'May'11 09:11:51 DBG | Check of head-n2 passed
20'May'11 09:11:51 DBG | Nodes dictionary:
{
'cluster-n0' : '['10.10.10.20']'
'cluster-n1' : '['10.10.10.21']'
'cluster-n2' : '['10.10.10.22']'
'cluster-n3' : '['10.10.10.23']'
'cluster-n4' : '['10.10.10.24']'
'cluster-n5' : '['10.10.10.25']'
'cluster-n6' : '['10.10.10.26']'
'cluster-n7' : '['10.10.10.27']'
'cluster-n8' : '['10.10.10.28']'
'cluster-n9' : '['10.10.10.29']'
'head-n2' : '['10.10.10.12']'
}
20'May'11 09:11:51 DBG | Check for uniq IP...
20'May'11 09:11:51 DBG | Skiped.
20'May'11 09:11:51 DBG | Complete.
20'May'11 09:11:51 WRN | Left margin of the range of 'hosts-number' argument (16) is greater than maximum from the availiable range [1:11]. Cutted for maximum.
20'May'11 09:11:51 WRN | Right margin of the range of 'hosts-number' argument (16) is greater than maximum from the availiable range [1:11]. Cutted for maximum.
20'May'11 09:11:51 DBG | Saving hosts list (['cluster-n0', 'cluster-n1', 'cluster-n2', 'cluster-n3', 'cluster-n4', 'cluster-n5', 'cluster-n6', 'cluster-n7', 'cluster-n8', 'cluster-n9', 'head-n2']) to file (/mnt/shared/apps/mp_linpack/mpitune_out/mpituner_1305904311.hosts).
20'May'11 09:11:51 DBG | Saving hosts list to file completed.
20'May'11 09:11:51 DBG | Nodes into the stored local host file:
{
'cluster-n0'
'cluster-n1'
'cluster-n2'
'cluster-n3'
'cluster-n4'
'cluster-n5'
'cluster-n6'
'cluster-n7'
'cluster-n8'
'cluster-n9'
'head-n2'
}
20'May'11 09:11:51 | Bringing up MPD ring...
20'May'11 09:11:51 | Using command: mpdboot -n 11 -f /mnt/shared/apps/mp_linpack/mpitune_out/mpituner_1305904311.hosts -r ssh -o

2 Replies
Dmitry_K_Intel2
Employee
Hi Paul,

Thank you for pointing this out. This is a real bug.
As a workaround, you can use the --skip-check-hosts option.
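For example, added to your original run line it would look something like this (only --skip-check-hosts is new; the rest of the options are copied unchanged from your command above):

mpitune --skip-check-hosts -hf nodes -fl rdma -dl -pr 128 -hr 16 --application \"mpiexec -machinefile nodes -np 128 /mnt/shared/apps/mp_linpack/xhpl_intel64\" -of ./linpack.conf -d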

Regards!
Dmitry

pmonday
Beginner
Thank you Dmitry, this change helped quite a bit, and with the --skip-check-hosts option I was able to see my complete cluster. I'm still not getting good results, but at least I'm getting farther in the process. I'll post a separate thread for the next issue, though, as it is not related to this one.

I appreciate your response and help :)