Is it possible to run tasks on compute nodes that have InfiniBand HCAs from a master node that lacks an IB HCA, using Torque/Grid Engine?
Please advise whether this is possible.
Intel MPI 4.1.1.036 is installed on all cluster machines.
The network configuration is as follows:
Master Node (x1): 2x Xeon E5-2450 / 96 GB / CentOS 6.2 / NFS services over Ethernet
Compute Nodes (x4): 2x Xeon E5-2450 / 96 GB / TrueScale QDR dual-port QLE7342 / CentOS 6.2 / NFS clients over GbE
IP Addresses (GbE) : Master Node: 10.0.0.221 - Hostname: mnode; Compute Nodes: 10.0.0.222 .. 225 - Hostnames: c00 .. c03
IP Address (ib0): Master Node: N/A; Compute Nodes: 192.168.10.222 .. 225 - /etc/hosts -> c00-ib; c01-ib; c02-ib; c03-ib
Additionally, if mpiexec.hydra can be used, what is the command line to run directly from the master node, without Torque or Grid Engine?
Girish Nair <girishnairisonline at gmail dot com>
Running outside of the job scheduler will depend on your system's policy. If it is set up to allow it, then there should be no problem using mpiexec.hydra (or mpirun, which defaults to mpiexec.hydra) to run. Simply specify your hosts and ranks as you normally would. For additional information, see the article Controlling Process Placement with the Intel® MPI Library.
If you are going to use InfiniBand* for your job nodes, but are launching from a system without IB, you will need to specify the network interface using either the -iface command line option or the I_MPI_HYDRA_IFACE environment variable. You'll likely want to use eth0, but this can vary depending on your system configuration.
Also, do not use the IB host names to start your job. Hydra will attempt to connect via ssh first, which needs to happen through the standard IP channel. It will handle switching to the IB fabric for your job. If you want to verify that it correctly launched using IB, run with I_MPI_DEBUG=2 to get fabric selection information.
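Putting the advice above together, here is a minimal sketch of a launch from mnode (the GbE hostnames c00..c03 come from this thread; the binary name ./main.out and the 16-rank / 4-per-node layout are placeholders). The script only echoes the command so it can be inspected; on the actual cluster you would run the command directly:

```shell
# Sketch: launch 16 ranks (4 per node) from mnode, which has no IB HCA.
# Use the GbE hostnames (c00..c03) so Hydra's initial ssh step succeeds,
# -iface eth0 so Hydra binds its launch traffic to the Ethernet interface,
# and I_MPI_DEBUG=2 so the library reports which fabric it selected.
LAUNCH='I_MPI_DEBUG=2 mpiexec.hydra -n 16 -ppn 4 -hosts c00,c01,c02,c03 -iface eth0 ./main.out'
echo "$LAUNCH"   # echoed here as a sketch; run it directly on the cluster
```

With I_MPI_DEBUG=2 set, the startup output will show which fabric each rank selected, confirming whether IB was actually used.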
Technical Consulting Engineer
Intel® Cluster Tools
If tmi is always better than DAPL for you, you can set I_MPI_FABRICS=shm:tmi in your environment rather than having to pass it every time. As for launching, unless you have an interface named ib0 on your master node, you'll want to use:
[plain]mpirun -n 16 -machinefile ./machine.cluster -iface eth0 ./main.out[/plain]
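For example, on the launch node you could persist the fabric setting in your environment (a sketch; shm:tmi is the value suggested above):

```shell
# Set the fabric choice once in the environment instead of passing it on
# every launch: shm within a node, tmi (TrueScale IB) between nodes.
export I_MPI_FABRICS=shm:tmi
echo "I_MPI_FABRICS=$I_MPI_FABRICS"
# mpirun then picks this up automatically, e.g.:
# mpirun -n 16 -machinefile ./machine.cluster -iface eth0 ./main.out
```

Adding the export to your shell startup file makes it the default for all future runs.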
The machinefile you have is correct. Now, keep in mind, if you run this job with the machinefile you have, all 16 ranks will run on n01. For more flexibility, I would use a hostfile instead.
And run with
[plain]mpirun -n <nranks> -ppn <ranks per node> -f hostfile ./main.out[/plain]
This will run a total of <nranks> ranks, with <ranks per node> ranks placed on each of the nodes. So, if I wanted to run 16 ranks, with 4 per node, that would be
[plain]mpirun -n 16 -ppn 4 -f hostfile ./main.out[/plain]
This gives more flexibility in process placement. The article I linked shows several other options, and I'll add more information about the hostfile capability.
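A minimal sketch of such a hostfile, using the compute-node names from this thread (c00..c03; substitute the names from your own machinefile):

```shell
# Create a hostfile listing each compute node once; the -ppn option then
# controls how many ranks land on each node at launch time.
printf '%s\n' c00 c01 c02 c03 > hostfile
cat hostfile
# mpirun -n 16 -ppn 4 -f hostfile ./main.out   # 4 ranks on each node
```

Because the per-node rank count lives on the command line rather than in the file, the same hostfile can serve runs of any size.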
Using -iface eth0 sets the interface to be used for launching the ranks, not the communication fabric to be used by MPI.
I_MPI_FABRICS needs to be set wherever you are launching the job. Hydra will read this before launching and use it when launching the ranks.
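To make the distinction concrete, here is a sketched launch line that sets both (hostfile contents and binary name are placeholders from earlier in the thread): -iface picks the interface Hydra uses to reach the nodes, while I_MPI_FABRICS picks the fabric the ranks use for MPI traffic. The command is echoed here as a sketch rather than executed:

```shell
# -iface eth0          -> Hydra launches/controls ranks over Ethernet
# -genv I_MPI_FABRICS  -> ranks exchange MPI messages over shm + tmi (IB)
CMD='mpirun -genv I_MPI_FABRICS shm:tmi -n 16 -ppn 4 -f hostfile -iface eth0 ./main.out'
echo "$CMD"   # run the command directly on the real cluster
```

The -genv option passes the environment variable to all ranks, equivalent to exporting it on the launch node beforehand.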
Following up on this thread, referencing your article http://software.intel.com/en-us/articles/controlling-process-placement-with-the-intel-mpi-library and the Intel MPI 5.0 Linux Reference Manual: there are three ways to configure an MPMD launch: -hostfile, -machinefile, and -configfile.
What is the difference between -hostfile and -machinefile?
Can we use per-rank HCA binding in a -machinefile or -hostfile (as can be done with a single HCA in an MVAPICH mpihydra hostfile)? We want to control the HCA per rank, possibly with multiple HCAs per rank.
I believe a machinefile/hostfile and a configfile can be used in a single launch command. I would very much appreciate references with detailed examples and explanations of how these three can be used interchangeably.