Girish_Nair · ‎08-12-2013

Hi,

Is it possible to run tasks on compute nodes that have InfiniBand HCAs from a master node that lacks an IB HCA, using Torque or Grid Engine?

Please advise if this is possible.

Intel MPI 4.1.1.036 is installed on all cluster machines.

The network configuration is as follows:

Master Node: (2xXeon E5-2450/96GB/CentOS 6.2/NFS Services over Ethernet) - 1 No.

Compute Nodes: (2xXeon E5-2450/96GB/TrueScale QDR Dual-port QLE7342/CentOS 6.2/NFS Client over GbE) - 4 Nos.

IP Addresses (GbE) : Master Node: 10.0.0.221 - Hostname: mnode; Compute Nodes: 10.0.0.222 .. 225 - Hostnames: c00 .. c03

IP Address (ib0): Master Node: N/A; Compute Nodes: 192.168.10.222 .. 225 - /etc/hosts -> c00-ib; c01-ib; c02-ib; c03-ib

Additionally, if mpiexec.hydra can be used, what command line would run the job directly from the master node, without Torque or Grid Engine?

Regards

Girish Nair <girishnairisonline at gmail dot com>

James_T_Intel · ‎08-14-2013

Hi Girish,

Running outside of the job scheduler will depend on your system's policy. If it is set up to allow it, then there should be no problem using mpiexec.hydra (or mpirun, which will default to mpiexec.hydra) to run. Simply specify your hosts and ranks as you normally would. For some additional information, see the article Controlling Process Placement with the Intel® MPI Library.

If you are going to use InfiniBand* for your job nodes, but are launching from a system without IB, you will need to specify the network interface using either the -iface command line option or the I_MPI_HYDRA_IFACE environment variable. You'll likely want to use eth0, but this can vary depending on your system configuration.

Also, do not use the IB host names to start your job. Hydra will attempt to connect via ssh first, which needs to happen through the standard IP channel. It will handle switching to the IB fabric for your job. If you want to verify that it correctly launched using IB, run with I_MPI_DEBUG=2 to get fabric selection information.
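As a rough sketch of the launch described above, the command could be assembled as follows. It assumes the GbE host names c00 to c03 from the original post, eth0 as the Ethernet interface on the master node, Hydra's -hosts option for a comma-separated host list, and ./main.out as a placeholder binary. The rank count is only illustrative. The script prints the command rather than running it, so it can be checked first:

```shell
#!/bin/sh
# Assumptions: GbE host names c00..c03, eth0 as the launch interface,
# and ./main.out as a placeholder for the MPI binary.
HOSTS="c00,c01,c02,c03"    # Ethernet names, not the -ib names
IFACE="eth0"               # interface Hydra uses to start the ranks
NRANKS=4                   # illustrative rank count

# I_MPI_DEBUG=2 asks the library to report fabric selection at startup.
LAUNCH_CMD="mpirun -n $NRANKS -hosts $HOSTS -iface $IFACE -genv I_MPI_DEBUG 2 ./main.out"

# Print the command for inspection instead of executing it.
echo "$LAUNCH_CMD"
```

If the interface on your master node has a different name, change IFACE accordingly.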

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Girish_Nair · ‎08-15-2013

Hi James,

Thanks for your response. Please correct me if I have misunderstood your statement.

The machinefile would have entries like:

[plain]n01:16 # hostname resolving to eth0 address
n02:16 # hostname resolving to eth0 address
n03:16 # hostname resolving to eth0 address
n04:16 # hostname resolving to eth0 address[/plain]

and, when running from the master node that has no IB hardware:

[plain]mpiexec.hydra -np 16 -machinefile ./machine.cluster -iface ib0 ./main.out[/plain]

I apologize if this is too much to ask. It would be great if an example could be provided.

Additionally, does the same command line accept the following along with the above:

[plain]mpiexec.hydra ... -genv I_MPI_FABRICS shm:tmi ...[/plain]

My understanding is that shm:dapl is the default, and I've found that shm:tmi gives me the best performance over IB. The master node obviously will not have an /etc/dat.conf file, since it lacks an IB HCA.

Thanks in advance for your expert advice.

Regards
Girish Nair

James_T_Intel · ‎08-15-2013

If tmi is always better than DAPL for you, you can set I_MPI_FABRICS=shm:tmi in your environment rather than having to pass it every time. As for launching, unless you have an interface named ib0 on your master node, you'll want to use:

[plain]mpirun -n 16 -machinefile ./machine.cluster -iface eth0 ./main.out[/plain]

The machinefile you have is correct. Now, keep in mind, if you run this job with the machinefile you have, all 16 ranks will run on n01. For more flexibility, I would use a hostfile instead.

[plain]$ cat hostfile
n01
n02
n03
n04[/plain]

And run with

[plain]mpirun -n <nranks> -ppn <ranks per node> -f hostfile ./main.out[/plain]

This will run a total of <nranks> ranks, with <ranks per node> ranks placed on each of the nodes. So, if I wanted to run 16 ranks, with 4 per node, that would be

[plain]mpirun -n 16 -ppn 4 -f hostfile ./main.out[/plain]

This gives more flexibility in process placement. The article I linked shows several other options, and I'll add more information about the hostfile capability.
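The relationship between the hostfile, -ppn and -n can be sketched in a few lines of shell. This is only an illustration: the node names n01 to n04 come from the hostfile above, PPN is an example value, and the hostfile is written to a temporary path chosen here. The script prints the resulting mpirun command rather than running it:

```shell
#!/bin/sh
# Assumptions: node names n01..n04 as in the hostfile above; PPN is an
# illustrative value; the temporary hostfile path is chosen for this sketch.
HOSTFILE="${TMPDIR:-/tmp}/hostfile.$$"
PPN=4

# Write one host name per line.
printf '%s\n' n01 n02 n03 n04 > "$HOSTFILE"

# Total ranks = number of hosts x ranks per node.
NHOSTS=$(grep -c . "$HOSTFILE")
NRANKS=$((NHOSTS * PPN))

# Print the command for inspection instead of executing it.
echo "mpirun -n $NRANKS -ppn $PPN -f $HOSTFILE ./main.out"
```

With four hosts and PPN=4 this yields -n 16, matching the example above.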

Girish_Nair · ‎08-15-2013

Hi James,

Ah, that was a quick response from you. Thanks. Please read my mpiexec.hydra command as -np 64 and not -np 16.

Two quick queries:

a) If -iface eth0 is used, would the job run over IB on the compute nodes?

b) Can the environment variable I_MPI_FABRICS be set on the master node, which lacks IB HCA hardware? If not, should it be set on all compute nodes with an IB HCA?

Thanks
Girish Nair

James_T_Intel · ‎08-15-2013

Using -iface eth0 sets the interface to be used for launching the ranks, not the communication fabric to be used by MPI.

I_MPI_FABRICS needs to be set wherever you are launching the job. Hydra will read this before launching and use it when launching the ranks.
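A minimal sketch of the two points above, assuming shm:tmi is the preferred fabric (as reported earlier in the thread), eth0 is the launch interface, and the hostfile and ./main.out are placeholders. The variable is exported in the shell that launches the job, and the launch command is only printed:

```shell
#!/bin/sh
# Assumptions: shm:tmi is the preferred fabric, eth0 is the launch
# interface, and hostfile and ./main.out are placeholders.

# Set the fabric in the environment of the shell that starts the job
# (the master node here), so Hydra can read it before launching.
export I_MPI_FABRICS=shm:tmi

# -iface only selects the interface used to start the ranks.
LAUNCH_CMD="mpirun -n 16 -ppn 4 -f hostfile -iface eth0 ./main.out"

echo "I_MPI_FABRICS=$I_MPI_FABRICS"
echo "$LAUNCH_CMD"
```

Passing -genv I_MPI_FABRICS shm:tmi on the command line, as in the earlier question, is the per-run alternative to exporting the variable.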

Girish_Nair · ‎08-15-2013

Thanks a ton James. You've effectively cleared all my doubts on this. I'll wait for your notes on hostfile capability whenever you publish it. Thanks once again. ~Girish Nair

James_T_Intel · ‎08-16-2013

The article was updated yesterday. If you can't see the updates, please let me know.

Girish_Nair · ‎08-16-2013

Thank you very much James.

Mous_Tatarkhanov · ‎03-10-2015

Hi James,

Following up on this thread, and referring to your article http://software.intel.com/en-us/articles/controlling-process-placement-with-the-intel-mpi-library and the Intel MPI 5.0 Linux Reference Manual, there are three ways to configure an MPMD cluster launch: -hostfile, -machinefile, -configfile.

What is the difference between -hostfile and -machinefile?

Can we use per-rank HCA binding in -machinefile or -hostfile (as can be done with a single HCA in the MVAPICH mpihydra hostfile)? We want to control the HCA per rank, possibly with multiple HCAs per rank.

I believe machinefile/hostfile and configfile can be used in a single launch command. I would very much appreciate some references with detailed examples and explanations of how all three can be used interchangeably.

Thanks,
Mous.

Running parallel job on compute nodes with IB HCA from a master node "NOT" having IB HCA