Oversubscribing resources with specific MPI ranks

Schau__Kyle · ‎04-10-2020

I am looking for a way to specify which individual MPI ranks run on specific compute resources (for this case, lets just use a physical CPU)

I have a workload with very imbalanced MPI ranks, i.e. ranks 0,1,2 are computationally light while ranks 3,4,5 are very expensive.

So I would like to oversubscribe a cpu and place ranks 0,1,2 on one cpu, and ranks 3,4,5 their own cpus.

However I cannot find anything in the documentation of the machine file that is capable of this, just specifying simple things like how many processes go on a host, etc...

thanks for the help

PrasanthD_intel · ‎04-13-2020

Hi Kyle,

As far as I understood you want to run 3 ranks on one host and the remaining 3 ranks on other 3 different hosts.

You can do that by using Argument Sets

mpirun -n 3 -hosts node1 ./test : -n 1 -hosts node2 ./test : -n 1 -hosts node3 ./test :-n 1 -hosts node4 ./test

or you can assign a specific number of processes to particular nodes directly in the machine file.

For detailed explanation, you can refer to MPI Library Developer Guide

Thanks

Prasanth

Schau__Kyle · ‎04-13-2020

Hi Prasanth

That would work for this toy problem. But I am looking for a much more general solution. One in which an arbitrary number of arbitrary ranks are assigned to a single CPU, something like

Ranks 345,2,93,83 Assigned to Host X: CPU 1

Ranks 24,321 Assigned to Host Z: CPU 12

.....

And so on. Certainly done with a machinefile, but I cannot determine whether it is possible to be that specific.

Thanks

Kyle

PrasanthD_intel · ‎04-16-2020

Hi Kyle,

You may consider using the hostfile, which is scalable, where you can specify how many processes each node should run in the following format,

node1:3

node2:1

node3:1

node4:1

Also, we think that the pinning you are trying to achieve is not going to be very performant. It might be better off resolving the imbalance in your application.

Else if you were asking is how to run certain ranks on certain CPUs like rankfile in OpenMPI. In Intel MPI Library the concept of rankfile doesn’t exist.

Regards

Prasanth

Schau__Kyle · ‎04-17-2020

Hi Prasanth

Isn't process pinning such as this

https://software.intel.com/en-us/mpi-developer-reference-linux-process-pinning

Or even the hydra process managment

https://software.intel.com/en-us/mpi-developer-reference-windows-hydra-options

exactly how one would achieve this within a hostfile/machinefile? My issue is with the documentation it is not very explicit.

jimdempseyatthecove · ‎04-19-2020

Are each rank single threaded, or multi-threaded (e.g. OpenMP)?

Multi-threaded will (may) complicate what you need to do.

Note, a "single threaded" MPI process using the parallel MKL library is a multi-threaded process.

Jim Dempsey

Schau__Kyle · ‎04-19-2020

Hey Jim,

Each rank is single threaded.

jimdempseyatthecove · ‎04-20-2020

Do you have exclusive use of the cluster?
Or is the cluster a shared resource?

If (when) the cluster is shared, you might not want to hard code the ranks to nodes. Doing so may oversubscribe a node with several applications.

There are job submission scripts that aid in dynamically distributing an application based on available resources. One such utility is PBS:

https://www.pbsworks.com/pdfs/PBSUserGuide18.2.pdf

Jim Dempsey

Schau__Kyle · ‎04-20-2020

Yes we have exclusive access to the cluster.

We want to pin to processors as we know the computational load of each MPI process before run time, so we want to systematically group less expensive MPI processes together such that the group approaches the size of the largest MPI processes, as all MPI process proceed in sequence, we are looking to avoid light MPI processes sitting idle while waiting to sync with heavy MPI processes.

So pinning to a processor is important to keep these group sizes similar and stop oversubscribed processes from wandering within a node.

Kyle

PrasanthD_intel · ‎04-22-2020

Hi Kyle,

We would like to know some details to understand your problem in depth.

When you say expensive MPI process does it mean memory intensive or compute-intensive.

In intel MPI we can control rank to processor mapping on a particular node but for across nodes, we have no control.

This can be achieved using Process Pinning. For detailed explanations and examples refer to Developer reference Page70.

Can you give an example describing what you want to achieve so we can better understand your problem?

Thanks

Prasanth

Schau__Kyle · ‎04-22-2020

Hi Prasanth

Yes I will try to be more explicit.

For our situation, we have some number of MPI processes, lets say for example 100. We know ahead of time that MPI rank 55 will be the most expensive computationally. Our code is a fluid dynamics code that marches forward in time, with each MPI process is responsible for computing a portion of the geometric domain in the simulation.

This is how we know the expense of each MPI process, the larger the geometric domain each MPI rank is responsible for, the more computations it has to churn over. However, after a single march forward in time, all ranks must sync up and communicate. Therefore, when there is a large disparity between the MOST expensive and LEAST expensive (computationally) MPI ranks, there is a lot of idle time for the smaller MPI ranks as they wait for larger ones to finish their crunching.

Since we know exactly which MPI processes will be small, and even quantify how small they are, we want to pin these small MPI ranks together to the SAME compute unit, thus resulting in a more balanced distribution among the CPUs, a reduced number of required CPUs (saving on cluster allocation cost), etc.

In an ideal world we could do this with arbitrary ranks, lets say the sum of ranks 12,19, and 99 all combine to be the same size as our largest example rank 55, we would specify something like:

ranks 12,19,99 ----> share CPU 1 on host 4

rank 55 ---------> get your own CPU 2 on host 4

And so on until all 100 of our example MPI ranks are told where to go.

We do have the ability to reorder the MPI ranks in any order we want, such that we could make the computational expense of each MPI rank increase monotonically, i.e. the least computationally expensive rank is rank 0, the most expensive rank is 99... this has helped us utilize this scheme with other MPI implementations and works quite well.

jimdempseyatthecove · ‎04-23-2020

Consider introducing a rank translation table:

int vrank[nRanks];

Then use the process pinning (https://software.intel.com/en-us/mpi-developer-reference-linux-process-pinning) to layout the number of ranks per socket per host.

Then populate the vrank table with the rank numbers you wish to place there.

Then vrank[55] represents whatever rank falls on socket 2 of host 4.

Note, as a hack to avoid process pinning, using vrank[] you could oversubscribe the ranks, use 6 processes per node (3 per socket), then of the 3 that lay on host 4 socket 2 only 1 is used. IOW you have 100 working ranks and and 2 (or more) idle ranks.

Jim Dempsey

PrasanthD_intel · ‎04-28-2020

Hi Kyle,

1. Do you use adaptive mesh refinement in your code? We would like to understand if re-meshing happens across timesteps.
2. Is the cost data known before starting the MPI run?

Actually, in any case, based on your requirements, we recommend using a library like ParMETIS (http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview) which does scalable domain decomposition in an optimal manner, far better than anything we can achieve with manual pinning settings. With libraries like ParMETIS it is possible to decompose a given domain in a manner that minimizes the interface nodes and at the same time also creates rank specific load balanced domain partitions.

Regards

Prasanth

Schau__Kyle · ‎04-28-2020

Prasanth

We do not us adaptive mesh refinement.

Yes the cost is known via a preprocessing step... no MPI run is required.

And ParMETIS is utilized for unstructured solvers in which the domain can begin as a contiguous set and arbitrarily decomposed. Our solver is a structured solver that is beholden to the block topology required by the geometry. There are ways to implement this balancing programically on our end, but it represents a massive undertaking. We also do not come across these types of scenarios with the research we do a lot, thus our search for a simple solution via MPI to get through the research projects that do require it.

PrasanthD_intel · ‎05-04-2020

Hi Kyle,

We built on top of your scenario. As you have mentioned the following kind of pinning and node distribution,
ranks 12,19,99 ----> share CPU 1 on host 4
rank 55 ---------> get your own CPU 2 on host 4
completing the rest arbitrarily in the following manner, so as to demonstrate an example command line. Total ranks assumed = 100.

Node name	Ranks	                                           Physical cores	Ranks per node
S001-n057	0-11 (default core pinning)	                           24	          12 (under-subscribed)
S001-n015	13-18, 20-37 (default core pinning)                    24	          24 (fully-subscribed)
S001-n058	38-54, 56-74 (default core pinning)	                   24	          36 (over-subscribed)
S001-n016	C0-12,19,99 ; C1-55 ; C2 to C23-75-99 
             (user defined core pinning)	                       24	          28 (mildly over-subscribed)

We will be using argument sets to provision user defined rank to node mapping. And within a single node, we will achieve the desired rank to core mapping using I_MPI_PIN_PROCESSOR_LIST. As per the your request I have pinned ranks 12,19 and 99 to core 0, rank 55 to core 1 and core 2 to 23 have been assigned in a round robin fashion to ranks 75 to 99.

Following command line exactly achieves the above rank to core and rank to node mapping/pinning,
I_MPI_DEBUG=5 mpiexec.hydra -n 12 -host s001-n057 ./test : -n 1 -env I_MPI_PIN_PROCESSOR_LIST=0,0,1,2-23,2-3,0 -host s001-n016 ./test : -n 6 -host s001-n015 ./test : -n 1 -host s001-n016 ./test : -n 18 -host s001-n015 ./test : -n 17 -host s001-n058 ./test : -n 1 -host s001-n016 ./test : -n 19 -host s001-n058 ./test : -n 25 -host s001-n016 ./test

With the debug level 5 setting, one can confirm the core and node pinning. See below,
[0] MPI startup(): libfabric version: 1.10.0a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 28455 s001-n057 {0,12}
[0] MPI startup(): 1 28456 s001-n057 {1,13}
[0] MPI startup(): 2 28457 s001-n057 {2,14}
[0] MPI startup(): 3 28458 s001-n057 {3,15}
[0] MPI startup(): 4 28459 s001-n057 {4,16}
[0] MPI startup(): 5 28460 s001-n057 {5,17}
[0] MPI startup(): 6 28461 s001-n057 {6,18}
[0] MPI startup(): 7 28462 s001-n057 {7,19}
[0] MPI startup(): 8 28463 s001-n057 {8,20}
[0] MPI startup(): 9 28464 s001-n057 {9,21}
[0] MPI startup(): 10 28465 s001-n057 {10,22}
[0] MPI startup(): 11 28466 s001-n057 {11,23}
[0] MPI startup(): 12 16797 s001-n016 {0}
[0] MPI startup(): 13 19425 s001-n015 {0}
[0] MPI startup(): 14 19426 s001-n015 {12}
[0] MPI startup(): 15 19427 s001-n015 {1}
[0] MPI startup(): 16 19428 s001-n015 {13}
[0] MPI startup(): 17 19429 s001-n015 {2}
[0] MPI startup(): 18 19430 s001-n015 {14}
[0] MPI startup(): 19 16798 s001-n016 {0}
[0] MPI startup(): 20 19431 s001-n015 {3}
[0] MPI startup(): 21 19432 s001-n015 {15}
[0] MPI startup(): 22 19433 s001-n015 {4}
[0] MPI startup(): 23 19434 s001-n015 {16}
[0] MPI startup(): 24 19435 s001-n015 {5}
[0] MPI startup(): 25 19436 s001-n015 {17}
[0] MPI startup(): 26 19437 s001-n015 {6}
[0] MPI startup(): 27 19438 s001-n015 {18}
[0] MPI startup(): 28 19439 s001-n015 {7}
[0] MPI startup(): 29 19440 s001-n015 {19}
[0] MPI startup(): 30 19441 s001-n015 {8}
[0] MPI startup(): 31 19442 s001-n015 {20}
[0] MPI startup(): 32 19443 s001-n015 {9}
[0] MPI startup(): 33 19444 s001-n015 {21}
[0] MPI startup(): 34 19445 s001-n015 {10}
[0] MPI startup(): 35 19446 s001-n015 {22}
[0] MPI startup(): 36 19447 s001-n015 {11}
[0] MPI startup(): 37 19448 s001-n015 {23}
[0] MPI startup(): 38 18619 s001-n058 {0}
[0] MPI startup(): 39 18620 s001-n058 {12}
[0] MPI startup(): 40 18621 s001-n058 {1}
[0] MPI startup(): 41 18622 s001-n058 {13}
[0] MPI startup(): 42 18623 s001-n058 {2}
[0] MPI startup(): 43 18624 s001-n058 {14}
[0] MPI startup(): 44 18625 s001-n058 {3}
[0] MPI startup(): 45 18626 s001-n058 {15}
[0] MPI startup(): 46 18627 s001-n058 {4}
[0] MPI startup(): 47 18628 s001-n058 {16}
[0] MPI startup(): 48 18629 s001-n058 {5}
[0] MPI startup(): 49 18630 s001-n058 {17}
[0] MPI startup(): 50 18631 s001-n058 {6}
[0] MPI startup(): 51 18632 s001-n058 {18}
[0] MPI startup(): 52 18633 s001-n058 {7}
[0] MPI startup(): 53 18634 s001-n058 {19}
[0] MPI startup(): 54 18635 s001-n058 {8}
[0] MPI startup(): 55 16799 s001-n016 {1}
[0] MPI startup(): 56 18636 s001-n058 {20}
[0] MPI startup(): 57 18637 s001-n058 {9}
[0] MPI startup(): 58 18638 s001-n058 {21}
[0] MPI startup(): 59 18639 s001-n058 {10}
[0] MPI startup(): 60 18640 s001-n058 {22}
[0] MPI startup(): 61 18641 s001-n058 {11}
[0] MPI startup(): 62 18642 s001-n058 {23}
[0] MPI startup(): 63 18643 s001-n058 {0}
[0] MPI startup(): 64 18644 s001-n058 {12}
[0] MPI startup(): 65 18645 s001-n058 {1}
[0] MPI startup(): 66 18646 s001-n058 {13}
[0] MPI startup(): 67 18647 s001-n058 {2}
[0] MPI startup(): 68 18648 s001-n058 {14}
[0] MPI startup(): 69 18649 s001-n058 {3}
[0] MPI startup(): 70 18650 s001-n058 {15}
[0] MPI startup(): 71 18651 s001-n058 {4}
[0] MPI startup(): 72 18652 s001-n058 {16}
[0] MPI startup(): 73 18653 s001-n058 {5}
[0] MPI startup(): 74 18654 s001-n058 {17}
[0] MPI startup(): 75 16800 s001-n016 {2}
[0] MPI startup(): 76 16801 s001-n016 {3}
[0] MPI startup(): 77 16802 s001-n016 {4}
[0] MPI startup(): 78 16803 s001-n016 {5}
[0] MPI startup(): 79 16804 s001-n016 {6}
[0] MPI startup(): 80 16805 s001-n016 {7}
[0] MPI startup(): 81 16806 s001-n016 {8}
[0] MPI startup(): 82 16807 s001-n016 {9}
[0] MPI startup(): 83 16808 s001-n016 {10}
[0] MPI startup(): 84 16809 s001-n016 {11}
[0] MPI startup(): 85 16810 s001-n016 {12}
[0] MPI startup(): 86 16811 s001-n016 {13}
[0] MPI startup(): 87 16812 s001-n016 {14}
[0] MPI startup(): 88 16813 s001-n016 {15}
[0] MPI startup(): 89 16814 s001-n016 {16}
[0] MPI startup(): 90 16815 s001-n016 {17}
[0] MPI startup(): 91 16816 s001-n016 {18}
[0] MPI startup(): 92 16817 s001-n016 {19}
[0] MPI startup(): 93 16818 s001-n016 {20}
[0] MPI startup(): 94 16819 s001-n016 {21}
[0] MPI startup(): 95 16820 s001-n016 {22}
[0] MPI startup(): 96 16821 s001-n016 {23}
[0] MPI startup(): 97 16822 s001-n016 {2}
[0] MPI startup(): 98 16823 s001-n016 {3}
[0] MPI startup(): 99 16824 s001-n016 {0}
[0] MPI startup(): I_MPI_ROOT=/glob/development-tools/versions/oneapi/beta05/inteloneapi/mpi/2021.1-beta05
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=5

It took us a few iterations to prepare this command line, but it works . I felt that the use of I_MPI_PIN_PROCESSOR_LIST in the context of argument sets is non-intuitive. The first usage for a certain node has an effect on the remaining sub-commands for that particular node. Which means ,when we set process pinning for node s001-n016 in first usage it follows the same pinning for all the subsequent usage of that particular node.

Let me know if you have any questions.

Regards

Prasanth

PrasanthD_intel · ‎05-13-2020

Hi Kyle,

Could you please confirm whether the given command works for you.

Else you can reach out to us.

Prasanth

PrasanthD_intel · ‎05-21-2020

Hi Kyle,

We are closing this thread considering your issue is resolved with the given solution.

Raise a new thread for any such further queries.

- Prasanth