Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Oversubscribing resources with specific MPI ranks

Schau__Kyle
Beginner

I am looking for a way to specify which individual MPI ranks run on specific compute resources (for this case, let's just use a physical CPU).

I have a workload with very imbalanced MPI ranks, i.e. ranks 0,1,2 are computationally light while ranks 3,4,5 are very expensive.

 

So I would like to oversubscribe a CPU: place ranks 0, 1, 2 together on one CPU, and give ranks 3, 4, 5 each their own CPU.

However, I cannot find anything in the machine file documentation that supports this; it only covers simple things like how many processes go on each host.

 

Thanks for the help.

PrasanthD_intel
Moderator

Hi Kyle,

As far as I understand, you want to run 3 ranks on one host and the remaining 3 ranks on 3 other hosts.

You can do that by using argument sets:

mpirun -n 3 -hosts node1 ./test : -n 1 -hosts node2 ./test : -n 1 -hosts node3 ./test : -n 1 -hosts node4 ./test

Or you can assign a specific number of processes to particular nodes directly in the machine file.

For a detailed explanation, you can refer to the Intel MPI Library Developer Guide.

 

Thanks

Prasanth

 

Schau__Kyle
Beginner

Hi Prasanth

 

That would work for this toy problem, but I am looking for a much more general solution, one in which an arbitrary number of arbitrary ranks can be assigned to a single CPU, something like:

 

Ranks 345,2,93,83 Assigned to Host X: CPU 1

Ranks 24,321 Assigned to Host Z: CPU 12

.....

And so on. This would presumably be done with a machinefile, but I cannot determine whether it is possible to be that specific.

 

Thanks

 

Kyle

PrasanthD_intel
Moderator

Hi Kyle,

You may consider using a hostfile, which is scalable; there you can specify how many processes each node should run, in the following format:

node1:3

node2:1

node3:1

node4:1
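
To launch against such a file, you pass it to mpirun. A sketch of what that could look like, assuming the lines above are saved as ./hosts (the exact option, -machinefile or -f/-hostfile, is listed under the Hydra options in the Developer Reference):

mpirun -machinefile ./hosts -n 6 ./test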

Also, we think the pinning you are trying to achieve is not going to be very performant. You might be better off resolving the imbalance in your application.

If instead you were asking how to run certain ranks on certain CPUs, like the rankfile in Open MPI: the concept of a rankfile doesn't exist in the Intel MPI Library.

 

Regards

Prasanth

Schau__Kyle
Beginner

Hi Prasanth

 

Isn't process pinning, such as this:

https://software.intel.com/en-us/mpi-developer-reference-linux-process-pinning

or even the Hydra process management options:

https://software.intel.com/en-us/mpi-developer-reference-windows-hydra-options

exactly how one would achieve this within a hostfile/machinefile? My issue is that the documentation is not very explicit.

jimdempseyatthecove
Honored Contributor III

Is each rank single-threaded or multi-threaded (e.g., OpenMP)?

Multi-threaded will (may) complicate what you need to do.

Note: a "single-threaded" MPI process that uses the parallel MKL library is actually a multi-threaded process.

Jim Dempsey

 

Schau__Kyle
Beginner

Hey Jim,

Each rank is single threaded. 

jimdempseyatthecove
Honored Contributor III

Do you have exclusive use of the cluster?
Or is the cluster a shared resource?

If (when) the cluster is shared, you might not want to hard-code the ranks to nodes. Doing so may oversubscribe a node with several applications.

There are job submission scripts that aid in dynamically distributing an application based on available resources. One such utility is PBS:

https://www.pbsworks.com/pdfs/PBSUserGuide18.2.pdf

Jim Dempsey

 

Schau__Kyle
Beginner

Yes we have exclusive access to the cluster.

 

We want to pin to processors because we know the computational load of each MPI process before run time. This lets us systematically group less expensive MPI processes together so that each group approaches the size of the largest MPI process. Since all MPI processes proceed in lockstep, we are trying to avoid light MPI processes sitting idle while waiting to sync with heavy MPI processes.

So pinning to a processor is important to keep these group sizes similar and stop oversubscribed processes from wandering within a node.

 

Kyle

PrasanthD_intel
Moderator

Hi Kyle,

We would like to know some details to understand your problem in depth.

When you say an MPI process is expensive, do you mean memory-intensive or compute-intensive?

In Intel MPI we can control rank-to-processor mapping on a particular node, but across nodes we have no such control.

This can be achieved using process pinning. For detailed explanations and examples, refer to the Developer Reference (page 70).
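
As a minimal sketch of what that looks like on a single node (host name, core numbers, and binary name are placeholders):

I_MPI_PIN_PROCESSOR_LIST=0,0,0,1 mpirun -n 4 -host node1 ./test

Here the first three ranks launched on node1 are pinned to logical processor 0 and the fourth rank to logical processor 1; with an explicit list, the i-th process on the node is pinned to the i-th entry.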

 

Can you give an example describing what you want to achieve so we can better understand your problem?

 

Thanks

Prasanth

Schau__Kyle
Beginner

Hi Prasanth

 

Yes, I will try to be more explicit.

 

For our situation, we have some number of MPI processes, let's say 100 for example. We know ahead of time that MPI rank 55 will be the most expensive computationally. Our code is a fluid dynamics code that marches forward in time, with each MPI process responsible for computing a portion of the geometric domain in the simulation.

This is how we know the expense of each MPI process: the larger the geometric domain an MPI rank is responsible for, the more computations it has to churn through. However, after a single march forward in time, all ranks must sync up and communicate. Therefore, when there is a large disparity between the MOST expensive and LEAST expensive (computationally) MPI ranks, there is a lot of idle time for the smaller MPI ranks as they wait for the larger ones to finish their crunching.

 

Since we know exactly which MPI processes will be small, and can even quantify how small they are, we want to pin these small MPI ranks together to the SAME compute unit, resulting in a more balanced distribution among the CPUs, a reduced number of required CPUs (saving on cluster allocation cost), etc.

In an ideal world we could do this with arbitrary ranks. Let's say ranks 12, 19, and 99 combine to be the same size as our largest example rank, 55; we would then specify something like:

ranks 12,19,99  ----> share CPU 1 on host 4

rank 55 ---------> get your own CPU 2 on host 4

And so on until all 100 of our example MPI ranks are told where to go.

We do have the ability to reorder the MPI ranks in any order we want, such that the computational expense of each MPI rank increases monotonically, i.e. the least computationally expensive rank is rank 0 and the most expensive is rank 99. This has helped us use this scheme with other MPI implementations, and it works quite well.

 

 

jimdempseyatthecove
Honored Contributor III

Consider introducing a rank translation table:

int vrank[nRanks];  /* vrank[i] = the MPI rank placed at logical slot i */

Then use process pinning (https://software.intel.com/en-us/mpi-developer-reference-linux-process-pinning) to lay out the number of ranks per socket per host.

Then populate the vrank table with the rank numbers you wish to place there.

Then vrank[55] represents whatever rank falls on socket 2 of host 4.

Note: as a hack to avoid process pinning, with vrank[] you could oversubscribe the ranks: use 6 processes per node (3 per socket), and then of the 3 that land on host 4, socket 2, only 1 is used. IOW, you have 100 working ranks and 2 (or more) idle ranks.

Jim Dempsey

PrasanthD_intel
Moderator

Hi Kyle,

1.    Do you use adaptive mesh refinement in your code? We would like to understand if re-meshing happens across timesteps.
2.    Is the cost data known before starting the MPI run?

Actually, in any case, based on your requirements, we recommend using a library like ParMETIS (http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview), which does scalable domain decomposition in an optimal manner, far better than anything we can achieve with manual pinning settings. With libraries like ParMETIS it is possible to decompose a given domain in a way that minimizes the interface nodes while also creating rank-specific, load-balanced domain partitions.

Regards

Prasanth

Schau__Kyle
Beginner

Prasanth

 

We do not use adaptive mesh refinement.

Yes, the cost is known via a preprocessing step; no MPI run is required.

 

ParMETIS is used for unstructured solvers, in which the domain can begin as a contiguous set and be arbitrarily decomposed. Our solver is a structured solver that is beholden to the block topology required by the geometry. There are ways to implement this balancing programmatically on our end, but it represents a massive undertaking. We also do not come across these scenarios very often in our research, hence our search for a simple solution via MPI to get through the research projects that do require it.

PrasanthD_intel
Moderator

Hi Kyle,

We built on top of your scenario. You mentioned the following kind of pinning and node distribution:
ranks 12,19,99  ----> share CPU 1 on host 4
rank 55 ---------> get your own CPU 2 on host 4
We completed the rest arbitrarily in the following manner, so as to demonstrate an example command line. Total ranks assumed: 100.

Node name | Ranks                                                                         | Physical cores | Ranks per node
S001-n057 | 0-11 (default core pinning)                                                   | 24             | 12 (under-subscribed)
S001-n015 | 13-18, 20-37 (default core pinning)                                           | 24             | 24 (fully subscribed)
S001-n058 | 38-54, 56-74 (default core pinning)                                           | 24             | 36 (over-subscribed)
S001-n016 | core 0: 12, 19, 99; core 1: 55; cores 2-23: 75-99 (user-defined core pinning) | 24             | 28 (mildly over-subscribed)

 

We use argument sets to provide the user-defined rank-to-node mapping, and within a single node we achieve the desired rank-to-core mapping using I_MPI_PIN_PROCESSOR_LIST. As per your request, ranks 12, 19, and 99 are pinned to core 0, rank 55 is pinned to core 1, and cores 2 to 23 are assigned in a round-robin fashion to ranks 75 to 99.

The following command line achieves exactly the above rank-to-core and rank-to-node mapping/pinning:
I_MPI_DEBUG=5 mpiexec.hydra -n 12 -host s001-n057 ./test : -n 1 -env I_MPI_PIN_PROCESSOR_LIST=0,0,1,2-23,2-3,0 -host s001-n016 ./test : -n 6 -host s001-n015 ./test : -n 1 -host s001-n016 ./test : -n 18 -host s001-n015 ./test : -n 17 -host s001-n058 ./test : -n 1 -host s001-n016 ./test : -n 19 -host s001-n058 ./test : -n 25 -host s001-n016 ./test

With the debug level 5 setting (I_MPI_DEBUG=5), you can confirm the core and node pinning; see below:
[0] MPI startup(): libfabric version: 1.10.0a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       28455    s001-n057  {0,12}
[0] MPI startup(): 1       28456    s001-n057  {1,13}
[0] MPI startup(): 2       28457    s001-n057  {2,14}
[0] MPI startup(): 3       28458    s001-n057  {3,15}
[0] MPI startup(): 4       28459    s001-n057  {4,16}
[0] MPI startup(): 5       28460    s001-n057  {5,17}
[0] MPI startup(): 6       28461    s001-n057  {6,18}
[0] MPI startup(): 7       28462    s001-n057  {7,19}
[0] MPI startup(): 8       28463    s001-n057  {8,20}
[0] MPI startup(): 9       28464    s001-n057  {9,21}
[0] MPI startup(): 10      28465    s001-n057  {10,22}
[0] MPI startup(): 11      28466    s001-n057  {11,23}
[0] MPI startup(): 12      16797    s001-n016  {0}
[0] MPI startup(): 13      19425    s001-n015  {0}
[0] MPI startup(): 14      19426    s001-n015  {12}
[0] MPI startup(): 15      19427    s001-n015  {1}
[0] MPI startup(): 16      19428    s001-n015  {13}
[0] MPI startup(): 17      19429    s001-n015  {2}
[0] MPI startup(): 18      19430    s001-n015  {14}
[0] MPI startup(): 19      16798    s001-n016  {0}
[0] MPI startup(): 20      19431    s001-n015  {3}
[0] MPI startup(): 21      19432    s001-n015  {15}
[0] MPI startup(): 22      19433    s001-n015  {4}
[0] MPI startup(): 23      19434    s001-n015  {16}
[0] MPI startup(): 24      19435    s001-n015  {5}
[0] MPI startup(): 25      19436    s001-n015  {17}
[0] MPI startup(): 26      19437    s001-n015  {6}
[0] MPI startup(): 27      19438    s001-n015  {18}
[0] MPI startup(): 28      19439    s001-n015  {7}
[0] MPI startup(): 29      19440    s001-n015  {19}
[0] MPI startup(): 30      19441    s001-n015  {8}
[0] MPI startup(): 31      19442    s001-n015  {20}
[0] MPI startup(): 32      19443    s001-n015  {9}
[0] MPI startup(): 33      19444    s001-n015  {21}
[0] MPI startup(): 34      19445    s001-n015  {10}
[0] MPI startup(): 35      19446    s001-n015  {22}
[0] MPI startup(): 36      19447    s001-n015  {11}
[0] MPI startup(): 37      19448    s001-n015  {23}
[0] MPI startup(): 38      18619    s001-n058  {0}
[0] MPI startup(): 39      18620    s001-n058  {12}
[0] MPI startup(): 40      18621    s001-n058  {1}
[0] MPI startup(): 41      18622    s001-n058  {13}
[0] MPI startup(): 42      18623    s001-n058  {2}
[0] MPI startup(): 43      18624    s001-n058  {14}
[0] MPI startup(): 44      18625    s001-n058  {3}
[0] MPI startup(): 45      18626    s001-n058  {15}
[0] MPI startup(): 46      18627    s001-n058  {4}
[0] MPI startup(): 47      18628    s001-n058  {16}
[0] MPI startup(): 48      18629    s001-n058  {5}
[0] MPI startup(): 49      18630    s001-n058  {17}
[0] MPI startup(): 50      18631    s001-n058  {6}
[0] MPI startup(): 51      18632    s001-n058  {18}
[0] MPI startup(): 52      18633    s001-n058  {7}
[0] MPI startup(): 53      18634    s001-n058  {19}
[0] MPI startup(): 54      18635    s001-n058  {8}
[0] MPI startup(): 55      16799    s001-n016  {1}
[0] MPI startup(): 56      18636    s001-n058  {20}
[0] MPI startup(): 57      18637    s001-n058  {9}
[0] MPI startup(): 58      18638    s001-n058  {21}
[0] MPI startup(): 59      18639    s001-n058  {10}
[0] MPI startup(): 60      18640    s001-n058  {22}
[0] MPI startup(): 61      18641    s001-n058  {11}
[0] MPI startup(): 62      18642    s001-n058  {23}
[0] MPI startup(): 63      18643    s001-n058  {0}
[0] MPI startup(): 64      18644    s001-n058  {12}
[0] MPI startup(): 65      18645    s001-n058  {1}
[0] MPI startup(): 66      18646    s001-n058  {13}
[0] MPI startup(): 67      18647    s001-n058  {2}
[0] MPI startup(): 68      18648    s001-n058  {14}
[0] MPI startup(): 69      18649    s001-n058  {3}
[0] MPI startup(): 70      18650    s001-n058  {15}
[0] MPI startup(): 71      18651    s001-n058  {4}
[0] MPI startup(): 72      18652    s001-n058  {16}
[0] MPI startup(): 73      18653    s001-n058  {5}
[0] MPI startup(): 74      18654    s001-n058  {17}
[0] MPI startup(): 75      16800    s001-n016  {2}
[0] MPI startup(): 76      16801    s001-n016  {3}
[0] MPI startup(): 77      16802    s001-n016  {4}
[0] MPI startup(): 78      16803    s001-n016  {5}
[0] MPI startup(): 79      16804    s001-n016  {6}
[0] MPI startup(): 80      16805    s001-n016  {7}
[0] MPI startup(): 81      16806    s001-n016  {8}
[0] MPI startup(): 82      16807    s001-n016  {9}
[0] MPI startup(): 83      16808    s001-n016  {10}
[0] MPI startup(): 84      16809    s001-n016  {11}
[0] MPI startup(): 85      16810    s001-n016  {12}
[0] MPI startup(): 86      16811    s001-n016  {13}
[0] MPI startup(): 87      16812    s001-n016  {14}
[0] MPI startup(): 88      16813    s001-n016  {15}
[0] MPI startup(): 89      16814    s001-n016  {16}
[0] MPI startup(): 90      16815    s001-n016  {17}
[0] MPI startup(): 91      16816    s001-n016  {18}
[0] MPI startup(): 92      16817    s001-n016  {19}
[0] MPI startup(): 93      16818    s001-n016  {20}
[0] MPI startup(): 94      16819    s001-n016  {21}
[0] MPI startup(): 95      16820    s001-n016  {22}
[0] MPI startup(): 96      16821    s001-n016  {23}
[0] MPI startup(): 97      16822    s001-n016  {2}
[0] MPI startup(): 98      16823    s001-n016  {3}
[0] MPI startup(): 99      16824    s001-n016  {0}
[0] MPI startup(): I_MPI_ROOT=/glob/development-tools/versions/oneapi/beta05/inteloneapi/mpi/2021.1-beta05
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=5

It took us a few iterations to prepare this command line, but it works. I felt that the use of I_MPI_PIN_PROCESSOR_LIST in the context of argument sets is non-intuitive: the first usage for a certain node also affects the remaining sub-commands for that node. That is, once we set process pinning for node s001-n016 in its first argument set, the same pinning applies to every subsequent argument set for that node.
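
A reduced sketch of that behavior, reusing the host name from the example above (the rank counts and pinning list here are only illustrative): the second argument set for s001-n016 inherits the pinning list from the first, even though -env is not repeated.

mpiexec.hydra -n 3 -env I_MPI_PIN_PROCESSOR_LIST=0,0,1 -host s001-n016 ./test : -n 1 -host s001-n016 ./test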

Let me know if you have any questions.

Regards

Prasanth

PrasanthD_intel
Moderator

Hi Kyle,

Could you please confirm whether the given command works for you?

Otherwise, you can reach out to us.

 

Prasanth

 

PrasanthD_intel
Moderator

Hi Kyle,

We are closing this thread, considering your issue resolved with the given solution.

Please raise a new thread for any further queries.

 

- Prasanth
