Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Multiple MPI Jobs On A Single Node

hiewnh
Beginner

I have a cluster of 8-socket quad-core systems running Red Hat 5.2. Whenever I run multiple MPI jobs on a single node, all the jobs end up on the same processors. For example, if I submit four 8-way jobs to a single box, they all land on CPUs 0 to 7, leaving CPUs 8 to 31 idle.

I then tried all sorts of I_MPI_PIN_PROCESSOR_LIST combinations, but short of explicitly listing the processors for each run, the jobs still all end up on CPUs 0-7. Browsing through the mpiexec script, I realise that it is doing a taskset on each run.
As my jobs are all submitted through a scheduler (PBS in this case), I cannot know at job submission time which CPUs are in use. Is there a simple way to tell mpiexec to set the taskset affinity correctly on each run so that it picks only the idle processors?
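
For illustration, the explicit per-run pinning I am trying to avoid looks roughly like this (the core ranges are picked by hand and the application name is just a placeholder):

# first 8-way job, pinned by hand to cores 0-7
mpiexec -genv I_MPI_PIN_PROCESSOR_LIST 0-7 -n 8 ./my_app
# second 8-way job has to be given a different, non-overlapping range by hand
mpiexec -genv I_MPI_PIN_PROCESSOR_LIST 8-15 -n 8 ./my_app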
Thanks.

3 Replies
draceswbell_net
Beginner

Use the "-genv I_MPI_PIN disable" to resolve the immediate problem of multiple jobs pinning to the same cores. We use SGE at our site, but the root issue remains the same.The interaction between the scheduler and MPI might need a little better definitionwith current systems. If performance is a big issue, then you might want to consider only allocating using all of the cores on a compute node.

At the very least, you probably should make this a site wide default if your scheduler will keep assigning partial nodes to jobs.
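
To make that concrete (the executable name, rank count, and profile-script path below are placeholders), a per-job run and a site-wide default might look like:

# per job: turn off Intel MPI pinning and let the Linux scheduler place the ranks
mpiexec -genv I_MPI_PIN disable -n 8 ./my_app

# site-wide: export the variable from a shared profile script, e.g. /etc/profile.d/intelmpi.sh
export I_MPI_PIN=disable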
Gergana_S_Intel
Employee
Hi,

Certainly, disabling process pinning altogether (by setting I_MPI_PIN=off) is a viable option.

Another workaround we recommend is to let Intel MPI Library define processor domains for your system but let the OS take over in pinning to the available "free" cores. To do so, simply set I_MPI_PIN_DOMAIN=auto. You can either do this for all jobs on the node, or only for each subsequent job (job 1 will still be pinned to cores 0-7).
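
As a quick sketch (the executable name and rank count are placeholders), each job would then be launched along these lines:

# Intel MPI defines the domains, but the OS decides which free cores the ranks actually run on
mpiexec -genv I_MPI_PIN_DOMAIN auto -n 8 ./my_app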

What's really going on behind the scenes is that, since domains are defined as #cores/#procs, we're setting the #cores here to be equal to the #procs (so you have 1 core per domain).

Note that you can only use this if you have Intel MPI Library 3.1 Build 038 or newer.

I hope this helps. Let me know if this improves the situation.

Regards,
~Gergana
zhubq
Beginner

Hi Gergana,

Could you please take a look at my post: http://software.intel.com/en-us/forums/topic/365457

Thank you.

Benqiang

