Intel® MPI Library

Multiple Intel MPI Jobs on a Single Node Using Slurm

David_Touati
Beginner

I have a cluster with a single node of 4 sockets, 32 cores in total. The system runs Red Hat 6.3 with Intel MPI 4 Update 3, and I use Slurm to start the MPI jobs. Whenever I run multiple MPI jobs on this single node, all of the jobs end up on the same processors, and each job uses all of the cores in the node. For example, I started the first MPI job through Slurm with 8 ranks and saw that the first MPI task ran on CPUs 0-3, the second task on CPUs 4-7, and so on, with the last task on CPUs 28-31. Each MPI task used 4 cores instead of 1. I then started a second job with 8 ranks and saw the same layout: its ranks ran on the same 32 CPUs as the first job.

Is there a way to tell mpirun under Slurm to set the task affinity correctly on each run, so that it only chooses the processors that Slurm considers idle?
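
For context, each job is submitted with something along these lines (a minimal sketch; the script contents and the binary name ./my_app are placeholders, and I_MPI_DEBUG=4 is only there to make the pinning visible in the job output):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=8
    # Print the pinning map so the overlap described above shows up in the output.
    export I_MPI_DEBUG=4
    # ./my_app stands in for the real binary.
    mpirun -np 8 ./my_app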
Thanks.

4 Replies
TimP
Honored Contributor III

As far as I know, you must set I_MPI_PIN_DOMAIN=off in order for this to work at all.  If you can demonstrate the value of receiving CPU assignments from Slurm, you might file a feature request.  I don't think splitting it down to the core level for separate jobs is likely to work well; splitting down to the socket level might be useful.  You could make a case that clusters with nodes of 4 or more CPUs would be more valuable with such a feature.

If your request turns out to be outside the mainstream, you might have to script it yourself, using KMP_AFFINITY or the OpenMP 4.0 equivalent to assign cores to each job.
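
A rough sketch of what that scripting might look like for one OpenMP job, assuming your own script has decided the job should run on cores 8-15 (the core numbers and binary name are purely illustrative):

    # Disable the library's own pinning, as described above.
    export I_MPI_PIN_DOMAIN=off
    # Pin this job's threads explicitly to cores 8-15.
    export OMP_NUM_THREADS=8
    export KMP_AFFINITY=granularity=fine,proclist=[8,9,10,11,12,13,14,15],explicit
    mpirun -np 1 ./my_app    # ./my_app is a placeholder for the real binary

A second job started the same way, but with a different proclist, would then stay off those cores.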

James_T_Intel
Moderator

Hi David,

If you are using cpuset, the current version of the Intel® MPI Library does not support it.  The next release will, so if that is the case, just sit tight for a bit longer.

If not, let me know and we'll work from there.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

David_Touati
Beginner

Thanks James,

I am not using cpuset. I assumed that Slurm would take care of that.

James_T_Intel
Moderator

Hi David,

I misunderstood your original question, so let's change the approach.  We do not currently check resource utilization from the job manager.  Internally, we typically reserve an entire node for a single job, as two different MPI jobs do not communicate with each other.

At present, the only way to do this is manually.  You'll need to get a list of available cores from SLURM*.  Is your application single or multi-threaded?
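
Either way, you'll first need that core list. One quick way to see which cores SLURM* has actually handed to a job step (assuming SLURM* is setting task affinity for the job; illustrative only) is to check the affinity of the job's shell:

    # Inside the job script: list the CPUs this process may run on.
    grep Cpus_allowed_list /proc/self/status
    # or equivalently:
    taskset -cp $$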

If single-threaded, then you'll set I_MPI_PIN_PROCESSOR_LIST to match the available (and desired) cores, with one rank going to each core.  This will define a single core for each rank to use.
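
For example (a minimal sketch; the core range is illustrative and would come from the list SLURM* gives you, and ./my_app is a placeholder):

    # 8 single-threaded ranks, one rank per core, confined to cores 8-15.
    export I_MPI_PIN_PROCESSOR_LIST=8-15
    mpirun -np 8 ./my_app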

If multi-threaded, then you'll set I_MPI_PIN_DOMAIN instead.  This will set a group of cores available for each rank, and you'll use KMP_AFFINITY to control the thread placement within that domain.
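
For example (again only a sketch; the masks are illustrative: F00 covers cores 8-11 and F000 covers cores 12-15):

    # 2 ranks x 4 OpenMP threads on cores 8-15, each rank in its own domain.
    export I_MPI_PIN_DOMAIN=[F00,F000]
    export OMP_NUM_THREADS=4
    export KMP_AFFINITY=compact    # keep each rank's threads inside its domain
    mpirun -np 2 ./my_app          # ./my_app is a placeholder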

There are quite a few syntax options for each of these variables, so please check the Reference Manual for full details.

As Tim said, if you're interested, I can file a feature request for this capability.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
