Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Processes - socket mapping and threads per process - core mapping

psing51
New Contributor I

Hi,
I have a node with 2 sockets and 20 cores per socket (Intel(R) Xeon(R) Gold 6148 CPU).
I wish to launch 1 process per socket and 20 threads per process, and if possible all threads should be pinned to their respective cores.

Earlier I used to run Intel binaries on a Cray machine with a similar core count, and the syntax was:
aprun -n (MPI tasks) -N (tasks per node) -S (tasks per socket) -d (thread depth) <executable>, for example:

export OMP_NUM_THREADS=20
aprun -n4 -N2 -S1 -d $OMP_NUM_THREADS ./a.out
 

node 0 socket 0 process#0 nprocs 4 thread id  0  nthreads 20  core id  0
node 0 socket 0 process#0 nprocs 4 thread id  1  nthreads 20  core id  1
....
node 0 socket 0 process#0 nprocs 4 thread id 19  nthreads 20  core id 19
node 0 socket 1 process#1 nprocs 4 thread id  0  nthreads 20  core id 20
...
node 0 socket 1 process#1 nprocs 4 thread id 19  nthreads 20  core id 39
....
node 1 socket 1 process#3 nprocs 4 thread id 19  nthreads 20  core id 39

 

How can I achieve the same (or an equivalent) effect using Intel's mpirun?

Yury_K_Intel
Employee

Hello,

Are you sure you are using the Intel MPI Library? I'm not sure we can assist you with Cray MPI use cases.


Best regards, Yury.

psing51
New Contributor I

Hi,
Yes, I am using Intel 19 Update 2 on a non-Cray machine which has 2 sockets x 20 cores = 40 cores,
and my objective is to launch 1 process per socket, with all threads spawned by a process staying within that socket.
 

Cray's example was only a reference for what I am trying to achieve, as Cray's aprun can control process distribution across nodes/sockets and thread distribution per socket, even when the binary was compiled with the Intel compilers.


I have attached an image to further clarify what I am trying to achieve. I just want to understand whether Intel's mpirun is capable of distributing 1 MPI process per NUMA socket and then containing the threads spawned by each process within that socket (pinning each thread to its respective core, where it started executing).

An example would be very helpful.
Please let me know if more clarification is required on my query.

 

James_C_Intel2
Employee

Assuming you are using the Intel (or LLVM) OpenMP compilers (or forcing the use of their runtime even with GCC :-)), then as long as the MPI process startup mechanism sets the right affinity mask, the OpenMP runtime will simply respect that affinity mask and will start one thread for each logical CPU in that mask. Of course, that may be 40T/socket if your machine is running with hyper-threading enabled so that you have 2T/C. If you only want one thread per core, set KMP_HW_SUBSET=1T (see "Controlling Thread Allocation"). You can also see where the threads have been placed by setting KMP_AFFINITY=verbose, which may be worth doing to check what is happening initially! (Note that there is no need to set OMP_NUM_THREADS, which means there's one less thing to get wrong or to have to change when moving to a different machine!)

Alexey_M_Intel2
Employee

Hi,

To distribute one rank per socket, please use the I_MPI_PIN_DOMAIN=socket environment variable. The analog of your command line would be the following:

export OMP_NUM_THREADS=20

export I_MPI_PIN_DOMAIN=socket

mpirun -n 4 -ppn 2 ./a.out

For details, please refer to the documentation at https://software.intel.com/en-us/download/mpi-developer-reference-linux, chapter 3.4.

Kind Regards,

Alexey

 

psing51
New Contributor I

Thanks @Cowney for the insights. Hyper-threading is disabled on the machine.

Thanks @Alexey for the reply. I am using the following code to check the thread distribution per socket.
 

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* sleep() */
#include <omp.h>
#include <sched.h>    /* sched_getcpu() */

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int _psize, _tsize;
    int _hostname_len;
    int _pid, _tid;
    char _hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_size(MPI_COMM_WORLD, &_psize);
    MPI_Comm_rank(MPI_COMM_WORLD, &_pid);
    MPI_Get_processor_name(_hostname, &_hostname_len);

#pragma omp parallel private(_tid, _tsize)
{
    _tid = omp_get_thread_num();
    _tsize = omp_get_num_threads();
    /* report where each OpenMP thread of each rank is currently running */
    printf("\nnode %s, pid %d/%d ,tid %d/%d, core %d", _hostname, _pid, _psize, _tid, _tsize, sched_getcpu());
    fflush(stdout);
    sleep(4);
}
    MPI_Finalize();
    return 0;
}

 

Here is the stdout of a run - 

 I_MPI_PIN_DOMAIN=socket OMP_NUM_THREADS=20 mpirun -np 2 ./a.out |sort -n

                       node node01, pid 0/2 ,tid 0/20, core 38
node node01, pid 0/2 ,tid 10/20, core 26
node node01, pid 0/2 ,tid 11/20, core 24
node node01, pid 0/2 ,tid 1/20, core 16
node node01, pid 0/2 ,tid 12/20, core 0
node node01, pid 0/2 ,tid 13/20, core 12
node node01, pid 0/2 ,tid 14/20, core 10
node node01, pid 0/2 ,tid 15/20, core 8
node node01, pid 0/2 ,tid 16/20, core 4
node node01, pid 0/2 ,tid 17/20, core 18
node node01, pid 0/2 ,tid 18/20, core 0
node node01, pid 0/2 ,tid 19/20, core 2
node node01, pid 0/2 ,tid 2/20, core 14
node node01, pid 0/2 ,tid 3/20, core 16
node node01, pid 0/2 ,tid 4/20, core 10
node node01, pid 0/2 ,tid 5/20, core 28
node node01, pid 0/2 ,tid 6/20, core 0
node node01, pid 0/2 ,tid 7/20, core 22
node node01, pid 0/2 ,tid 8/20, core 6
node node01, pid 0/2 ,tid 9/20, core 2
                      node node01, pid 1/2 ,tid 0/20, core 39
node node01, pid 1/2 ,tid 10/20, core 3
node node01, pid 1/2 ,tid 11/20, core 1
node node01, pid 1/2 ,tid 1/20, core 17
node node01, pid 1/2 ,tid 12/20, core 1
node node01, pid 1/2 ,tid 13/20, core 13
node node01, pid 1/2 ,tid 14/20, core 11
node node01, pid 1/2 ,tid 15/20, core 9
node node01, pid 1/2 ,tid 16/20, core 5
node node01, pid 1/2 ,tid 17/20, core 23
node node01, pid 1/2 ,tid 18/20, core 21
node node01, pid 1/2 ,tid 19/20, core 19
node node01, pid 1/2 ,tid 2/20, core 15
node node01, pid 1/2 ,tid 3/20, core 13
node node01, pid 1/2 ,tid 4/20, core 9
node node01, pid 1/2 ,tid 5/20, core 5
node node01, pid 1/2 ,tid 6/20, core 1
node node01, pid 1/2 ,tid 7/20, core 3
node node01, pid 1/2 ,tid 8/20, core 7
node node01, pid 1/2 ,tid 9/20, core 25

 

With I_MPI_PIN_DOMAIN=socket, I was expecting P1 (pid 1/2, tid 0/20) to run on one of cores 0-19 and P0 (pid 0/2, tid 0/20) on cores 20-39, or vice versa. But it seems P0 and P1 land on the same socket.


Do you see any issues with the code/methodology for this verification? I tested with Intel 2018; the results were similar.
James_C_Intel2
Employee

I still recommend setting "KMP_AFFINITY=verbose". It will show you exactly what the OpenMP runtime is seeing and doing... (without any affinity the threads won't be tightly bound, so your code may show the same physical location from more than one thread if threads are migrating...).

McCalpinJohn
Honored Contributor III

In this scenario, calling sched_getcpu() has two problems: (1) it only tells you where the thread is running when it executes the call, and (2) calling a system routine could cause the thread to migrate. Neither of these is a problem if you know that your threads are each bound to a single core, but your configuration does not guarantee this.

In this case, what you want is the scheduling affinity mask for each thread, from sched_getaffinity(). Setting KMP_AFFINITY=verbose will provide the same information (formatted differently) on stderr when the program executes its first parallel section.

You also need the output of "lscpu" (or similar) to clarify whether this node numbers the cores in blocks [0-19,20-39] or alternating [even,odd] between sockets.

If you want a way to determine what core you are currently running on without calling the OS (and risking generating a rescheduling event), you can use the "full_rdtscp()" function from https://github.com/jdmccalpin/low-overhead-timers to extract the socket and core information from the IA32_TSC_AUX MSR.  (Every version of Linux that supports RDTSCP configures the IA32_TSC_AUX register on each logical processor to contain the correct socket number and core number.)
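For reference, here is a minimal sketch (not the original poster's exact program) of the sched_getaffinity() approach described above: each OpenMP thread prints the set of logical CPUs it is allowed to run on, rather than the momentary CPU from sched_getcpu(). It assumes Linux and a hybrid MPI/OpenMP build, compiled with something like mpiicc -qopenmp.

#define _GNU_SOURCE        /* needed for sched_getaffinity() and the CPU_* macros */
#include <sched.h>
#include <stdio.h>
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#pragma omp parallel
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    /* pid 0 means "the calling thread"; each OpenMP thread has its own mask */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        char buf[1024];
        int pos = 0;
        for (int cpu = 0; cpu < CPU_SETSIZE && pos < (int)sizeof(buf) - 8; cpu++)
            if (CPU_ISSET(cpu, &mask))
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%d ", cpu);
        printf("rank %d thread %d allowed CPUs: %s\n", rank, omp_get_thread_num(), buf);
        fflush(stdout);
    }
}
    MPI_Finalize();
    return 0;
}

With correct per-socket pinning, all threads of one rank should report CPUs belonging to a single socket (compare against the lscpu numbering mentioned above).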

 

Mikuchadze__George

I have a server with two Xeon Gold 6148 CPUs (20 cores each, hyper-threading enabled, and one NUMA domain per socket). There is software (WRF-ARW) compiled in hybrid MPI/OpenMP mode; the MPI is entirely Intel's, as shipped with Parallel Studio XE 2020 Update 1. I want to run wrf.exe on the two processors, with one MPI rank on each and 18 OpenMP threads for each MPI process. To do this, I do the following:

In my Bash script I have

  export I_MPI_PIN_DOMAIN=socket

  export KMP_HW_SUBSET=18c,1T

  export KMP_AFFINITY=verbose,granularity=fine,compact

  mpiexec -n 2 ./wrf.exe

In this case the processes stop with errors right from the start, but with:

 export KMP_HW_SUBSET=9c,1T

 mpiexec -n 4 ./wrf.exe

that works.

What I really need is 18c (1T or 2T) and one MPI rank per CPU. How can I solve this problem?

 

McCalpinJohn
Honored Contributor III

When you are working with Intel MPI, it is probably best to use the MPI binding variables and not KMP_HW_SUBSET....

I have not used Intel 2020 yet, but through 2019 I have had no trouble with the defaults: if you put two MPI tasks on a 2-socket system, the library binds each task (and its underlying OpenMP threads) to a separate socket. The number of OpenMP threads should be set by OMP_NUM_THREADS=18.

The only thing that remains is ensuring the OpenMP threads are scheduled one per core.  I use OMP_PROC_BIND=spread for this, but KMP_AFFINITY=scatter would also work.  In my experience the MPI library does the right thing by default for hybrid MPI/OpenMP jobs with or without HyperThreading enabled, but it also cooperates with the OMP_PROC_BIND and KMP_AFFINITY variables.

Mikuchadze__George

Many thanks for your support, "Dr. Bandwidth".

I followed your advice and put the following in my bash script:

ulimit -s unlimited

 export I_MPI_PIN_DOMAIN=socket
 export OMP_NUM_THREADS=18
 export OMP_PROC_BIND=spread
mpiexec -n 2 ./wrf.exe

In less than 5 seconds the processes ended with errors:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 183908 RUNNING AT localhost.localdomain
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 183909 RUNNING AT localhost.localdomain
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
I have no idea!

jimdempseyatthecove
Honored Contributor III

Is your wrf.exe the test program listed in post #6?

If this is your application, then are you aware that

   ulimit -s unlimited

applies only to the master thread, and not to any OpenMP-created threads?

Set the stack size of the OpenMP-created threads using the environment variable:

OMP_STACKSIZE=nnnn[B|K|M|G|T] (the default unit is K)

Or, with the Intel compilers, you can call kmp_set_stacksize_s(size)...
**** PRIOR TO THE FIRST PARALLEL REGION ****
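
For example, a minimal sketch of that call (kmp_set_stacksize_s() is an extension of the Intel/LLVM OpenMP runtime declared in omp.h; the 512 MB value below is only illustrative):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Must happen before the runtime creates its worker threads,
       i.e. before the first parallel region */
    kmp_set_stacksize_s((size_t)512 * 1024 * 1024);   /* 512 MB per OpenMP thread */

#pragma omp parallel
    {
#pragma omp single
        printf("OpenMP thread stack size: %zu bytes\n", kmp_get_stacksize_s());
    }
    return 0;
}

The equivalent environment-variable setting would be OMP_STACKSIZE=512M.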

Jim Dempsey

 
