Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Processes - socket mapping and threads per process - core mapping

psing51
New Contributor I

Hi,
I have a node with 2 sockets and 20 cores per socket (Intel(R) Xeon(R) Gold 6148 CPU).
I wish to launch 1 process per socket and 20 threads per process, and if possible, all threads should be pinned to their respective cores.

Earlier I used to run Intel binaries on a Cray machine with a similar core count, where the syntax was
aprun -n (mpi tasks) -N (tasks per node) -S (tasks per socket) -d (thread depth) <executable>, for example:

OMP_NUM_THREADS=20
aprun -n4 -N2 -S1 -d $OMP_NUM_THREADS ./a.out
 

node 0 socket 0 process#0 nprocs 4 thread id  0  nthreads 20  core id  0
node 0 socket 0 process#0 nprocs 4 thread id  1  nthreads 20  core id  1
....
node 0 socket 0 process#0 nprocs 4 thread id 19  nthreads 20  core id 19
node 0 socket 1 process#1 nprocs 4 thread id  0  nthreads 20  core id 20
...
node 0 socket 1 process#1 nprocs 4 thread id 19  nthreads 20  core id 39
....
node 1 socket 1 process#3 nprocs 4 thread id 19  nthreads 20  core id 39

 

How can I achieve the same or an equivalent effect using Intel's mpirun?

Yury_K_Intel
Employee

Hello,

Are you sure you are using the Intel MPI Library? I'm not sure we can assist you with Cray MPI use cases.

-

Best regards, Yury.

psing51
New Contributor I

Hi,
Yes, I am using Intel 2019 Update 2 on a non-Cray machine which has 2 sockets × 20 cores = 40 cores,
and my objective is to launch 1 process per socket, with all threads spawned by a process staying within its socket.
 

Cray's example was only a reference to what I am trying to achieve, since Cray's aprun can control process distribution across nodes and sockets, and thread distribution per socket, even when the binary was compiled with the Intel compilers.


I have attached an image for more clarification on what I am trying to achieve. I just want to understand whether Intel's mpirun is capable of distributing 1 MPI process per NUMA socket and then containing the threads spawned by each process within that socket (pinning each thread to the core where it started executing).

An example would be very helpful.
Please let me know if more clarification is required on my query.

 

James_C_Intel2
Employee

Assuming you are using the Intel (or LLVM) OpenMP compilers (or forcing the use of their runtime even with GCC :-)), then as long as the MPI process startup mechanism sets the right affinity mask, the OpenMP runtime will simply respect that affinity mask and start one thread for each logical CPU in that mask. Of course, that may be 40 threads per socket if your machine is running with hyper-threading enabled, so that you have 2 threads per core.

If you only want one thread per core, set KMP_HW_SUBSET=1T (see "Controlling Thread Allocation"). You can also see where the threads have been placed by setting KMP_AFFINITY=verbose, which may be worth doing to check what is happening initially! (Note that there is no need to set OMP_NUM_THREADS, which means there is one less thing to get wrong or to change when moving to a different machine.)
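For illustration, a minimal sketch of those settings on a single node (the binary name ./a.out is just a placeholder):

export KMP_HW_SUBSET=1T       # one OpenMP thread per core even if hyper-threading is on
export KMP_AFFINITY=verbose   # report the affinity decisions the OpenMP runtime makes
mpirun -n 2 ./a.out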

Alexey_M_Intel2
Employee

Hi,

To distribute one rank per socket, please use the I_MPI_PIN_DOMAIN=socket environment variable. The analog of your command line would be the following:

export OMP_NUM_THREADS=20

export I_MPI_PIN_DOMAIN=socket

mpirun -n 4 -ppn 2 ./a.out

For details, please refer to the documentation at https://software.intel.com/en-us/download/mpi-developer-reference-linux, chapter 3.4.
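As a quick sanity check of the resulting pinning (assuming the I_MPI_DEBUG output of your Intel MPI version reports the pin mapping at higher debug levels), something like this can be used:

I_MPI_DEBUG=4 mpirun -n 4 -ppn 2 ./a.out
(the startup output then lists which CPUs each rank has been pinned to)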

Kind Regards,

Alexey

 

psing51
New Contributor I

Thanks @Cowney for the insights; hyper-threading is disabled on the machine.

Thanks @Alexey for the reply. I am using the following code to check the thread distribution per socket.
 

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* for sleep() */
#include <omp.h>
#include <sched.h>    /* for sched_getcpu() */

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int _psize, _tsize;
    int _hostname_len;
    int _pid, _tid;
    char _hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_size(MPI_COMM_WORLD, &_psize);
    MPI_Comm_rank(MPI_COMM_WORLD, &_pid);
    MPI_Get_processor_name(_hostname, &_hostname_len);

#pragma omp parallel private(_tid, _tsize)
{
    _tid = omp_get_thread_num();
    _tsize = omp_get_num_threads();
    /* report which CPU this thread is currently running on */
    printf("\nnode %s, pid %d/%d ,tid %d/%d, core %d",
           _hostname, _pid, _psize, _tid, _tsize, sched_getcpu());
    fflush(stdout);
    sleep(4);   /* keep threads alive so they can be observed */
}
    MPI_Finalize();
    return 0;
}
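(For reference, a sketch of how the test above is typically built; the source file name and the mpiicc wrapper below are assumptions, not part of the original post:)

mpiicc -qopenmp check_pinning.c -o a.out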

 

Here is the stdout of a run:

I_MPI_PIN_DOMAIN=socket OMP_NUM_THREADS=20 mpirun -np 2 ./a.out | sort -n

node node01, pid 0/2 ,tid 0/20, core 38
node node01, pid 0/2 ,tid 10/20, core 26
node node01, pid 0/2 ,tid 11/20, core 24
node node01, pid 0/2 ,tid 1/20, core 16
node node01, pid 0/2 ,tid 12/20, core 0
node node01, pid 0/2 ,tid 13/20, core 12
node node01, pid 0/2 ,tid 14/20, core 10
node node01, pid 0/2 ,tid 15/20, core 8
node node01, pid 0/2 ,tid 16/20, core 4
node node01, pid 0/2 ,tid 17/20, core 18
node node01, pid 0/2 ,tid 18/20, core 0
node node01, pid 0/2 ,tid 19/20, core 2
node node01, pid 0/2 ,tid 2/20, core 14
node node01, pid 0/2 ,tid 3/20, core 16
node node01, pid 0/2 ,tid 4/20, core 10
node node01, pid 0/2 ,tid 5/20, core 28
node node01, pid 0/2 ,tid 6/20, core 0
node node01, pid 0/2 ,tid 7/20, core 22
node node01, pid 0/2 ,tid 8/20, core 6
node node01, pid 0/2 ,tid 9/20, core 2
node node01, pid 1/2 ,tid 0/20, core 39
node node01, pid 1/2 ,tid 10/20, core 3
node node01, pid 1/2 ,tid 11/20, core 1
node node01, pid 1/2 ,tid 1/20, core 17
node node01, pid 1/2 ,tid 12/20, core 1
node node01, pid 1/2 ,tid 13/20, core 13
node node01, pid 1/2 ,tid 14/20, core 11
node node01, pid 1/2 ,tid 15/20, core 9
node node01, pid 1/2 ,tid 16/20, core 5
node node01, pid 1/2 ,tid 17/20, core 23
node node01, pid 1/2 ,tid 18/20, core 21
node node01, pid 1/2 ,tid 19/20, core 19
node node01, pid 1/2 ,tid 2/20, core 15
node node01, pid 1/2 ,tid 3/20, core 13
node node01, pid 1/2 ,tid 4/20, core 9
node node01, pid 1/2 ,tid 5/20, core 5
node node01, pid 1/2 ,tid 6/20, core 1
node node01, pid 1/2 ,tid 7/20, core 3
node node01, pid 1/2 ,tid 8/20, core 7
node node01, pid 1/2 ,tid 9/20, core 25

 

With I_MPI_PIN_DOMAIN=socket, I was expecting P1 (pid 1/2, tid 0/20) to run on one of cores 0-19 and P0 (pid 0/2, tid 0/20) on cores 20-39, or vice versa. But it seems P0 and P1 land on the same socket.


Do you see any issues with the code/methodology for this verification? I also tested with Intel 2018; the results were similar.
 

 

 

 

James_C_Intel2
Employee

I still recommend setting KMP_AFFINITY=verbose. It will show you exactly what the OpenMP runtime is seeing and doing. (Without any affinity the threads won't be tightly bound, so your code may show the same physical location for more than one thread if threads are migrating.)

McCalpinJohn
Honored Contributor III

In this scenario, calling sched_getcpu() has two problems: (1) it only tells you where the thread is running when it executes the call, and (2) calling a system routine could cause a thread to migrate. Neither of these is a problem if you know that your threads are each bound to a single core, but your configuration does not guarantee this.

In this case what you want is the scheduling affinity mask for each thread, from sched_getaffinity(). Setting KMP_AFFINITY=verbose will provide the same information (formatted differently) on stderr when the program executes its first parallel section.
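For illustration, a minimal sketch of that check (not code from this thread): each OpenMP thread prints the set of CPUs its affinity mask allows.

/* sketch: print each OpenMP thread's allowed CPUs via sched_getaffinity() */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        cpu_set_t mask;
        /* pid 0 means "the calling thread" */
        if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
            #pragma omp critical
            {
                printf("thread %d may run on CPUs:", omp_get_thread_num());
                for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
                    if (CPU_ISSET(cpu, &mask))
                        printf(" %d", cpu);
                printf("\n");
            }
        }
    }
    return 0;
}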

You also need the output of "lscpu" (or similar) to clarify whether this node numbers the cores in blocks [0-19,20-39] or alternating [even,odd] between sockets.

If you want a way to determine what core you are currently running on without calling the OS (and risking generating a rescheduling event), you can use the "full_rdtscp()" function from https://github.com/jdmccalpin/low-overhead-timers to extract the socket and core information from the IA32_TSC_AUX MSR.  (Every version of Linux that supports RDTSCP configures the IA32_TSC_AUX register on each logical processor to contain the correct socket number and core number.)

 

Mikuchadze__George

I have a server with two Xeon Gold 6148 CPUs (20 cores each, hyper-threading enabled, and one NUMA domain per socket). There is software (WRF-ARW) compiled in hybrid MPI/OpenMP mode; the MPI is entirely Intel MPI, which comes with Parallel Studio XE 2020 Update 1. I want to run wrf.exe across the two processors, with one MPI rank on each and 18 OpenMP threads per MPI process. To do this, I do the following:

In my Bash script I have

  export I_MPI_PIN_DOMAIN=socket

  export KMP_HW_SUBSET=18c,1T

  export KMP_AFFINITY=verbose,granularity=fine,compact

  mpiexec -n 2 ./wrf.exe

In this case the processes stop with errors right from the start, but with:

 export KMP_HW_SUBSET=9c,1T

 mpiexec -n 4 ./wrf.exe

it works.

What I really need is 18c (1T or 2T) and one MPI rank per CPU. How can I solve the problem?

 

McCalpinJohn
Honored Contributor III

When you are working with Intel MPI, it is probably best to use the MPI binding variables and not KMP_HW_SUBSET.

I have not used Intel 2020 yet, but through 2019 I have had no trouble with the defaults: if you put two MPI tasks on a 2-socket system, it binds each task (and its underlying OpenMP threads) to a separate socket. The number of OpenMP threads should be set with OMP_NUM_THREADS=18.

The only thing that remains is ensuring the OpenMP threads are scheduled one per core. I use OMP_PROC_BIND=spread for this, but KMP_AFFINITY=scatter would also work. In my experience the MPI library does the right thing by default for hybrid MPI/OpenMP jobs with or without HyperThreading enabled, but it also cooperates with the OMP_PROC_BIND and KMP_AFFINITY variables.

Mikuchadze__George

Many thanks for your support, "Dr. Bandwidth".

I followed your advice and put the following in my bash script:

ulimit -s unlimited

 export I_MPI_PIN_DOMAIN=socket
 export OMP_NUM_THREADS=18
 export OMP_PROC_BIND=spread
mpiexec -n 2 ./wrf.exe

In less than 5 seconds the processes ended with errors:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 183908 RUNNING AT localhost.localdomain
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 183909 RUNNING AT localhost.localdomain
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
I have no idea what is going wrong!

jimdempseyatthecove
Honored Contributor III

Is your wrf.exe the test program listed in post #6?

If this is your application, then are you aware that

   ulimit -s unlimited

applies only to the master thread, and not to any OpenMP-created threads?

Set the stack size of the OpenMP-created threads using the environment variable:

OMP_STACKSIZE=nnnn[B|K|M|G|T]   (the default unit is K)

Or, with the Intel compilers, you can call kmp_set_stacksize_s(size)....
**** PRIOR TO THE FIRST PARALLEL REGION ****
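For illustration, a minimal sketch of the in-code approach (the 512 MB value is only an example; the environment-variable route would be, e.g., OMP_STACKSIZE=512M):

/* sketch: enlarge OpenMP worker stacks with the Intel OpenMP runtime API */
#include <stddef.h>
#include <omp.h>

int main(void) {
    /* must run before the first parallel region creates the worker threads */
    kmp_set_stacksize_s((size_t)512 * 1024 * 1024);   /* 512 MB per thread */
    #pragma omp parallel
    {
        /* threads created here get the larger stack */
    }
    return 0;
}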

Jim Dempsey

 
