Hi,
I have a node with 2 sockets and 20 cores per socket (Intel(R) Xeon(R) Gold 6148 CPU).
I wish to launch 1 process per socket with 20 threads per process, and if possible all threads should be pinned to their respective cores.
Earlier I used to run Intel binaries on a Cray machine with a similar core count, and the syntax was:
aprun -n (mpi tasks) -N (tasks per node) -S (tasks per socket) -d (thread depth) <executable>, for example:
OMP_NUM_THREADS=20
aprun -n4 -N2 -S1 -d $OMP_NUM_THREADS ./a.out
node 0 socket 0 process#0 nprocs 4 thread id 0 nthreads 20 core id 0
node 0 socket 0 process#0 nprocs 4 thread id 1 nthreads 20 core id 1
....
node 0 socket 0 process#0 nprocs 4 thread id 19 nthreads 20 core id 19
node 0 socket 1 process#1 nprocs 4 thread id 0 nthreads 20 core id 20
...
node 0 socket 1 process#1 nprocs 4 thread id 19 nthreads 20 core id 39
....
node 1 socket 1 process#3 nprocs 4 thread id 19 nthreads 20 core id 39
How can I achieve the same/equivalent effect using Intel's mpirun?
Hello,
Are you sure you are using the Intel MPI Library? I'm not sure we can assist you with Cray MPI use cases.
Best regards, Yury.
Hi,
Yes, I am using Intel 2019 Update 2 on a non-Cray machine that has 2 sockets x 20 cores = 40 cores,
and my objective is to launch 1 process per socket, with all threads spawned by a process staying within its socket.
Cray's example was only a reference for what I am trying to achieve, since Cray's aprun can control process distribution across nodes/sockets and thread distribution per socket, even when the binary was compiled with Intel compilers.
I have attached an image to clarify what I am trying to achieve. I just want to understand whether Intel's mpirun is capable of distributing 1 MPI process per NUMA socket and then containing the threads spawned by each process within that socket (pinning each thread to the core where it started executing).
An example would be very helpful.
Please let me know if more clarification of my query is required.
Assuming you are using the Intel (or LLVM) OpenMP compilers (or forcing the use of their runtime even with GCC :-)), then as long as the MPI process startup mechanism sets the right affinity mask, the OpenMP runtime will simply respect that affinity mask and start one thread for each logical CPU in that mask. Of course, that may be 40 threads/socket if your machine is running with hyper-threading enabled so that you have 2 threads/core. If you only want one thread per core, set KMP_HW_SUBSET=1T (see Controlling Thread Allocation). You can also see where the threads have been placed by setting KMP_AFFINITY="verbose", which may be worth doing to check what is happening initially! (Note that there is no need to set OMP_NUM_THREADS, which means there's one less thing to get wrong or have to change when moving to a different machine!)
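As a minimal sketch of those two settings (using the ./a.out binary and two ranks from your aprun example; the MPI launcher still has to hand each rank a per-socket affinity mask, as discussed below):
export KMP_HW_SUBSET=1T      # at most one OpenMP thread per physical core
export KMP_AFFINITY=verbose  # report the thread-to-core mapping on stderr
mpirun -n 2 ./a.out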
Hi,
To distribute one rank per socket, please use the I_MPI_PIN_DOMAIN=socket environment variable. The analog of your command line would be the following:
export OMP_NUM_THREADS=20
export I_MPI_PIN_DOMAIN=socket
mpirun -n 4 -ppn 2 ./a.out
For details, please refer to the documentation (https://software.intel.com/en-us/download/mpi-developer-reference-linux), chapter 3.4.
Kind Regards,
Alexey
Thanks @Cowney for the insights; hyper-threading is disabled on the machine.
Thanks @Alexey for the reply. I am using the following code to check the thread distribution per socket.
#define _GNU_SOURCE   /* for sched_getcpu() */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* for sleep() */
#include <omp.h>
#include <sched.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);
    int _psize, _tsize;
    int _hostname_len;
    int _pid, _tid;
    char _hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &_psize);
    MPI_Comm_rank(MPI_COMM_WORLD, &_pid);
    MPI_Get_processor_name(_hostname, &_hostname_len);
    #pragma omp parallel private(_tid, _tsize)
    {
        _tid = omp_get_thread_num();
        _tsize = omp_get_num_threads();
        /* Report the core this thread is running on at the time of the call. */
        printf("\nnode %s, pid %d/%d ,tid %d/%d, core %d",
               _hostname, _pid, _psize, _tid, _tsize, sched_getcpu());
        fflush(stdout);
        sleep(4);   /* hold the parallel region open so all threads report */
    }
    MPI_Finalize();
}
Here is the stdout of a run:
I_MPI_PIN_DOMAIN=socket OMP_NUM_THREADS=20 mpirun -np 2 ./a.out | sort -n
node node01, pid 0/2 ,tid 0/20, core 38
node node01, pid 0/2 ,tid 10/20, core 26
node node01, pid 0/2 ,tid 11/20, core 24
node node01, pid 0/2 ,tid 1/20, core 16
node node01, pid 0/2 ,tid 12/20, core 0
node node01, pid 0/2 ,tid 13/20, core 12
node node01, pid 0/2 ,tid 14/20, core 10
node node01, pid 0/2 ,tid 15/20, core 8
node node01, pid 0/2 ,tid 16/20, core 4
node node01, pid 0/2 ,tid 17/20, core 18
node node01, pid 0/2 ,tid 18/20, core 0
node node01, pid 0/2 ,tid 19/20, core 2
node node01, pid 0/2 ,tid 2/20, core 14
node node01, pid 0/2 ,tid 3/20, core 16
node node01, pid 0/2 ,tid 4/20, core 10
node node01, pid 0/2 ,tid 5/20, core 28
node node01, pid 0/2 ,tid 6/20, core 0
node node01, pid 0/2 ,tid 7/20, core 22
node node01, pid 0/2 ,tid 8/20, core 6
node node01, pid 0/2 ,tid 9/20, core 2
node node01, pid 1/2 ,tid 0/20, core 39
node node01, pid 1/2 ,tid 10/20, core 3
node node01, pid 1/2 ,tid 11/20, core 1
node node01, pid 1/2 ,tid 1/20, core 17
node node01, pid 1/2 ,tid 12/20, core 1
node node01, pid 1/2 ,tid 13/20, core 13
node node01, pid 1/2 ,tid 14/20, core 11
node node01, pid 1/2 ,tid 15/20, core 9
node node01, pid 1/2 ,tid 16/20, core 5
node node01, pid 1/2 ,tid 17/20, core 23
node node01, pid 1/2 ,tid 18/20, core 21
node node01, pid 1/2 ,tid 19/20, core 19
node node01, pid 1/2 ,tid 2/20, core 15
node node01, pid 1/2 ,tid 3/20, core 13
node node01, pid 1/2 ,tid 4/20, core 9
node node01, pid 1/2 ,tid 5/20, core 5
node node01, pid 1/2 ,tid 6/20, core 1
node node01, pid 1/2 ,tid 7/20, core 3
node node01, pid 1/2 ,tid 8/20, core 7
node node01, pid 1/2 ,tid 9/20, core 25
With I_MPI_PIN_DOMAIN=socket, I was expecting P1 (pid 1/2, tid 0/20) to run on one of cores 0-19 and P0 (pid 0/2, tid 0/20) on cores 20-39, or vice versa. But it seems P0 and P1 land on the same socket.
Do you see any issues with the code/methodology for this verification? I tested with Intel 2018; the results were similar.
I still recommend setting "KMP_AFFINITY=verbose". It will show you exactly what the OpenMP runtime is seeing and doing... (Without any affinity the threads won't be tightly bound, so your code may show the same physical location for more than one thread if threads are migrating.)
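For example, rerunning your test with the verbose modifier added (a sketch based on your command line above):
I_MPI_PIN_DOMAIN=socket OMP_NUM_THREADS=20 KMP_AFFINITY=verbose mpirun -np 2 ./a.out
# Each rank's OpenMP runtime prints lines such as
# "KMP_AFFINITY: pid ... thread 0 bound to OS proc set {...}" on stderr,
# showing the affinity mask every thread actually received.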
In this scenario, calling "sched_getcpu()" has two problems: (1) it only tells you where the thread is running when it executes the call, and (2) calling a system routine could cause a thread to migrate. Neither of these is a problem if you know that your threads are each bound to a single core, but your configuration does not guarantee this.
In this case what you want is the scheduling affinity mask for each thread, from "sched_getaffinity()". Setting KMP_AFFINITY=verbose will provide the same information (formatted differently) on stderr when the program executes its first parallel section.
You also need the output of "lscpu" (or similar) to clarify whether this node numbers the cores in blocks [0-19, 20-39] or alternating [even, odd] between sockets.
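For example (a sketch; these are standard lscpu extended-output columns):
lscpu -e=CPU,NODE,SOCKET,CORE
# One row per logical CPU; comparing CPU against SOCKET shows whether the
# numbering runs in blocks [0-19 on socket 0] or alternates between sockets.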
If you want a way to determine what core you are currently running on without calling the OS (and risking generating a rescheduling event), you can use the "full_rdtscp()" function from https://github.com/jdmccalpin/low-overhead-timers to extract the socket and core information from the IA32_TSC_AUX MSR. (Every version of Linux that supports RDTSCP configures the IA32_TSC_AUX register on each logical processor to contain the correct socket number and core number.)
I have a server with two Xeon Gold 6148 CPUs (20 cores each, hyper-threading enabled, and one NUMA domain per socket). There is software (WRF-ARW) compiled in hybrid MPI/OpenMP mode; the MPI is entirely Intel MPI, which comes with Parallel Studio XE 2020 Update 1. I want to run wrf.exe on the two processors, with one MPI rank on each and 18 OpenMP threads per rank. To do this, I have the following in my bash script:
export I_MPI_PIN_DOMAIN=socket
export KMP_HW_SUBSET=18c,1T
KMP_AFFINITY=verbose,granularity=fine,compact
mpiexec -n 2 ./wrf.exe
In this case the processes abort with errors right from the start, but if:
export KMP_HW_SUBSET=9c,1T
mpiexec -n 4 ./wrf.exe
then it works.
I really need 18c (1T or 2T) and one MPI rank per CPU. How can I solve this problem?
When you are working with Intel MPI, it is probably best to use the MPI binding variables and not KMP_HW_SUBSET....
I have not used Intel 2020 yet, but through 2019 I have had no trouble with the defaults -- if you put two MPI tasks on a 2-socket system, it binds each task (and its underlying OpenMP threads) to a separate socket. The number of OpenMP threads should be set by OMP_NUM_THREADS=18.
The only thing that remains is ensuring the OpenMP threads are scheduled one per core. I use OMP_PROC_BIND=spread for this, but KMP_AFFINITY=scatter would also work. In my experience the MPI library does the right thing by default for hybrid MPI/OpenMP jobs with or without HyperThreading enabled, but it also cooperates with the OMP_PROC_BIND and KMP_AFFINITY variables.
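Putting that together, a minimal sketch for your two-socket run (using the wrf.exe and thread count from your post):
export OMP_NUM_THREADS=18
export OMP_PROC_BIND=spread   # one thread per core, spread across the rank's domain
mpiexec -n 2 ./wrf.exe
# With two ranks on a two-socket node, the default Intel MPI pinning gives each
# rank its own socket; spread then spaces the 18 threads across its 20 cores.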
Many thanks for your support, "Dr. Bandwidth".
I followed your advice and put the following in my bash script:
ulimit -s unlimited
export I_MPI_PIN_DOMAIN=socket
export OMP_NUM_THREADS=18
export OMP_PROC_BIND=spread
mpiexec -n 2 ./wrf.exe
In less than 5 seconds the processes ended with errors:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 183908 RUNNING AT localhost.localdomain
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 183909 RUNNING AT localhost.localdomain
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
I am out of ideas!
Is your wrf.exe the test program listed in post #6?
If this is your application, then are you aware that
ulimit -s unlimited
applies only to the master thread, and not to any OpenMP-created threads?
Set the stack size of OpenMP-created threads using the environment variable:
OMP_STACKSIZE=nnnn[B|K|M|G|T] (the default unit is K)
Or, with Intel compilers you can call kmp_set_stacksize_s(size)....
**** PRIOR TO FIRST PARALLEL REGION ****
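For example (a sketch; the 512M value is only an illustration, size it to your model's needs):
export OMP_STACKSIZE=512M   # stack size for each OpenMP worker thread
mpiexec -n 2 ./wrf.exe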
Jim Dempsey