Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Heterogeneous MPI thread affinity/cores?

JeffS
Novice

Setup:

  • I have a two-socket computer.  Each socket has an 18-core CPU.
  • My code runs as MPMD (there is a master rank distributing independent work to all of the other ranks).
  • For this example let's say there are 500 tasks to complete.
  • Due to the problem definition and the amount of memory on the machine, I can only run 24 worker ranks at one time.  This leaves 11 cores unused (36 - 1 master - 24 workers = 11).
  • Execution of a single problem can be sped up by running on more cores, so I would like to use 10 of the 11 free cores.

Is there a way for me to run 10 ranks with 2 cores each and the remaining 14 ranks with 1 core each, while getting the processor affinity correct?

 

Since I am undersubscribing, some processes end up with affinity masks of 3 cores.  The default MPI process pinning in this situation gives poor performance (the dual-core jobs run no faster than the single-core ones).  Ideally the processes would be laid out as 5 dual-core and 7 single-core ranks per socket, with each job staying within its socket.  I quickly tried a few different setups using the -configfile option, but I couldn't figure out how to get what I wanted.  I'm controlling the thread count with MKL_NUM_THREADS.

 

Example of my -configfile:

# Master

-n 1 ./someExecutable

# Workers

-n 10 -env MKL_NUM_THREADS 2 ./someExecutable

-n 14 -env MKL_NUM_THREADS 1 ./someExecutable

 

I also tried this:

# Master

-n 1 ./someExecutable

# Workers

-n 5 -env MKL_NUM_THREADS 2 ./someExecutable

-n 7 -env MKL_NUM_THREADS 1 ./someExecutable

-n 5 -env MKL_NUM_THREADS 2 ./someExecutable

-n 7 -env MKL_NUM_THREADS 1 ./someExecutable

TobiasK
Moderator

Hi @JeffS 
A couple of questions here:
1) Do you link with -qmkl=parallel and -qopenmp?
2) Do you have any OpenMP in your code?
3) Can you provide the output of mpirun with I_MPI_DEBUG=10 for your configfile and other arguments? (An example command follows.)
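
For reference, a run along these lines would produce that output (a sketch, assuming the configfile is named myconfig; substitute your usual arguments):

# I_MPI_DEBUG=10 makes Intel MPI print, among other things, the process pinning table
mpirun -genv I_MPI_DEBUG 10 -configfile myconfig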

Best
Tobias

JeffS
Novice
  1. -qmkl=parallel, yes.  -qopenmp, no.
  2. I don't have any OpenMP in my code.  The only "parallel" portions are MKL calls.
  3. Unfortunately I cannot provide this.  What would I be looking for here?
TobiasK
Moderator

@JeffS 
If you cannot provide the debug output, I cannot help you.

You may check with MKL_VERBOSE=1 whether the MKL functions you are using are really threaded.
You may also check where the threads are pinned with KMP_AFFINITY=verbose and OMP_DISPLAY_ENV=true.
(-qmkl=parallel should use the OpenMP threading layer.)
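
For example (a sketch, assuming the same hypothetical myconfig file; all three variables are standard MKL/OpenMP controls):

export MKL_VERBOSE=1          # MKL prints each call with its thread count and timing
export KMP_AFFINITY=verbose   # the Intel OpenMP runtime reports where each thread is bound
export OMP_DISPLAY_ENV=true   # the OpenMP runtime prints its effective settings at startup
mpirun -configfile myconfig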

JeffS
Novice

The MKL calls are definitely parallel and obey the MKL_NUM_THREADS environment variable.  I've used them plenty for this code.

 

I am capturing thread affinity in my code and outputting it.  Some workers get 2 threads and some get 3, since the number of cores is not evenly divisible by the number of workers.  Is there an easy way to set the number of cores given to a worker?
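
(One way to capture this without touching the application is a small wrapper script. The below is only a sketch: it assumes Linux, and that Intel MPI's Hydra launcher exports PMI_RANK to each process; print-affinity.sh is a hypothetical name.)

#!/bin/bash
# print-affinity.sh: report the CPUs this rank is allowed to run on, then exec the real binary.
# exec preserves the affinity mask Intel MPI applied to the wrapper process.
echo "rank ${PMI_RANK:-?}: $(grep Cpus_allowed_list /proc/self/status)"
exec ./someExecutable "$@"

Each -n line in the configfile would then launch ./print-affinity.sh in place of ./someExecutable.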

 

Maybe I need to pin each group of workers in my config file so that it is constrained to a single socket?

 

How do I accomplish the setup below? (A sketch using an explicit pin-domain masklist follows the list.)

  • Total resources: 256GB RAM, 2 CPUs with 18 cores each
  • Resources per socket: 128GB and 18 cores
    • For the jobs running on a socket I want to make sure the memory stays under the limit and does not overflow into the other socket's memory.
  • Ideal setup (on each socket):
    • 5 jobs using 2 cores
    • 7 jobs using a single core
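
One way to express this layout explicitly is an I_MPI_PIN_DOMAIN masklist: one hexadecimal CPU mask per rank, listed in the order the ranks appear in the configfile (master first, then socket 0's five dual-core and seven single-core workers, then the same for socket 1). This is only a sketch and assumes logical CPUs 0-17 sit on socket 0 and 18-35 on socket 1 (check with lscpu); core 35 is left idle.

export I_MPI_PIN=1

# master on core 0; socket 0 dual-core domains {1,2}..{9,10}, then single cores 11-17;
# socket 1 dual-core domains {18,19}..{26,27}, then single cores 28-34
export I_MPI_PIN_DOMAIN=[0x1,0x6,0x18,0x60,0x180,0x600,0x800,0x1000,0x2000,0x4000,0x8000,0x10000,0x20000,0xC0000,0x300000,0xC00000,0x3000000,0xC000000,0x10000000,0x20000000,0x40000000,0x80000000,0x100000000,0x200000000,0x400000000]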
TobiasK
Moderator

Again, without the debug output (+ lscpu) I am not able to help you.

You may try to define a pinning mask using the pinning simulator:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library-pinning-simulator.html
and use the "Masklist Editing Mode" after defining your node configuration in step 2.
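
(The masklist mode yields an explicit domain list; a hedged example of the general shape, with one hex mask per rank in rank order:

export I_MPI_PIN_DOMAIN=[0x1,0x6,0x18,0x800]

Here the first rank is pinned to core 0, the next two ranks to core pairs {1,2} and {3,4}, and the last to core 11.)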

JeffS
Novice

Interesting tool.  I hadn't seen that before.

 

In my original setup I was only getting 1 core per worker, which is why it ran slowly.

 

The below accomplishes what I want.  It may not be the most explicit or best way to get it done, but the workers with 2 threads do run twice as fast.

 

Set pinning to NUMA domains:

export I_MPI_PIN=1

export I_MPI_PIN_DOMAIN=numa

 

# Master

-n 1 ./someExecutable

# Workers

-n 5 -env MKL_NUM_THREADS 2 ./someExecutable

-n 7 -env MKL_NUM_THREADS 1 ./someExecutable

-n 5 -env MKL_NUM_THREADS 2 ./someExecutable

-n 7 -env MKL_NUM_THREADS 1 ./someExecutable

 

 
