Setup:
- I have a 2-socket machine. Each socket has an 18-core CPU.
- My code runs as MPMD (a master rank distributes independent work to all of the other ranks).
- For this example, let's say there are 500 tasks to complete.
- Due to the problem definition and the amount of memory on the machine, I can only run 24 worker ranks at one time. This leaves 11 cores unused (36 - 1 master - 24 workers = 11).
- Execution of a single problem can be sped up by running on more cores, so I would like to use 10 of the 11 free cores.
Is there a way for me to run 10 ranks with 2 cores each, and the remaining 14 ranks with 1 core each, while getting the processor affinity correct?
Since I am undersubscribing, some processes get an affinity of 3 cores. The default MPI process pinning in this situation gives poor performance (the dual-core jobs run no faster than the single-core ones). Ideally the layout would be 5 dual-core and 7 single-core ranks per socket, with each job staying within its socket. I quickly tried a few different setups using the -configfile option, but I couldn't figure out how to get what I wanted. I'm controlling thread counts with MKL_NUM_THREADS.
Example of my -configfile:
# Master
-n 1 ./someExecutable
# Workers
-n 10 -env MKL_NUM_THREADS 2 ./someExecutable
-n 14 -env MKL_NUM_THREADS 1 ./someExecutable
Also tried this:
# Master
-n 1 ./someExecutable
# Workers
-n 5 -env MKL_NUM_THREADS 2 ./someExecutable
-n 7 -env MKL_NUM_THREADS 1 ./someExecutable
-n 5 -env MKL_NUM_THREADS 2 ./someExecutable
-n 7 -env MKL_NUM_THREADS 1 ./someExecutable
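In both cases the whole job is launched by handing the config file to mpirun, roughly like this (the file name here is illustrative):
# Launch all sections of the config file as one MPI job
mpirun -configfile workers.config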
Hi @JeffS
A couple of questions here:
1) Do you link with -qmkl=parallel and -qopenmp?
2) Do you have any OpenMP in your code?
3) Can you provide the output of mpirun -genv I_MPI_DEBUG=10 "your config file and other arguments here"?
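For example, one way to capture that output (a sketch; the config file name is illustrative):
# At debug level 10 Intel MPI prints the rank-to-core pinning table at startup
export I_MPI_DEBUG=10
mpirun -configfile workers.config > debug.log 2>&1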
Best
Tobias
- -qmkl=parallel: yes. -qopenmp: no.
- I don't have any OpenMP in my code. The only "parallel" portions are MKL calls.
- Unfortunately I cannot provide this. What would I be looking for here?
@JeffS
If you cannot provide the debug output, I cannot help you.
You can check with MKL_VERBOSE=1 whether the MKL functions you are using are really threaded.
You can also check where the threads are pinned with KMP_AFFINITY=verbose and OMP_DISPLAY_ENV=true.
(-qmkl=parallel should use the OpenMP threading layer.)
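A quick single-rank check might look like this (a sketch; the executable name is carried over from your config file):
# MKL_VERBOSE=1 prints one line per MKL call, including the number of threads used
export MKL_VERBOSE=1
# KMP_AFFINITY=verbose makes the Intel OpenMP runtime report where each thread is bound
export KMP_AFFINITY=verbose
# OMP_DISPLAY_ENV=true dumps the effective OpenMP settings at startup
export OMP_DISPLAY_ENV=true
mpirun -n 1 -env MKL_NUM_THREADS 2 ./someExecutable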
The MKL calls are definitely parallel and obey the MKL_NUM_THREADS environment variable. I've used them plenty with this code.
I am capturing thread affinity in my code and outputting it. Some workers get 2 cores and some get 3, since the number of cores is not evenly divisible by the number of workers. Is there an easy way to set the number of cores given to each worker?
Maybe I need to pin each group of workers in my config file so it is constrained to a single socket?
How do I accomplish the below? (One untested masklist idea I'm considering is sketched after the list.)
- Total resources: 256 GB RAM, 2 CPUs with 18 cores each
- Resources per socket: 128 GB and 18 cores
- For the jobs running on a socket, I want to make sure the memory stays under the limit and does not overflow into the other socket's memory
- Ideal setup (on each socket):
  - 5 jobs using 2 cores
  - 7 jobs using a single core
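The idea (untested): spell the layout out as an explicit I_MPI_PIN_DOMAIN masklist, one hexadecimal core mask per rank in config-file order (master first, then the socket-0 workers, then the socket-1 workers). This assumes cores 0-17 are socket 0 and cores 18-35 are socket 1; lscpu would confirm the actual numbering.
# Rank 0 (master) on core 0, then socket 0: 5 two-core + 7 one-core domains
S0=0x1,0x6,0x18,0x60,0x180,0x600,0x800,0x1000,0x2000,0x4000,0x8000,0x10000,0x20000
# Socket 1 (cores 18-35): 5 two-core + 7 one-core domains (core 35 stays idle)
S1=0xC0000,0x300000,0xC00000,0x3000000,0xC000000,0x10000000,0x20000000,0x40000000,0x80000000,0x100000000,0x200000000,0x400000000
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN="[$S0,$S1]"
Note this only pins CPUs; memory should stay local through first-touch as long as each rank allocates its own data.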
Again, without the debug output (+ lscpu) I am not able to help you.
You may try to define a pinning mask using the pinning simulator:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library-pinning-simulator.html
and use the "Masklist Editing Mode" after defining your node configuration in step 2.
Interesting tool. I hadn't seen that before.
In my original setup I was only getting 1 core per worker, which is why it ran slowly.
The below accomplishes what I want. It may not be the most explicit or best way to get it done, but the workers with 2 threads do run twice as fast.
Set the pinning domain to numa:
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=numa
# Master
-n 1 ./someExecutable
# Workers
-n 5 -env MKL_NUM_THREADS 2 ./someExecutable
-n 7 -env MKL_NUM_THREADS 1 ./someExecutable
-n 5 -env MKL_NUM_THREADS 2 ./someExecutable
-n 7 -env MKL_NUM_THREADS 1 ./someExecutable
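Put together, the full launch is roughly (the config file name is illustrative):
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=numa
mpirun -configfile workers.config
With numa domains each rank is confined to one NUMA node, so its MKL threads can float across that socket's cores but never cross sockets.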