Dear mic forum,
What I'd like to do is to split my 60-core coprocessor into 4 domains, pin one MPI process to each domain, and in each process let 60 threads be bound to 60 logical cores.
I was reading this post on process pinning and thread affinity, and had a feeling that the "masklist" option could help achieve this. What confused me is the description of masklist, which states:
Each mi number defines one separate domain. The following rule is used: the ith logical processor is included into the domain if the corresponding mi value is set to 1. All remaining processors are put into a separate domain. BIOS numbering is used
I don't quite understand what this means. What is BIOS numbering? Could someone give a concrete example of a masklist with a detailed explanation? The example given in that post, =[0001E,001E0,01E00,1E000], still looks confusing to me. What is the best way to achieve what I want?
Thanks a lot for your time!
One way of accomplishing this is to set a different value of MIC_KMP_PLACE_THREADS in the environment of each rank:
MIC_ENV_PREFIX=MIC
MIC_KMP_PLACE_THREADS=15c,4t,0o     (rank 0)
MIC_KMP_PLACE_THREADS=15c,4t,15o    (rank 1)
MIC_KMP_PLACE_THREADS=15c,4t,30o    (rank 2)
MIC_KMP_PLACE_THREADS=15c,4t,45o    (rank 3)
That is, assign each rank 15 cores with 4 threads per core, each at a different core offset, so that the assignments do not overlap.
This automatically sets OMP_NUM_THREADS=60 for each rank.
As MPI usually increases the workload significantly on the core running MPSS, you may find that this core doesn't have enough resources left to perform its share of the user work, so it may work better if you assign only 56 or 59 cores.
You will probably want to set a value for OMP_PROC_BIND as well.
With this scheme, you can set 2 or 3 threads per core in case that suits your workload better, and still see the work spread across the cores without MPI processes contending for the same cores.
Each rank has to be listed separately on the mpirun command line so that this one difference (the offset) can be set in its environment.
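The per-rank listing above can be sketched with Intel MPI's colon-separated MPMD syntax. This is a hedged sketch, not a tested invocation: the host name mic0 and binary name ./a.out.mic are illustrative, and the command string is built in a loop so the offsets stay consistent.

```shell
#!/bin/sh
# Sketch: build one mpirun command with a separate -env per rank, so each
# rank gets its own KMP_PLACE_THREADS offset (15 cores apart).
CMD="mpirun -genv MIC_ENV_PREFIX MIC"
SEP=""
for rank in 0 1 2 3; do
  offset=$((rank * 15))   # 15 cores per rank: offsets 0, 15, 30, 45
  CMD="$CMD$SEP -env MIC_KMP_PLACE_THREADS 15c,4t,${offset}o -host mic0 -n 1 ./a.out.mic"
  SEP=" :"                # colon separates the per-rank command sections
done
echo "$CMD"
```

The colon syntax keeps everything in one mpirun call while still letting each rank carry the one environment difference (its offset).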
The Jeffers, Reinders book gives an example from before KMP_PLACE_THREADS was made available.
On Intel MPI 4.1.1, starting the job from the host, the following works for me:
export I_MPI_MIC=1
export I_MPI_PIN_MODE=pm # let hydra process manager generate appropriate pinning domain masks
mpirun -np 4 -env KMP_PLACE_THREADS 15C,4T -host mic0 ./a.out.mic
As Tim mentioned, you could get better results if you reserve the mic's core 0 for system tasks, especially when using tcp or dapl/scif0 for heavy MPI communication. Reducing the number of threads placed on each core from 4 to 3 or 2 can usually help as well.
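A hedged variant of the command above along those lines: giving each of the 4 ranks 14 cores instead of 15 uses 56 of the 60 cores, leaving 4 cores free to absorb the MPSS/system load. Whether the free cores actually include the one serving MPSS depends on the masks hydra generates, so treat this as a sketch to measure, not a guarantee.

```shell
# Sketch: 4 ranks x 14 cores = 56 cores of user work; ./a.out.mic is
# carried over from the example above and is illustrative.
export I_MPI_MIC=1
export I_MPI_PIN_MODE=pm    # hydra generates the pinning domain masks
mpirun -np 4 -env KMP_PLACE_THREADS 14C,4T -host mic0 ./a.out.mic
```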
You can also take a look at this article. It is similar to the one you pointed out but has more examples with illustrations.
You can also use
I_MPI_PIN_DOMAIN=60:compact
This gives you a domain size of 60, i.e. 60 logical cores in each domain, with the logical cores within each domain located as close together as possible. You can also use the other layout options described in that document.
In my experience, it is helpful to set KMP_PLACE_THREADS and KMP_AFFINITY in addition to I_MPI_PIN_DOMAIN. You can read more about KMP_PLACE_THREADS at http://software.intel.com/en-us/articles/openmp-thread-affinity-control.
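Putting those variables together might look like the following. This is a hedged sketch combining the settings mentioned in this thread; the exact values (compact affinity, host mic0, binary ./a.out.mic) are illustrative, not prescriptive.

```shell
# Sketch: domain pinning via Intel MPI plus OpenMP placement within each domain.
export I_MPI_MIC=1
export I_MPI_PIN_DOMAIN=60:compact     # 4 domains of 60 logical cores each
export MIC_ENV_PREFIX=MIC
export MIC_KMP_PLACE_THREADS=15c,4t    # 15 cores x 4 threads for each rank
export MIC_KMP_AFFINITY=compact        # bind threads tightly within the domain
mpirun -np 4 -host mic0 ./a.out.mic
```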
I hope this is what you were looking for.
The affinity you describe is what already happens by default.