Hello All,
When I run a pure OpenMP example on MIC, I find KMP_PLACE_THREADS very useful (for example, I can benchmark using 8 cores with 3 threads on every core via KMP_PLACE_THREADS=8c,3t,0O).
What is the MPI equivalent of this? (I am running a pure MPI application natively on MIC with Intel MPI.)
In the documentation I see I_MPI_PIN_PROCESSOR_LIST, where I can provide a list of specific processors. Is there any other way?
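For concreteness, here is what the pure OpenMP run looks like (the binary name is just a placeholder):
export KMP_PLACE_THREADS=8c,3t,0O   # 8 cores, 3 threads per core, core offset 0
export OMP_NUM_THREADS=24           # 8 cores x 3 threads
./openmp_benchmark.mic              # placeholder for the native MIC binary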
Thanks.
Thanks, Tim. I was doing some performance analysis of our application and was interested in how time to solution changes as the core count increases from 1 to 60 (and hence without an even distribution of ranks). One of our applications is bandwidth limited, so I wanted to know how many cores saturate the bandwidth.
Currently I am using I_MPI_PIN_PROCESSOR_LIST, where I specify the list of cores to which I exclusively pin the MPI processes.
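For concreteness, the sweep looks roughly like this (the benchmark name is a placeholder; allcores:map=scatter is one documented way to give each rank its own physical core instead of spelling out a processor list by hand):
for N in 1 2 4 8 15 30 60; do
    mpirun -n $N -env I_MPI_PIN_PROCESSOR_LIST allcores:map=scatter ./bw_bench.mic   # one rank per physical core
done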
You might find the micsmc GUI core-utilization view useful for checking how your ranks are pinned, as well as consulting the Intel MPI reference manual (PDF). Note that I_MPI_PIN_PROCESSOR_LIST=allcores should be the behavior by default, placing each rank on a separate physical core. Setting the list to 1-24 would apparently crowd 24 ranks onto 6 cores (4 hardware threads per core), like KMP_PLACE_THREADS=6c,4t.
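As a quick check, Intel MPI itself can also report where ranks land: I_MPI_DEBUG=4 is a documented debug level that prints the process pinning information at startup (the binary name below is a placeholder):
mpirun -n 8 -env I_MPI_DEBUG 4 ./app.mic   # startup output includes the rank-to-CPU pinning map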
In MPI_THREAD_FUNNELED mode you can optimize over the number of threads and the number of processes. Typically 2 or 3 threads per core are needed to take advantage of the VPU's round-robin instruction issue among threads. On an application big enough to require the maximum stack setting, I found an optimum at 6 ranks of 30 threads each, with KMP_AFFINITY=balanced set to spread the threads evenly across the cores assigned to each rank.
Smaller applications may not show as critical an optimum, but more than 1 rank per available core (not counting the core that is busy with MPSS and MPI overhead) is likely to be slow.
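As a sketch, that 6-rank-by-30-thread layout could be launched like this, assuming a 61-core card with one core left to the OS (the binary name is a placeholder):
export OMP_NUM_THREADS=30
export KMP_AFFINITY=balanced   # spread each rank's 30 threads evenly over its cores
mpirun -n 6 ./hybrid_app.mic   # 180 threads total, i.e. 3 threads per core on 60 compute cores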
You can also use the I_MPI_PIN_DOMAIN=omp setting to control process pinning; the general form is I_MPI_PIN_DOMAIN=<size>[:<layout>].
For example,
export OMP_NUM_THREADS=4     # 4 OpenMP threads per MPI process
export I_MPI_PIN_DOMAIN=omp  # domain size follows OMP_NUM_THREADS
Here I_MPI_PIN_DOMAIN splits the logical processors into non-overlapping subsets (domains), with a mapping rule of 1 MPI process per domain. You can then pin the OpenMP threads inside each domain with KMP_AFFINITY. Note that if the I_MPI_PIN_DOMAIN environment variable is defined, the I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
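As a sketch of the <size>[:<layout>] form (the binary name is a placeholder): explicit domains of 16 logical processors, laid out consecutively, with each rank's threads pinned inside its own domain:
export OMP_NUM_THREADS=16            # fill each domain with threads
export I_MPI_PIN_DOMAIN=16:compact   # domains of 16 logical processors, packed consecutively
export KMP_AFFINITY=compact          # pin threads close together within each domain
mpirun -n 4 ./hybrid_app.mic         # one MPI rank per domain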
Please see the following links for more details:
https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/win/Reference_Manual/Interoperability_with_OpenMP.htm
https://software.intel.com/en-us/articles/mpi-and-process-pinning-on-xeon-phi
