When I run a pure OpenMP example on the MIC, I find KMP_PLACE_THREADS very useful (for example, I can benchmark with 8 cores and 3 threads per core using "KMP_PLACE_THREADS=8c,3t,0O").
What is the MPI equivalent of this? (I am running a pure MPI application natively on the MIC with Intel MPI.)
In the documentation I see I_MPI_PIN_PROCESSOR_LIST, where I can provide a list of specific processors. Is there any other way?
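For illustration, an I_MPI_PIN_PROCESSOR_LIST invocation along these lines might look like the following sketch. The host name (mic0) and binary name (./app.mic) are placeholders, and the CPU numbers assume the usual KNC numbering in which physical core c owns logical CPUs 4c+1 through 4c+4:

```shell
# Sketch, not a tested recipe: pin 8 MPI ranks, one per physical core,
# to the first hardware thread of each of the first 8 cores.
export I_MPI_PIN_PROCESSOR_LIST=1,5,9,13,17,21,25,29
mpirun -host mic0 -n 8 ./app.mic   # placeholder host and binary
```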
Thanks Tim. I was doing some performance analysis of our application and was interested in how time to solution changes as the core count increases from 1 to 60 (and hence without an even distribution of ranks). (One of our applications is bandwidth limited, so I wanted to know how many cores saturate the bandwidth.)
Currently I am using I_MPI_PIN_PROCESSOR_LIST, where I specify the list of cores to which the MPI processes are exclusively pinned.
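A core-count sweep like the one described (1 to 60 cores, one rank per core) could be scripted roughly as below. The host and binary names are placeholders, and the CPU-number arithmetic again assumes the common KNC mapping of core c to logical CPUs 4c+1..4c+4:

```shell
# Sketch: strong-scaling sweep with one MPI rank per physical core.
for n in $(seq 1 60); do
  # First hardware thread of cores 0..n-1: 1,5,9,...
  list=$(seq -s, 1 4 $((4 * n - 3)))
  I_MPI_PIN_PROCESSOR_LIST=$list mpirun -host mic0 -n "$n" ./app.mic
done
```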
You might find the micsmc GUI core-utilization view useful for checking how your ranks are pinned, as well as consulting the Intel MPI reference PDF. Note that I_MPI_PIN_PROCS=allcores should happen by default, placing each rank on a separate core. Setting 1-24 would apparently crowd 24 ranks onto 6 cores, like KMP_PLACE_THREADS=6c,4t.
In MPI_THREAD_FUNNELED mode you can optimize over the number of threads and the number of processes. Typically 2 or 3 threads per core are needed to take advantage of VPU rotation among threads. On an application big enough to require the maximum stack setting, I found an optimum at 6 ranks of 30 threads each, setting KMP_AFFINITY=balanced to spread the threads evenly across the cores assigned to each rank.
Smaller applications may not show as critical an optimum, but more than 1 rank per available core (not counting the core that is busy with MPSS and MPI overhead) is likely to be slow.
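As a sketch of the hybrid layout described above (6 ranks of 30 threads each, leaving headroom for the core occupied by MPSS and MPI overhead; the host and binary names are placeholders):

```shell
export I_MPI_PIN_DOMAIN=omp     # one domain per rank, sized from OMP_NUM_THREADS
export OMP_NUM_THREADS=30
export KMP_AFFINITY=balanced    # spread threads evenly over each rank's cores
mpirun -host mic0 -n 6 ./app.mic
```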
You can also use the I_MPI_PIN_DOMAIN=omp setting to control process pinning (the general form is I_MPI_PIN_DOMAIN=&lt;size&gt;[:&lt;layout&gt;]). I_MPI_PIN_DOMAIN splits the logical processors into non-overlapping subsets, with a mapping rule of 1 MPI process per domain. You can then pin the OpenMP threads inside each domain with KMP_AFFINITY. If the I_MPI_PIN_DOMAIN environment variable is defined, the I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
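For instance, a sketch using an explicit domain size and layout (the size, layout, rank count, host, and binary below are illustrative choices, not recommendations):

```shell
# Split the logical processors into non-overlapping 16-CPU domains,
# filled compactly; each MPI rank gets one domain, and KMP_AFFINITY
# then places the OpenMP threads inside that domain.
export I_MPI_PIN_DOMAIN=16:compact
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact
mpirun -host mic0 -n 4 ./app.mic
```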
Please see the Intel MPI Library reference manual for more details.