Intel® MPI Library

MPI equivalent of KMP_PLACE_THREADS on MIC

Pramod_K_
Beginner

Hello All,

When I run a pure OpenMP example on MIC, I find KMP_PLACE_THREADS very useful (for example, I can benchmark using 8 cores with 3 threads on each core via KMP_PLACE_THREADS=8c,3t,0O).
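
For reference, a minimal sketch of such a run (the binary name is just a placeholder):

export KMP_PLACE_THREADS=8c,3t,0O   # 8 cores, 3 threads per core, offset 0 cores
export OMP_NUM_THREADS=24           # 8 cores x 3 threads
./omp_app.mic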

What is the MPI equivalent of this? (I am running a pure MPI application natively on MIC with Intel MPI.)

In the documentation I see I_MPI_PIN_PROCESSOR_LIST, where I can provide a list of specific processors. Is there any other way?

Thanks.

TimP
Honored Contributor III
The default settings of Intel MPI will pin ranks evenly across MIC cores. If this isn't satisfactory for your situation, you would need to explain further. You could look up how to pin 24 ranks across 8 cores, but that seems an unlikely tactic.
Pramod_K_
Beginner

Thanks, Tim. I was doing some performance analysis of our application and was interested in how time to solution changes as the core count increases from 1 to 60 (and hence without an even distribution of ranks). One of our applications is bandwidth-limited, so I wanted to know how many cores it takes to saturate the bandwidth.

Currently I am using I_MPI_PIN_PROCESSOR_LIST, where I specify the list of cores to which I exclusively pin the MPI processes.
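
For illustration, a sketch of such a run with 8 ranks on 8 distinct cores (the binary name and core numbers are placeholders, assuming the usual MIC numbering where logical processors 1-4 map to the first physical core, 5-8 to the second, and so on):

export I_MPI_PIN_PROCESSOR_LIST=1,5,9,13,17,21,25,29   # one logical CPU per physical core
mpirun -n 8 ./mpi_app.mic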

TimP
Honored Contributor III

You might find the micsmc GUI core-utilization view useful to check how your ranks are pinned, as well as consulting the Intel MPI reference manual (PDF). Note that I_MPI_PIN_PROCESSOR_LIST=allcores should happen by default, placing each rank on a separate physical core. Setting the list to 1-24 apparently would crowd 24 ranks into 6 cores (4 hardware threads each), like KMP_PLACE_THREADS=6c,4t.
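
Another way to check the pinning, without the GUI, is Intel MPI's debug output, which prints the rank-to-CPU pinning map in the startup banner (the binary name is a placeholder):

mpirun -genv I_MPI_DEBUG 4 -n 8 ./mpi_app.mic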

In MPI_THREAD_FUNNELED mode you can optimize over the number of threads and the number of processes. Typically 2 or 3 threads per core are needed to take advantage of VPU rotation among threads. On an application big enough to require the maximum stack setting, I found an optimum at 6 ranks of 30 threads each, setting KMP_AFFINITY=balanced to spread the threads evenly across the cores assigned to each rank.
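
For example, a sketch of that kind of hybrid run (numbers from the case above; the binary name is a placeholder, and I_MPI_PIN_DOMAIN=omp gives each rank a domain sized to OMP_NUM_THREADS):

export OMP_NUM_THREADS=30
export KMP_AFFINITY=balanced        # spread each rank's threads across its cores
export I_MPI_PIN_DOMAIN=omp         # one 30-wide domain per rank
mpirun -n 6 ./hybrid_app.mic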

Smaller applications may not show as critical an optimum, but more than 1 rank per available core (not counting the core that is busy with MPSS and MPI overhead) is likely to be slow.

Mark_L_Intel
Moderator

You can also use the I_MPI_PIN_DOMAIN=omp setting to control process pinning (the general form is I_MPI_PIN_DOMAIN=<size>[:<layout>]).

For example,

export OMP_NUM_THREADS=4

export I_MPI_PIN_DOMAIN=omp

Here I_MPI_PIN_DOMAIN splits the logical processors into non-overlapping subsets (domains), with a mapping rule of one MPI process per domain; with the omp value, the domain size equals OMP_NUM_THREADS. You could then pin the OpenMP threads inside each domain with KMP_AFFINITY. If the I_MPI_PIN_DOMAIN environment variable is defined, the I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
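
For example, on a 60-core MIC with 4 hardware threads per core (240 logical processors), the following sketch creates 60 four-wide domains, one per physical core, and runs one rank with 4 OpenMP threads in each (the binary name is a placeholder):

export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp         # domain size = OMP_NUM_THREADS
export KMP_AFFINITY=compact         # pin the 4 threads within each domain
mpirun -n 60 ./hybrid_app.mic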

Please see the following links for more details:

https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/win/Reference_Manual/Interoperability_with_OpenMP.htm

https://software.intel.com/en-us/articles/mpi-and-process-pinning-on-xeon-phi
