Dear mic forum,
What I'd like to do is to split my 60-core coprocessor into 4 domains, pin one MPI process to each domain, and in each process let 60 threads be bound to 60 logical cores.
I was reading this post on process pinning and thread affinity, and had a feeling that the "masklist" option could help achieve this. What confused me is the description of masklist, which states:
Each mi number defines one separate domain. The following rule is used: the ith logical processor is included into the domain if the corresponding mi value is set to 1. All remaining processors are put into a separate domain. BIOS numbering is used
I don't quite understand what this means. What is BIOS numbering? Could someone give a concrete example of a masklist with a detailed explanation? The example given in that post, =[0001E,001E0,01E00,1E000], still looks confusing to me. What is the best way to achieve what I want?
Thanks a lot for your time!
One way of accomplishing this is to set a different value of MIC_KMP_PLACE_THREADS in the environment of each rank:
MIC_ENV_PREFIX=MIC
MIC_KMP_PLACE_THREADS=15c,4t,0o     (rank 0)
MIC_KMP_PLACE_THREADS=15c,4t,15o    (rank 1)
MIC_KMP_PLACE_THREADS=15c,4t,30o    (rank 2)
MIC_KMP_PLACE_THREADS=15c,4t,45o    (rank 3)
That is, assign each rank 15 cores with 4 threads per core, each at a different core offset, so that the assignments do not overlap.
This automatically sets OMP_NUM_THREADS=60 for each rank.
As MPI usually increases the workload significantly on the core running MPSS, you may find that this core doesn't have enough resources left to perform its share of the user work, so it may work better if you assign only 56 or 59 cores.
You will probably want to set a value for OMP_PROC_BIND as well.
With this scheme, you can set 2 or 3 threads per core in case that suits your workload better, and still see the work spread across the cores without MPI processes contending for the same cores.
Each rank has to be listed separately on the mpirun command line so that this one difference (the offset) can be set in its environment.
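The per-rank listing above can be sketched with Intel MPI's colon-separated MPMD syntax. This is a hedged sketch, not a tested invocation: the host name mic0 and binary name ./a.out.mic are illustrative, and the command string is built in a loop so the offsets stay consistent.

```shell
#!/bin/sh
# Sketch: build one mpirun command with a separate -env per rank, so each
# rank gets its own KMP_PLACE_THREADS offset (15 cores apart).
CMD="mpirun -genv MIC_ENV_PREFIX MIC"
SEP=""
for rank in 0 1 2 3; do
  offset=$((rank * 15))   # 15 cores per rank: offsets 0, 15, 30, 45
  CMD="$CMD$SEP -env MIC_KMP_PLACE_THREADS 15c,4t,${offset}o -host mic0 -n 1 ./a.out.mic"
  SEP=" :"                # colon separates the per-rank command sections
done
echo "$CMD"
```

The colon syntax keeps everything in one mpirun call while still letting each rank carry the one environment difference (its offset).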
The Jeffers, Reinders book gives an example from before KMP_PLACE_THREADS was made available.
On Intel MPI 4.1.1, starting the job from the host, the following works for me:
export I_MPI_MIC=1
export I_MPI_PIN_MODE=pm # let hydra process manager generate appropriate pinning domain masks
mpirun -np 4 -env KMP_PLACE_THREADS 15C,4T -host mic0 ./a.out.mic
As Tim mentioned, you could get better results if you reserve the mic's core 0 for system tasks, especially when using tcp or dapl/scif0 for heavy MPI communication. Reducing the number of threads placed on each core from 4 to 3 or 2 can usually help as well.
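A hedged variant of the command above along those lines: giving each of the 4 ranks 14 cores instead of 15 uses 56 of the 60 cores, leaving 4 cores free to absorb the MPSS/system load. Whether the free cores actually include the one serving MPSS depends on the masks hydra generates, so treat this as a sketch to measure, not a guarantee.

```shell
# Sketch: 4 ranks x 14 cores = 56 cores of user work; ./a.out.mic is
# carried over from the example above and is illustrative.
export I_MPI_MIC=1
export I_MPI_PIN_MODE=pm    # hydra generates the pinning domain masks
mpirun -np 4 -env KMP_PLACE_THREADS 14C,4T -host mic0 ./a.out.mic
```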
You can also take a look at this article. It is similar to the one you pointed out but has more examples with illustrations.
You can also use
I_MPI_PIN_DOMAIN=60:compact
This gives you a domain size of 60, i.e. 60 logical cores in each domain, with the logical cores within each domain located as close together as possible. You can also use the other layout options described in that document.
In my experience, it is helpful to set KMP_PLACE_THREADS and KMP_AFFINITY in addition to I_MPI_PIN_DOMAIN. You can read more about KMP_PLACE_THREADS at http://software.intel.com/en-us/articles/openmp-thread-affinity-control.
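Putting those variables together might look like the following. This is a hedged sketch combining the settings mentioned in this thread; the exact values (compact affinity, host mic0, binary ./a.out.mic) are illustrative, not prescriptive.

```shell
# Sketch: domain pinning via Intel MPI plus OpenMP placement within each domain.
export I_MPI_MIC=1
export I_MPI_PIN_DOMAIN=60:compact     # 4 domains of 60 logical cores each
export MIC_ENV_PREFIX=MIC
export MIC_KMP_PLACE_THREADS=15c,4t    # 15 cores x 4 threads for each rank
export MIC_KMP_AFFINITY=compact        # bind threads tightly within the domain
mpirun -np 4 -host mic0 ./a.out.mic
```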
I hope this is what you were looking for.
The affinity you describe is what already happens by default.