Xeon Phi - Balanced vs Scatter - 64 or Less Threads

CPati2 · 02-07-2018

Hi All,

If the number of threads to use is 32, or any number less than or equal to 64, how do the balanced and scatter thread affinities differ from each other?

As per my analysis, these two will use the same number of cores (1 thread per core). Or will balanced use 2 threads per core to keep sequential threads together, leading to only 16 cores being used?

Thanks.

McCalpinJohn · 02-07-2018

If I recall correctly, Balanced and Scatter should be the same if you are only using one thread per core.

If you are using more than one thread per core, Balanced and Scatter will have different layouts. For example, on a 64-core part using 128 threads, Balanced will put threads 0&1 on core 0, 2&3 on core 1, etc, while Scatter will put threads 0&64 on core 0, 1&65 on core 1, etc.
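
To make the 128-thread example concrete, here is a small Python sketch of the two layouts on a 64-core part. It is only a simplified model of the placements described above for the evenly divisible case, not the Intel OpenMP runtime's actual algorithm, and the function names are made up for illustration.

```python
def balanced_core(thread_id, num_threads, num_cores):
    """Core index for a thread under a simplified 'balanced' layout.

    Assumes num_threads is an exact multiple of num_cores, so that
    consecutive thread ids share a core.
    """
    threads_per_core = num_threads // num_cores
    return thread_id // threads_per_core


def scatter_core(thread_id, num_threads, num_cores):
    """Core index for a thread under a simplified 'scatter' layout.

    Threads are dealt round-robin across cores, so thread ids that differ
    by num_cores share a core. num_threads is unused and kept only so that
    both functions have the same signature.
    """
    return thread_id % num_cores


if __name__ == "__main__":
    # Columns: thread id, core under balanced, core under scatter.
    for tid in (0, 1, 2, 3, 64, 65):
        print(tid, balanced_core(tid, 128, 64), scatter_core(tid, 128, 64))
```

With 64 threads on 64 cores both functions return the same core for every thread, which matches the one-thread-per-core case.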

Neither of these schemes is easy to understand if the number of threads is not evenly divisible by the number of cores. In such cases I find it much easier to use KMP_HW_SUBSET to force the allocation to be inside the requested subset of cores/threads. On our 68-core Xeon Phi 7250, if I wanted to use 64 cores with different numbers of threads, I would use:

1 thread/core: KMP_HW_SUBSET=64c,1t OMP_NUM_THREADS=64 KMP_AFFINITY=compact
2 threads/core: KMP_HW_SUBSET=64c,2t OMP_NUM_THREADS=128 KMP_AFFINITY=compact
4 threads/core: KMP_HW_SUBSET=64c,3t OMP_NUM_THREADS=256 KMP_AFFINITY=compact

These three schemes emulate what "Balanced" would do if it were run on a 64-core system.

You should always add the "verbose" clause to KMP_AFFINITY to verify that the system did what you wanted....

James_C_Intel2 · 02-08-2018

As ever, John is on the money. It is much easier to use KMP_HW_SUBSET to limit the available resources and then play with "compact" or "scatter" affinity than to try to achieve good balance with KMP_AFFINITY=balanced.

One thing which John is doing, which I would not (and which has introduced a bug in his text above :-)), is that he is using OMP_NUM_THREADS as well as KMP_HW_SUBSET. I find it better not to use OMP_NUM_THREADS, since that gives you the opportunity to have a mismatch between the number of HW threads allocated and the number of software threads created. If you leave out OMP_NUM_THREADS and just use KMP_HW_SUBSET, the library's default behaviour of running one thread on each available logical CPU will kick in, and you can't make a mistake like that in John's third line:

4 threads/core: KMP_HW_SUBSET=64c,3t OMP_NUM_THREADS=256 KMP_AFFINITY=compact

where the 3t was intended to be 4t, and he's running 256 threads on 192 logical CPUs...
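
The arithmetic behind that mismatch (64 cores x 3 threads = 192 logical CPUs, against 256 software threads) can be checked with a short Python sketch. The helper names are hypothetical, and the parser only handles the simple "<cores>c,<threads>t" form used in this thread; the real KMP_HW_SUBSET syntax accepts more than this.

```python
def logical_cpus(hw_subset):
    """Number of logical CPUs selected by a string such as '64c,3t'."""
    cores = threads = None
    for field in hw_subset.split(","):
        field = field.strip()
        if field.endswith("c"):
            cores = int(field[:-1])
        elif field.endswith("t"):
            threads = int(field[:-1])
    if cores is None or threads is None:
        raise ValueError(f"expected '<cores>c,<threads>t', got {hw_subset!r}")
    return cores * threads


def oversubscribed(hw_subset, omp_num_threads):
    """True if more software threads are requested than logical CPUs allowed."""
    return omp_num_threads > logical_cpus(hw_subset)


if __name__ == "__main__":
    # The line with the typo: 256 threads on 64 * 3 logical CPUs.
    print(logical_cpus("64c,3t"), oversubscribed("64c,3t", 256))
    # The intended setting: 256 threads on 64 * 4 logical CPUs.
    print(logical_cpus("64c,4t"), oversubscribed("64c,4t", 256))
```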

So I'd just use

1 thread/core: KMP_HW_SUBSET=64c,1t KMP_AFFINITY=compact
2 threads/core: KMP_HW_SUBSET=64c,2t KMP_AFFINITY=compact
4 threads/core: KMP_HW_SUBSET=64c,4t KMP_AFFINITY=compact

and then also try KMP_AFFINITY=scatter.

This then makes it easier to experiment with scaling, simply by changing the number of cores you ask for (as described in "How to Plot OpenMP Scaling Results").
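
As a rough illustration of such a scaling experiment, the Python sketch below builds the environment settings for a sweep over core counts. The function name and the list of core counts are assumptions made for the example; "verbose" is included in KMP_AFFINITY so the runtime reports the placement it actually chose, and OMP_NUM_THREADS is deliberately left unset.

```python
def scaling_sweep(core_counts, threads_per_core=1, affinity="compact"):
    """Build one environment-variable dict per core count.

    OMP_NUM_THREADS is not set, so the runtime's default of one thread
    per available logical CPU applies.
    """
    return [
        {
            "KMP_HW_SUBSET": f"{cores}c,{threads_per_core}t",
            "KMP_AFFINITY": f"verbose,{affinity}",
        }
        for cores in core_counts
    ]


if __name__ == "__main__":
    # Print each configuration as a prefix for the command to be run.
    for env in scaling_sweep([1, 2, 4, 8, 16, 32, 64]):
        print(" ".join(f"{name}={value}" for name, value in env.items()))
```

Each dict could be merged into a copy of os.environ and passed to subprocess.run to launch the benchmark with that configuration.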

McCalpinJohn · 02-08-2018

Hurray for sharp eyes! I knew I was going to make a mistake with those numbers, and I was right!

I think one of the OMP placement directives typically gives the effect of "balanced", but the standard allows named distributions to have implementation-defined behavior, so I don't use them. Numbers work -- at least when you get them right. ;-)

CPati2 · 02-15-2018

Hi John,

I am getting a bit confused by the environment variables. As I understand it, to run 128 threads as Scatter, Compact and Balanced, I need the following environment variables:

1) Scatter: export OMP_NUM_THREADS=128 export KMP_AFFINITY=scatter,granularity=fine
2) Compact: export OMP_NUM_THREADS=128 export KMP_AFFINITY=compact,granularity=fine
3) Balanced: export OMP_NUM_THREADS=128 export KMP_AFFINITY=balanced,granularity=fine

After reading the last paragraph in the documentation here https://software.intel.com/en-us/node/522518, I am not sure whether (3) is correct. As I understand it, this documentation refers to Intel Xeon Phi KNC, not Intel Xeon Phi KNL?

To set the balanced affinity type for only the Intel® MIC Architecture environment, assign a specific prefix using the MIC_ENV_PREFIX=prefix and then set prefix_KMP_AFFINITY with balanced.

Thanks.

James_C_Intel2 · 02-16-2018

I am explicitly not going to tell you how to use KMP_AFFINITY=balanced, because there is no reason to use it; as you are discovering, it is hard to use and confusing.

All of the interesting options are covered by the use of KNP_HW_SUBSET and KMP_AFFINITY={scatter,compact} in a way which is comprehensible and easier to get right.

McCalpinJohn · 02-16-2018

Aha! I am not the only one who can make mistakes!

s/KNP_HW_SUBSET/KMP_HW_SUBSET/

CPati2 · 02-16-2018

Hi James, John,

It is finally getting clear to me. I got confused because I thought Balanced could not be achieved without KMP_AFFINITY=balanced.

1) Scatter 128 threads: 2 threads/core: KMP_HW_SUBSET=64c,2t KMP_AFFINITY=scatter
2) Balanced 128 threads: 2 threads/core: KMP_HW_SUBSET=64c,2t KMP_AFFINITY=compact

Why is there even a KMP_AFFINITY=balanced option? It can be really confusing for a new user. I expected it to do what (2) does above, but that doesn't seem to be the case.

Thank you for clarifying this.