Software Archive
Read-only legacy content

Odd behaviour regarding execution time vs number of threads

Pierre_T_1
Beginner
403 Views

Hello,
 
While porting an image processing library to the Xeon Phi, I stumbled upon a strange behaviour: the processing is about 20% faster when I set the number of threads to exactly 103 (I ran the processing multiple times with every thread count between 95 and 118).
 
I tried to make sense of this by comparing VTune collections (advanced-hotspots, memory bandwidth and general exploration) of a test case running on 102, 103 and 104 threads. Each of these analyses yielded similar results (except for the runtime, which was still 20% lower with 103 threads), and I wasn't able to identify what caused this unexpected speedup.
 
My questions are: does this sort of behaviour ring a bell for any of you? Do you have any pointers on the possible origin of this effect?
 
Some details about the library: it uses manual offload and the MKL DFTI functions alongside computational loops parallelized with OpenMP (each of these two parts accounts for about half the computation time, and both are equally affected by the 20% speedup).
 
Regards
Pierre T.
10 Replies
TimP
Honored Contributor III

Did you watch core usage with micsmc GUI? A performance critical thread may get a core to itself for the most favorable result.

jimdempseyatthecove
Honored Contributor III

My suspicion is that this is due to cache-oblivious partitioning as opposed to cache-aware tiling.

As an experimental proof, change the problem size, say by +1/3rd or +2/3rds. See if it remains optimal at 103 threads.

(This does assume you have a sufficient quantity of work to employ more than 103 threads.)

Jim Dempsey

Pierre_T_1
Beginner

Tim Prince wrote:

Did you watch core usage with micsmc GUI? A performance critical thread may get a core to itself for the most favorable result.

I just checked individual core usage with the micsmc GUI and did not notice any thread getting a core to itself. I forgot to mention that thread affinity is set to scatter.

jimdempseyatthecove wrote:

As an experimental proof, change the problem size, say by +1/3rd or +2/3rds. See if it remains optimal at 103 threads.

For information, I was originally dealing with image blocks sized 1024x1024, and due to architectural limitations the only other dimensions I was able to test are 2048x2048 and 512x512.

  • Block size 2048x2048: the optimal number of threads changed to 114 (with a smaller speedup of about 10%)
  • Block size 512x512: the optimal number of threads changed to 84 (also with a smaller speedup of about 10%)

jimdempseyatthecove wrote:

This is likely due to (suspicious of) cache oblivious partitioning as opposed to cache aware tiling.

I'm not sure I fully understand what you mean by that. Are you saying that for these specific numbers of threads, cache usage is somehow better due to the way the load is distributed? If so, wouldn't that show up in VTune's general exploration analysis?

Thanks to both of you for your answers

Pierre T.

TimP
Honored Contributor III

If adjacent threads access overlapping data regions, it's important to use KMP_PLACE_THREADS and OMP_PROC_BIND=close (or KMP_AFFINITY=compact or balanced) so that threads can share L2 cache.  On the other hand, if some thread needs the entire L2 but doesn't share with another, you would avoid placing 2 such threads on a core.

James_C_Intel2
Employee

When investigating OpenMP scaling on KNC I strongly recommend using KMP_PLACE_THREADS to control the number of cores and threads/core. You can (and should) then analyse the performance based on the number of cores used, with separate plots for 1T/C, 2T/C, 3T/C and 4T/C. 

If you don't do this, it is very hard to know what resources you are actually using.

Consider the point labelled "60 threads". Is that

  1. 60 cores each with one thread
  2. 30 cores each with two threads
  3. 20 cores each with three threads
  4. 15 cores each with four threads

The performance of these possible options is likely to be very different.

Using KMP_PLACE_THREADS (and then *not* also using OMP_NUM_THREADS, or otherwise explicitly setting the number of threads) also ensures that you have the same number of threads/core, which is normally what you want, and certainly not what you have at 103 threads...

So, I would 

  1. Use KMP_PLACE_THREADS to control number of cores and threads/core
  2. Try both KMP_AFFINITY=scatter and KMP_AFFINITY=compact
  3. Plot performance with cores (1..60) on the X axis 
  4. For tuning, use VTune's relatively new OpenMP performance analysis (which shows load-imbalance and so on) https://software.intel.com/en-us/articles/how-to-analyze-openmp-applications-using-intel-vtune-amplifie-xe-2015 
jimdempseyatthecove
Honored Contributor III

>>thread affinity is set to scatter

This is fine, however....

The Xeon Phi (KNC) in-order core, when register-to-register compute bound, requires at least two threads per core for optimal throughput. Assuming the 60-core model of KNC, your 512x512 case maxing out at 84 threads implies 84/60 ≈ 1.4 threads per core. This in turn indicates your algorithm is memory/cache latency bound. It might be beneficial to look into re-ordering your compute path to reduce latencies.

>> of a test case running on 102, 103, 104 threads. Each of these analysis yielded similar results (except for the runtime, which was still 20% faster for 103 threads)

This may indicate the compute efficiency is the same but in the slower cases you are spending an extra 20% in barrier (spin-wait) time. I haven't used it, but I think the newer version of VTune can show you the thread-by-thread spin-wait time. You can go after this time with better management of how you distribute the work.

>>Bloc size 512x512 : the optimal number of threads changed to 84 (also with a lower 10% speedup)

For this size, see what happens with:

KMP_AFFINITY=compact
KMP_PLACE_THREADS=2T

This will make better use of each core (at the expense of halving each thread's effective share of the L1/L2 caches). Note that on the host, or on the next-generation Knights Landing, this may not be (as) effective.

Jim Dempsey

 

Pierre_T_1
Beginner

Tim Prince wrote:

If adjacent threads access overlapping data regions, it's important to use KMP_PLACE_THREADS and OMP_PROC_BIND=close (or KMP_AFFINITY=compact or balanced) so that threads can share L2 cache.  On the other hand, if some thread needs the entire L2 but doesn't share with another, you would avoid placing 2 such threads on a core.

In this case, two adjacent threads should not access overlapping data. Scatter affinity was used because it yielded better performance. KMP_AFFINITY=balanced gave similar results, but KMP_AFFINITY=compact led to a runtime increase of about 60%.

For fun, I tried KMP_PLACE_THREADS=51c and, oddly enough, 103 threads was still the optimal value, even though the distribution of the threads was similar to 118 threads with KMP_PLACE_THREADS unset (which might have the same effect as KMP_PLACE_THREADS=59c).

jimdempseyatthecove
Honored Contributor III

Revision to #7

For this size, see what happens with:

KMP_AFFINITY=compact
KMP_PLACE_THREADS=2T

Even if this happens to resolve to 84 threads, compare the runtime of this 84-thread setup with the scatter 84-thread setup (don't forget to unset KMP_PLACE_THREADS for the scatter run).

Jim Dempsey

Pierre_T_1
Beginner

Pierre T. wrote:

Bloc size 512x512 : the optimal number of threads changed to 84 (also with a lower 10% speedup).

OK, I'd like to revisit this; I might have done yesterday's tests in a hurry. The optimal value seems to be 86, with a speedup of about 40% (which means the optimal value for 2048x2048 might be wrong too; I'll redo the tests if I manage to find some time).

jimdempseyatthecove wrote:

For this size, see what happens with:

KMP_AFFINITY=compact
KMP_PLACE_THREADS=2T

With these values, the optimal thread number still resolves to 86. Comparing runtimes, scatter affinity is slightly better (~5%) and compact/2T is about the same as balanced (which might be explained by the fact that these two distributions are similar, except that balanced will use all the cores, if I'm not mistaken).

James Cownie (Intel) wrote:

So, I would 

  1. Use KMP_PLACE_THREADS to control number of cores and threads/core
  2. Try both KMP_AFFINITY=scatter and KMP_AFFINITY=compact
  3. Plot performance with cores (1..60) on the X axis 
  4. For tuning, use VTune's relatively new OpenMP performance analysis (which shows load-imbalance and so on) https://software.intel.com/en-us/articles/how-to-analyze-openmp-applicat... 

Thank you, I'm going to try that and see if I can make sense of this effect.

James_C_Intel2
Employee

balanced will use all the cores

Close... the way to think about the resource utilization of OpenMP is like this:

  1. If KMP_PLACE_THREADS is set, the runtime effectively masks the incoming affinity (see the sched_{get,set}affinity system calls) down to the cores/HW threads specified by ANDing together KMP_PLACE_THREADS and the incoming affinity.
  2. The runtime then counts the number of logical CPUs in that mask, and that is the default number of threads it will create. (If you explicitly force more threads than that, you'll have over-subscription and likely poor performance.)
  3. It then uses the affinity style (scatter, compact, balanced) to assign thread numbers to the threads depending on their location in the hardware. Both compact and balanced assign thread numbers incrementing first through the threads bound to one core before moving to the next core, while scatter moves to the next core before using all the threads on a core, wrapping around if need be. Note that if the number of threads/core is the same on every core, compact and balanced give identical results (hence there's normally no point in using balanced).

So, the issue of which physical resources to use (cores/threads) is best considered as a separate issue from how you enumerate them once you've decided which they are. The affinities like scatter,compact,balanced affect the enumeration, but don't affect the resources available to be enumerated, which is controlled by the incoming affinity and KMP_PLACE_THREADS (the incoming affinity could be set by the offload mechanism, MPI, taskset, or whatever).

The confusing thing here is that they appear to affect the resources if you don't use all the resources you could (because if you have 60cx4t available, but say OMP_NUM_THREADS=4 KMP_AFFINITY=compact you'll use one core with four threads, whereas if you used KMP_AFFINITY=scatter you'll use four cores with one thread each).  That can then make scatter look better than compact for four threads, because it has four times as much real resources :-). It's because of this that I recommend (as before)

  1. Control the resources explicitly with KMP_PLACE_THREADS so you know exactly what you're using
  2. Don't use OMP_NUM_THREADS or omp_set_num_threads(), but rely on the default behaviour to use all the hardware you asked for via KMP_PLACE_THREADS 
  3. Use KMP_AFFINITY=scatter or compact (try both), but don't bother with balanced, since it's the same as compact when each core has the same number of threads on it (and not doing that makes things *very* confusing)

HTH
