Software Archive

KMP_AFFINITY, OMP proc_bind woes for MIC native execution

Matevž_T_
Beginner

Hello everybody,

I'm experimenting with a trivial OMP program (https://github.com/osschar/mtorture/blob/master/t2.cxx#L60) and I'm having trouble setting up thread affinity. I'm trying with 4 threads, as this should allow me to cover all possible core population schemes. Here are my observations:

  1. the proc_bind OMP directive is ignored (see the sketch after this list for the kind of usage I mean);
  2. KMP_AFFINITY does weird things (with or without KMP_PLACE_THREADS):
    1. KMP_AFFINITY=compact spreads threads out;
    2. KMP_AFFINITY=scatter puts them on the same core;
    3. KMP_AFFINITY=balanced causes a floating point exception (unless used with MIC_ENV_PREFIX=XXX XXX_KMP_AFFINITY=balanced - then it runs on cores 1, 2, 3, 4).
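
For reference, a stripped-down sketch of the kind of usage I mean (illustrative only, not the exact code from t2.cxx), built natively for MIC with something like icc -mmic -openmp:

// Minimal proc_bind example (OpenMP 4.0).
#include <omp.h>
#include <cstdio>

int main()
{
  // Ask for the 4 threads to be spread across cores.
  // In my tests this clause appears to have no effect on the coprocessor.
  #pragma omp parallel num_threads(4) proc_bind(spread)
  {
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}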

Below is a list of my experiments (Perf = 4 would mean perfect usage of resources; I get ~0.92 with a single thread). This was first observed with icc 14.0.0 and I now see the same with 14.0.3.

Please help! :)

Best,
Matevž

 

KMP_AFFINITY=scatter
Cores: 1 2 3 4
Perf:  1.92

KMP_AFFINITY=compact
Cores: 1 62 123 184
Perf:  3.74

KMP_PLACE_THREADS=1T KMP_AFFINITY=scatter
KMP_PLACE_THREADS=1T KMP_AFFINITY=compact
Cores: 1 65 125 185
Perf:  3.76

KMP_PLACE_THREADS=2T KMP_AFFINITY=scatter
Cores: 1 2 125 126
Perf:  3.10

KMP_PLACE_THREADS=2T KMP_AFFINITY=compact
Cores: 1 62 122 185
Perf:  3.76

 

jimdempseyatthecove
Honored Contributor III

How are you obtaining your reports?

In particular, are the "Core" numbers actually processor core numbers, or are they Linux logical processor numbers?

Linux logical processor numbers need not have any fixed association with the processor's physical cores and the hardware threads within them.

If you are using CPUID to get the APIC numbers, then you will get the core number and can get the thread number within the core.

If you use omp_get_thread_num() to get the OpenMP team member number then there is absolutely no assurance that the OpenMP team member number relates to the hardware core and logical processor number.

The sched_xxx functions that return the affinity mask express it in terms of the system's logical processor numbers. On most systems there is a one-to-one correlation between the logical processor and the "hardware thread proximity"... but there is no requirement for this to be so.

The CPUID (or CPUIDEX) fetch of the APIC/x2APIC ID is mostly correct, except possibly when running under virtualization (a hypervisor presenting virtual CPUs).
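
As an illustration of the difference, here is a small sketch (assuming Linux and a GCC-compatible <cpuid.h>; the core/thread decode assumes KNC's 4 hardware threads per core and is not general):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // for sched_getcpu()
#endif
#include <cpuid.h>
#include <sched.h>
#include <omp.h>
#include <cstdio>

int main()
{
  #pragma omp parallel
  {
    // Initial APIC ID of the hardware thread we are running on:
    // CPUID leaf 1, EBX bits 31:24.
    unsigned eax, ebx, ecx, edx;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    unsigned apic = ebx >> 24;
    printf("OMP thread %2d  OS proc %3d  APIC id %3u  (core %3u, HT %u)\n",
           omp_get_thread_num(),   // team member number - no hardware meaning
           sched_getcpu(),         // Linux logical processor we are on right now
           apic, apic / 4, apic % 4);
  }
  return 0;
}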

Additional notes.

Depending on O/S and OpenMP library function...

KMP_AFFINITY=compact sets the thread affinity mask such that a thread has permission to migrate amongst the HTs of a core. IOW, the hardware thread number and the Linux logical processor number may vary through the life of the OpenMP thread (on systems with HT capability).

Jim Dempsey

Matevž_T_
Beginner

This is all running on the MIC in native mode; the executable is started on the MIC.

I get the core numbers by looking at top. They seem to correspond to what KMP_AFFINITY=verbose produces, i.e.

OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 0 thread 2
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 3
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 1 thread 1
...

This is also consistent with performance measurements:

  1. When threads run on cores 1,2,3,4 maximum gain is ~ factor of two.
  2. When threads go off to different cores, I get proper scaling.
  3. When threads run on cores 1,2 and 125,126 the scaling is somewhat poorer (as expected, as the code is at the limit of being L2 bandwidth limited).

My main point was that KMP_AFFINITY works the other way around from what I expected, and is even broken for "balanced" in some strange way (and that proc_bind seems to be ignored - but this might be a symptom of the same problem).

jimdempseyatthecove
Honored Contributor III

>>When threads run on cores 1,2 and 125,126 the scaling is somewhat poorer (as expected, as the code is at the limit of being L2 bandwidth limited).

I believe you ought to say: When threads run on OS procs 1,2 and 125,126

And not: ... run on cores...

For Xeon Phi you typically need at least 2 threads per core.

OS procs 1,2 and 125,126 are cores 0 and 31. A Xeon Phi does not have 126 cores.

>>When threads go off to different cores, I get proper scaling

Proper scaling and best performance are not one and the same with Xeon Phi.

If you are undersubscribing threads, say to 60 or fewer, then you might conceivably want to use one thread per core, and observe "proper scaling".

As you add threads, you will note on Xeon Phi, that the slopes change, but nowhere near the change that you see on the host processor. Depending on your test program, the additional threads may cost 0 time (compute bound) or hit a wall (memory bandwidth bound).

If you intend to run with 30 threads or fewer, then you might as well run on the host processor.

Jim Dempsey

 

Matevž_T_
Beginner

Hi Jim,

Thanks for your explanations ... I agree that total usage of the card is what one should be concerned with in the end. I'm still relatively new to this, so I tried to give it a slow start to make sure I understand (almost) every step in the attempt to scale up our application. We're trying to port algorithms that search for particle tracks in high energy physics experiments, and this involves a lot of small-matrix (3x3 to 6x6) operations. We're definitely memory bound, and the name of the game for us will be to find a problem decomposition that is small enough to fit well into L1 cache yet large enough to make full use of vectorization.

Starting as an experimental physicist I also ended up as an experimental programmer ... so I often prefer trying things out to reading the documentation :)

Best,

Matevz

jimdempseyatthecove
Honored Contributor III

Data organization will be key to performance.

Where "organization" does not mean organized in an abstract sense, rather here it means organized for vector processing. To an OOP programmer this may seem more like POO. The Xeon Phi has been out long enough for you to find a paper relating to a problem similar to yours. I'd recommend trying to locate such a paper as it will provide insight to how to best (better) address this type of problem (assuming the programmers did their job right).

Fitting into L1 cache is but one aspect of the problem. The cache system has two beneficial attributes: faster access times and a reduction in memory bandwidth requirements. Too often the only aspect looked at is the faster access time. While the two are interrelated, there are other factors involved that require equal attention. Re-use of the data together with streaming fetches seems to work best. This combination tends to be best programmed using a pipeline organization where the different stages overlap. This requires a different way of thinking about how to solve the problem.

Do not think in terms of performing one small 3x3 to 6x6 matrix operation as fast/efficiently as possible; rather, think in terms of processing a lot of particle tracks as fast as possible. IOW, work on 8 or 16 particles in the same operation. A matrix operation that takes 3x more steps but works on 8 or 16 particles is much faster than the 1x steps working on one particle at a time.
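
For example (a rough sketch with made-up names, not tuned code): lay the data out so that the innermost loop runs across a batch of particles and vectorizes, rather than across the elements of one small matrix.

// Batch of NP particles in structure-of-arrays form: element (i,j) of the
// 3x3 matrix is stored for all particles in the batch contiguously.
const int NP = 16;                 // batch size, e.g. one or two vector widths

struct Batch
{
  float m[3][3][NP];               // m[i][j][p]: matrix element (i,j) of particle p
  float x[3][NP];                  // x[i][p]:    input vector element i of particle p
  float y[3][NP];                  // y[i][p]:    result y = m * x
};

void propagate(Batch &b)
{
  for (int i = 0; i < 3; ++i)
  {
    // The loop over particles maps directly onto the 512-bit vector unit.
    #pragma omp simd
    for (int p = 0; p < NP; ++p)
      b.y[i][p] = b.m[i][0][p] * b.x[0][p]
                + b.m[i][1][p] * b.x[1][p]
                + b.m[i][2][p] * b.x[2][p];
  }
}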

Jim Dempsey

McCalpinJohn
Honored Contributor III

It can take quite a lot of experimentation to begin to understand the performance characteristics of Xeon Phi!   Many of the properties are counter-intuitive, and a relatively detailed understanding of the implementation seems essential.

A couple of interesting (to me, anyway) observations about memory-bound codes at various levels of the cache hierarchy:

  1. L1-bound codes require two threads per core, since the L1 cache can deliver one cache line per cycle, but each thread can only run every other cycle.
    1. Alignment is critical here, since loads of unaligned data have to be done twice (VLOADUNPACKLO + VLOADUNPACKHI) and the load cannot be merged into the arithmetic instructions (which can only reference 64-Byte-aligned memory locations) so you have three vector instructions to issue instead of one.  (vloadunpacklo + vloadunpackhi + "vector arithmetic on registers" vs "vector arithmetic with an aligned memory operand").
    2. If you know the data will be in the L1 cache, performance is improved if you disable software prefetching (since the software prefetches use valuable instruction issue slots).
  2. L2-bound codes *can* reach maximum performance using 1 thread per core, but
    1. Software prefetches (VPREFETCH0) are critical to maintain the maximum number of concurrent L1 cache misses (which appears to be 8).
    2. Maximum L2 read bandwidth is roughly 8 cache lines per 24 cycles.  One thread can easily issue 8 vector operations paired with 8 VPREFETCH0 instructions in the 12 issue slots available in 24 cycles.
    3. BUT, if the data being loaded is not aligned, each vector register will require 2 loads (VLOADUNPACKLO + VLOADUNPACKHI), so you need 16 issue slots in 24 cycles, which means you need at least 2 threads/core.
  3. Memory-bound codes usually run best with no more than one thread per core.  This is probably due to the negative effects of "page thrashing" when there are more memory access streams than there are DRAM pages (256 on the Xeon Phi SE10P).
    1. Memory-bound codes with no more than (roughly) 2-3 memory access streams per thread will typically benefit significantly from increasing the prefetch-ahead distances (relative to the compiler defaults).  For STREAM, adding the "-opt-prefetch_ahead=64,8" compilation flag increases the 60-thread Triad performance by about 25%.
    2. Memory-bound codes with "store misses" to arrays that are not going to be re-used before they get evicted from the caches will benefit from non-temporal stores.  The compiler will generate these by default in some cases, but in other cases you will need to add a "#pragma vector nontemporal" to request streaming stores (and streaming store evictions); see the sketch after this list.
    3. Sometimes the default choice for streaming store evictions degrades performance.  For STREAM, adding the "-opt-streaming-cache-evict=0" compilation flag increases the 60-thread Triad performance by about 10% (on top of the 25% mentioned above).
    4. Memory-bound codes with many memory access streams per thread (e.g. 8 or more) will often run faster using fewer than 60 cores.  As a rule of thumb, I have found that the total number of memory access streams should not exceed the number of DRAM banks (256), but the number of threads/cores also needs to be at least ~30.  You will have to experiment to find the best balance between concurrency and DRAM page thrashing.
    5. Memory-bound codes with more than 16 memory access streams per thread will overflow the L2 hardware prefetcher.  This can cause a big performance drop, since L2 hardware prefetches typically provide most of the memory concurrency on Xeon Phi.  Splitting loops into multiple (sequential) loops with 16 or fewer memory access streams each can provide a large performance gain, even if some of the arrays have to be loaded more often in the split version.
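
To make a few of these points concrete (1.1, 1.2, and 3.2 above), here is a sketch of the corresponding source-level controls. It uses Intel-compiler-specific pragmas and intrinsics and is a sketch of the idiom, not a tuned kernel:

#include <mm_malloc.h>

void triad(float *__restrict a, const float *__restrict b,
           const float *__restrict c, float s, int n)
{
  // Promise 64-Byte alignment so the compiler can use aligned vector
  // loads/stores instead of vloadunpacklo/vloadunpackhi pairs (point 1.1).
  __assume_aligned(a, 64);
  __assume_aligned(b, 64);
  __assume_aligned(c, 64);

  // 'a' is written and not re-used before eviction: request streaming
  // (non-temporal) stores (point 3.2).  For data known to be L1-resident,
  // "#pragma noprefetch" can be used instead to suppress software
  // prefetches (point 1.2).
  #pragma vector nontemporal (a)
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + s * c[i];
}

int main()
{
  const int n = 1 << 24;
  // Cache-line (64-Byte) aligned allocations.
  float *a = (float *) _mm_malloc(n * sizeof(float), 64);
  float *b = (float *) _mm_malloc(n * sizeof(float), 64);
  float *c = (float *) _mm_malloc(n * sizeof(float), 64);
  for (int i = 0; i < n; ++i) { b[i] = 1.0f; c[i] = 2.0f; }
  triad(a, b, c, 3.0f, n);
  _mm_free(a); _mm_free(b); _mm_free(c);
  return 0;
}

The prefetch-distance and streaming-store-eviction settings mentioned above are compile-line flags rather than source changes.
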
Matevž_T_
Beginner

Thanks Jim, John!

Indeed we are already trying to work on a set of tracks :) I'm writing a testing prototype that stores a set of matrices in a "matrix-major" format, grouping the (0,0) elements (and so on) of the whole set in contiguous, cache-line-aligned locations so that matrix operations can be performed on the set in vector form. With this I can already do 6x6 matrix multiplication on a set of matrices at 60% of optimal performance (i.e., doing 9.6 FP ops per work-cycle) when the set is small enough to fit into L1 and the outer loops are unrolled manually (by a perl script :)).
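
To make the layout concrete, roughly this (a simplified sketch with made-up names, not the actual prototype):

// "Matrix-major" storage: element (i,j) of all N matrices in the set is
// stored contiguously and 64-Byte aligned, so operations on the set
// vectorize across matrices.
#include <mm_malloc.h>

struct MatrixSet66
{
  int    N;     // number of matrices in the set (assumed a multiple of 16)
  float *d;     // 36 blocks of N floats each

  explicit MatrixSet66(int n) : N(n)
  {
    d = (float *) _mm_malloc(36 * N * sizeof(float), 64);
  }
  ~MatrixSet66() { _mm_free(d); }

  // Pointer to the N contiguous (i,j) elements of the set.
  float* at(int i, int j) { return d + (i * 6 + j) * N; }
};

// C = A * B for the whole set; the inner loop over the set vectorizes.
void multiply_set(MatrixSet66 &A, MatrixSet66 &B, MatrixSet66 &C)
{
  const int N = C.N;
  for (int i = 0; i < 6; ++i)
    for (int j = 0; j < 6; ++j)
    {
      const float *a[6], *b[6];
      for (int k = 0; k < 6; ++k) { a[k] = A.at(i, k); b[k] = B.at(k, j); }
      float *c = C.at(i, j);
      #pragma omp simd
      for (int n = 0; n < N; ++n)
      {
        float sum = 0.0f;
        for (int k = 0; k < 6; ++k)
          sum += a[k][n] * b[k][n];
        c[n] = sum;
      }
    }
}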

Also, I'm beginning to realize that figuring out how to do prefetching right will be crucial - and that one has to get one's hands dirty doing it. Two questions:

  1. Given that our problem is rather "chunked-up", should we foresee using intrinsics for operations and prefetches sooner rather than later? For now we're trying to get by using pragmas only.
  2. We realize we're late to the game and that by the time we're done KNL will be out. Is there a subset of problems that will be alleviated by KNL, so that we can postpone them until then? Or will things only get harsher, and should we try to squeeze every drop out of KNC, if nothing else as a propaedeutic exercise?

Best,  Matevz
