Solved: Very thanks Mr Dempsey.

kadir · ‎06-04-2014

'a' is a double array of size 10000. It is small enough to reside in cache of a core. Following operations run on a single core of Xeon Phi. The operations may be parallellized for 1,2, 3, or 4 threads.

Operation A:

for(i=0; i<10000;i++)

a+=alpha;

Operation B:

for(i=0; i<10000;i++)

alpha+=a;

Is it possible to improve 'Operation A' so that it does not perform worse than 'Operation B'?

McCalpinJohn · ‎06-04-2014

The simple answer is "no, it is not possible to improve 'Operation A' so that it is as fast as 'Operation B'". (It is, of course, possible to modify 'Operation B' so that it is as slow as 'Operation A', but that is a classic case of "fixing a ratio the wrong way".)

I will start with Operation B (alpha += a) since it is easier.

The performance of operation B depends on the initial location and state of the cache lines containing the a[] array.

In the simplest (and fastest case) assume that a[] is in the local L2 cache in the shared, exclusive, or modified state.

The minimum execution time on a single core is given by the time required to load all the data into the core. 10,000 doubles occupy 1250 cache lines. The L2 cache can deliver 8 cache lines every ~25 cycles (independent of the number of threads used on the core) so the 1250 lines will take just under 4000 cycles.
- Since none of the elements of a[] are being modified, none of the loads from L2 to L1 will cause L1 victims to be written back to the L2.
Multiple summation variables will be needed for the core to keep up. A serialized scalar summation requires ~6 cycles per element, or 60,000 cycles for 10,000 elements. Since we want to complete in under 4000 cycles, we need at least 15 summation variables. This is easily accomplished using two vector registers with 8 summation variables each, requiring ~3750 cycles.
The interaction of the arithmetic and the cache accesses means that one more bit of optimization is required. If there are only two vector registers being used for the summation, there will generally be only 2 concurrent accesses to the L2. Since full L2 bandwidth requires 8 concurrent accesses, one needs to either add software prefetches to move the data from the L2 to the L1, or just go ahead and use eight vector registers to hold the partial sum values. I prefer the latter approach, and have managed to get summation performance that is very close to the expected rate of 8 cache lines every ~25 cycles.

Operation A (a[]+=alpha) requires much more cache traffic.

In the fastest case, assume a[] is in the local L2 cache in the exclusive or modified state.

We still need to read the 1250 cache lines from the L2 to the L1.
But in this case we modify each of those lines, so that when the modified lines are chosen as L1 victims, they must be written back to the L2.
- This requires additional L1 bandwidth to write the dirty lines and additional L2 bandwidth to receive the dirty lines.
I have not measured this particular case, but I can make a pretty good guess about the minimum time.
- The L2 needs to read 1250 cache lines (to send to the L1) and needs to write 1250 cache lines (writebacks from the L1). At the maximum data transfer rate of 32 Bytes/cycle, this will take 5000 cycles.
- The L1 cache will only require 2500 cycles to write the 1250 lines it receives and to read the 1250 lines it needs to write back, so it will not be the bottleneck.
- This assumes that the writebacks occupy the L2 interface cycles that cannot be used for reads due to the limit of 8 concurrent L1 cache misses. This is a "best case" assumption -- reality seldom allows perfect overlap.

If the data is initially in the L2 cache in the shared state, performance will be much slower because of the need to obtain exclusive access to each line. This case is probably similar in performance to the case for which the data is initially in memory. In this case the latency is ~300 cycles (rather than 25 for L2-resident data), so performance should drop by a factor of 12 or so. There are additional complexities here, but since this is not the primary case of interest I will stop here.

View solution in original post

James_C_Intel2 · ‎06-04-2014

This may be irrelevant, since I read your comment back to front (with A and B reversed in my head) but since it's already posted I'll just say that here and avoid you having to point it out :-)

If you are doing operation B in parallel, then you can avoid the problem anyway, since if you write it as

#pragma omp parallel for reduction(+:alpha)

for (i=0;i<10000;i++)
    alpha += a;

your threads will each reduce into a thread local variable and the final reduction will be up a tree into the global result. there will therefore be no contention. (Of course if you parallelise it naively you have a race on alpha). You'll probably want to look at OpenMP 4.0 affinity directives here to create teams of four threads, though it's not clear that you can win for such a small parallel region.

If the loop really is that simple, then the compiler should vectorize it, and it's not clear that using more than one thread/core will then be beneficial.

jimdempseyatthecove · ‎06-04-2014

Jim,

Kadir's question was not related to parallization, rather it has to deal with Read/Modify/Write vs Read

In case A, assuming vectorization and cache aligned data, the operations are:

vecReg = a[...]; vecReg += alpha'; a[...] = vecReg;

The above has two accesses to L2 cache (assuming a were in L2 to begin with) and alpha' is a vector of alpha's

In case B, assuming vectorization, the operations are:

vecReg = a[...]; alpha' += vecReg;

The above has one access to L2 cache (assuming streaming stores NOT used) and alpha' is simd reduction which at end of loop gets horizontally added.

The actual code will differ from the pseudo code as the optimization may perform register stuffing (somewhat like loop unrolling excepting that different registers used and instructions are interleaved).

As for L2 access A has order of 2 accesses, B has order of 1 access.

Jim Dempsey

James_C_Intel2 · ‎06-04-2014

Kadir's question was not related to parallization, rather it has to deal with Read/Modify/Write vs Read

Right, that's why I said "This may be irrelevant, since I read your comment back to front " :-)

On his actual question, it seems unlikely that on a code that is likely already bandwidth limited you can reasonably expect to halve the floating point intensity (from one op/word transferred to 0.5op/word) and expect it to perform the same.

jimdempseyatthecove · ‎06-04-2014

Now for parallelization on Xeon Phi.

Presumably the array of 10,000 is to be processed by 1 to 4 threads from the same core. In this case, there is no #pragma omp ... that will constitute a subset thread team consisting of the hardware threads contained within a single core. With this in mind, you likely have introduced a similar functionality to what was used in my IDZ blog series "The Chronicles of Phi" illustrating what I call a Hyperthread Phalanx. Due to lack of appropriate #pragma's one is lead to hand partition the loop and hand reduction. These operations can be augmented through use of class objects and/or templates.

Jim Dempsey

jimdempseyatthecove · ‎06-04-2014

>>bandwidth limited

Case B is bandwidth limited to fetch time from L2 cache (assuming data in L2 as stated). Two or three threads from one core will hit the bandwidth limitation.

Case A is a little different, not only in that its L2 pressure is 2x that of case B, but it is also writing to memory (and cache). The writes to memory at some point will exceed the core's ability to buffer writes and potentially hit the memory bandwidth (though in this case I do not think so). Additionally, the cache coherency system may take a little longer in light that the writes may cause cache line evictions on other core's L2's ("may" is used since these processor design engineers are quite skilled at their work).

Jim Dempsey

kadir · ‎06-04-2014

Very thanks Mr Dempsey.

From your sentences,

but it is also writing to memory (and cache).

and

The writes to memory at some point will exceed the core's ability to buffer writes and potentially hit the memory bandwidth

, I understood followings for single thread case:

1) Write of a recently Read cache line costs at least 1 cycle. The cost of Write cannot be hidden somehow.

2) Every Write to cache also induces a Write to memory. Thanks to writeback cache, the cost of writing to memory is hidden (may be to some extend according to sentence 2).

If I have a misunderstanding, would you please correct me?

McCalpinJohn · ‎06-04-2014

The simple answer is "no, it is not possible to improve 'Operation A' so that it is as fast as 'Operation B'". (It is, of course, possible to modify 'Operation B' so that it is as slow as 'Operation A', but that is a classic case of "fixing a ratio the wrong way".)

I will start with Operation B (alpha += a) since it is easier.

The performance of operation B depends on the initial location and state of the cache lines containing the a[] array.

In the simplest (and fastest case) assume that a[] is in the local L2 cache in the shared, exclusive, or modified state.

The minimum execution time on a single core is given by the time required to load all the data into the core. 10,000 doubles occupy 1250 cache lines. The L2 cache can deliver 8 cache lines every ~25 cycles (independent of the number of threads used on the core) so the 1250 lines will take just under 4000 cycles.
- Since none of the elements of a[] are being modified, none of the loads from L2 to L1 will cause L1 victims to be written back to the L2.
Multiple summation variables will be needed for the core to keep up. A serialized scalar summation requires ~6 cycles per element, or 60,000 cycles for 10,000 elements. Since we want to complete in under 4000 cycles, we need at least 15 summation variables. This is easily accomplished using two vector registers with 8 summation variables each, requiring ~3750 cycles.
The interaction of the arithmetic and the cache accesses means that one more bit of optimization is required. If there are only two vector registers being used for the summation, there will generally be only 2 concurrent accesses to the L2. Since full L2 bandwidth requires 8 concurrent accesses, one needs to either add software prefetches to move the data from the L2 to the L1, or just go ahead and use eight vector registers to hold the partial sum values. I prefer the latter approach, and have managed to get summation performance that is very close to the expected rate of 8 cache lines every ~25 cycles.

Operation A (a[]+=alpha) requires much more cache traffic.

In the fastest case, assume a[] is in the local L2 cache in the exclusive or modified state.

We still need to read the 1250 cache lines from the L2 to the L1.
But in this case we modify each of those lines, so that when the modified lines are chosen as L1 victims, they must be written back to the L2.
- This requires additional L1 bandwidth to write the dirty lines and additional L2 bandwidth to receive the dirty lines.
I have not measured this particular case, but I can make a pretty good guess about the minimum time.
- The L2 needs to read 1250 cache lines (to send to the L1) and needs to write 1250 cache lines (writebacks from the L1). At the maximum data transfer rate of 32 Bytes/cycle, this will take 5000 cycles.
- The L1 cache will only require 2500 cycles to write the 1250 lines it receives and to read the 1250 lines it needs to write back, so it will not be the bottleneck.
- This assumes that the writebacks occupy the L2 interface cycles that cannot be used for reads due to the limit of 8 concurrent L1 cache misses. This is a "best case" assumption -- reality seldom allows perfect overlap.

If the data is initially in the L2 cache in the shared state, performance will be much slower because of the need to obtain exclusive access to each line. This case is probably similar in performance to the case for which the data is initially in memory. In this case the latency is ~300 cycles (rather than 25 for L2-resident data), so performance should drop by a factor of 12 or so. There are additional complexities here, but since this is not the primary case of interest I will stop here.

jimdempseyatthecove · ‎06-04-2014

Good description of what is going on. Much better than my oversimplification. Yours describe circumstances under which A can be much worse than 2x B.

What are your thoughts on when/if the write-back queue to RAM gets filled?

Keep in mind the variances of only one core performing operation A and all cores (~60) performing operation A on different data.

Jim Dempsey

McCalpinJohn · ‎06-04-2014

Based on the STREAM benchmark, it looks like one core can only move about 3.5 to 4 GB/s to/from DRAM. Read-only rates are similar. I have not tested stores in isolation. Non-temporal stores on Xeon Phi are not the same as non-temporal stores in other Xeon processors. For most Xeon processors (except the Xeon E7, if I understand correctly), a streaming store writes directly to a store buffer, then when the store buffer is full it sends it directly to memory. In this scenario the memory is never read before being overwritten. For Xeon Phi, non-temporal stores skip the read of the data from the L2 into the L1, but they don't skip the read of the data from memory to the L2. Eliminating the allocation into the L1 cache saves reading the line from the L2, writing it into the L1, then reading it back out from the L1 when it is chosen as the victim line. This seems to help the STREAM benchmark a lot, but I have not found it to help other codes very much. (Part of this is the difficulty of getting the compiler to generate non-temporal stores for all of the desired store targets in more complex codes.)

This loop is much to small to consider for parallelization on Xeon Phi. The overhead of an OMP PARALLEL FOR loop varies from thousands of cycles (when limited to the four threads on one physical core) to several 10's of thousands of cycles when using all cores, so there is no way that you would be able to speed up a loop that should only take 4000 cycles. The overhead could be improved with specialized code generation for threads that are known to be on the same chip, but parallelization across cores will be fairly slow even with perfect code because of the high cache-to-cache intervention latency.

Read&Write cost vs Read cost of a cache line