Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Poor Haswell performance with multiple writes in a loop

Nathan_K_3
Beginner

I'm finding some cases where simple loops have surprisingly poor performance on Haswell, and I'm hoping someone can help me understand what's happening.  The problem seems to occur when there is more than one write within a tight loop: the execution time is almost double what I'd expect when the writes go to L3 or RAM, while writes to L1 are (mostly) as expected.  To my surprise, adding "obvious" prefetch statements helps a great deal.

Here are the sorts of loops I'm looking at:

void loop_forward_single(size_t length, uint32_t *array) {
    uint32_t *top = array;
    while(length--) {
        PREFETCH(top + 64);
        *top++ = 1;
    }
}

void loop_forward_double(size_t length, uint32_t *array) {
    size_t half = length / 2;
    uint32_t *top = array;
    uint32_t *middle = array + half;
    while(half--) {
        PREFETCH(top + 32);
        *top++ = 1;
        PREFETCH(middle + 32);
        *middle++ = 1;
    }
}

void loop_backward_single(size_t length, uint32_t *array) {
    uint32_t *bottom = array + length - 1;
    while(length--) {
        PREFETCH(bottom - 64);
        *bottom-- = 1;
    }
}

void loop_backward_double(size_t length, uint32_t *array) {
    size_t half = length / 2;
    uint32_t *middle = array + half;
    uint32_t *bottom = array + length - 1;
    while(half--) {
        PREFETCH(middle - 32);
        *middle-- = 1;
        PREFETCH(bottom - 32);
        *bottom-- = 1;
    }
}

void loop_to_middle(size_t length, uint32_t *array) {
    size_t half = length / 2;
    uint32_t *top = array;
    uint32_t *bottom = array + length - 1;
    while(half--) {
        PREFETCH(top + 32);
        *top++ = 1;
        PREFETCH(bottom - 32);
        *bottom-- = 1;
    }
}
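
"PREFETCH" is defined as a macro that can be compiled in or out.  The exact definition is in the attached test case, but it is along these lines (a minimal sketch, assuming _mm_prefetch with the T0 hint):

#include <xmmintrin.h>

/* Compile with -DUSE_PREFETCH to enable the software prefetches.
   The T0 hint is an assumption; any _MM_HINT_* level could be used. */
#ifdef USE_PREFETCH
#define PREFETCH(addr) _mm_prefetch((const char *)(addr), _MM_HINT_T0)
#else
#define PREFETCH(addr) ((void)0)
#endif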

"PREFETCH" is defined as a macro that can be turned on or off.  Here's what I'm seeing on a Haswell i7-4770 with it off:

Testing with array length=1048576 without prefetch
 loop_forward_single: 1.16 cycles/element
 loop_forward_double: 1.75 cycles/element
loop_backward_single: 1.16 cycles/element
loop_backward_double: 1.75 cycles/element
      loop_to_middle: 1.93 cycles/element

I would have expected all the loops to be about the same speed.  Instead, the loops with two writes per iteration are taking much longer to execute.  Activating the prefetch statements changes this:

Testing with array length=1048576 with prefetch on
 loop_forward_single: 1.07 cycles/element
 loop_forward_double: 1.14 cycles/element
loop_backward_single: 1.33 cycles/element
loop_backward_double: 1.14 cycles/element
      loop_to_middle: 1.07 cycles/element

This is the performance I would have expected without the prefetches. What's happening here?  I'm wondering if this is related to the issue that John McCalpin described here: https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/594954#comment-1843639

The full test case is attached.  
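For reference, the cycles/element numbers come from a harness along these lines (a simplified sketch, not the attached code verbatim; warm-up, pinning, and repeat runs are omitted):

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical timing helper: runs one of the loops above once and
   reports TSC cycles per element.  The attached test case may warm
   the array and take the best of several runs. */
static double cycles_per_element(void (*fn)(size_t, uint32_t *),
                                 size_t length, uint32_t *array) {
    uint64_t start = __rdtsc();
    fn(length, array);
    uint64_t stop = __rdtsc();
    return (double)(stop - start) / (double)length;
}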

TimP
Honored Contributor III

I suppose you would need non-temporal streaming stores to avoid performance depending on software prefetch, since hardware prefetch is interrupted at page boundaries.

According to your comments in the source code, you intend to prevent icc from engaging streaming stores, which it might do with default options (at least in the forward case, for a single stream).  In order to engage streaming stores for 2 write streams per loop, you would need at least to ensure that the compiler sees both as aligned.  With gcc, you would need to specify non-temporal explicitly (e.g. by intrinsics).
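With intrinsics, the single-stream version might look something like this (an untested sketch; the hypothetical loop_forward_single_nt mirrors loop_forward_single above):

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* loop_forward_single with an explicit non-temporal (MOVNTI) store
   in place of the plain store.  The sfence orders the NT stores
   ahead of any code that depends on them. */
void loop_forward_single_nt(size_t length, uint32_t *array) {
    uint32_t *top = array;
    while (length--)
        _mm_stream_si32((int *)top++, 1);
    _mm_sfence();
}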

McCalpinJohn
Honored Contributor III

(1) I have noticed large differences in store performance between the "client" (Xeon E3) and "server" (Xeon E5-26xx) uncores for L3 and memory-contained data with SNB.   I have not looked at this in detail with Haswell clients (like your Core i7-4770), but Haswell EP handles stores (especially streaming stores) better than Sandy Bridge EP or Ivy Bridge EP. 

(2) For this case it is important to note that hardware prefetchers are traditionally much more aggressive with loads than they are with stores.  You might want to try running this as a read/modify/write operation, using a trick like the one IBM used on POWER3 (which did not perform hardware prefetches on sequences of store misses):

// original STREAM Copy code -- copy a[] to b[]
//   The HW prefetcher will prefetch a[] but not b[]
for (i=0; i<N; i++) b[i] = a[i];

// "optimized" STREAM Copy code
//    The HW prefetcher will fetch both a[] and b[]
//    The scalar "zero" should be set to zero in a way that is not visible at compile-time,
//      either by reading the value from an external source or using a computation that
//      is known to produce 0
for (i=0; i<N; i++) b[i] = zero*b[i] + a[i];

(3) Related to (2), you might also want to try this with the hardware prefetchers disabled.

(4) I can't find the details in my brain right now, but I seem to recall that interleaving multiple store streams makes it much more difficult to pipeline the transactions while guaranteeing the processor's ordering model (particularly the requirement that all other cores see the updates in the same order).   The worst case is when the stores go to different NUMA domains, but something similar might show up with stores being handled by different L3 slices in the Core i7-4770 (which appears to have a 4-slice L3).   Often, optimizations like interleaved cache slices that help best-case throughput can hurt best-case latency and worst-case throughput.


(5) Some of the poor performance I was seeing (in the discussion at https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/594954#comment-1843639) was attributable to unaligned loads crossing page boundaries.   The compiler was ensuring that my stores did not cross page boundaries, but the nature of the algorithm was such that 7/8 of the loads in the loop crossed page boundaries.   The penalty for each of those page-crossing loads is not large, but the sum of the seven of them was significant.   I figured out how to avoid these page-crossing loads, but decided not to implement that code, since it would have required 8 different implementations (one for each possible vector alignment between the input and output arrays).

Nathan_K_3
Beginner

John D. McCalpin wrote:
(1) I have noticed large differences in store performance between the "client" (Xeon E3) and "server" (Xeon E5-26xx) uncores for L3 and memory-contained data with SNB.   I have not looked at this in detail with Haswell clients (like your Core i7-4770), but Haswell EP handles stores (especially streaming stores) better than Sandy Bridge EP or Ivy Bridge EP.

In general, store performance on this consumer Haswell has been excellent --- much better than Sandy Bridge.  

(2) For this case it is important to note that hardware prefetchers are traditionally much more aggressive with loads than they are with stores.  You might want to try running this as a read/modify/write operation, using a trick like the one IBM used on POWER3 (which did not perform hardware prefetches on sequences of store misses):

Interesting.  As you suggest, switching to an increment rather than setting to 1 does speed up the "double" loops when no prefetch is used.  The speed ends up about halfway between the prefetch and non-prefetch cases.  It slows things down slightly for the "single" loops, and slows them down more when prefetch is already in use.
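For concreteness, the change is just turning each store into an increment; e.g., the no-prefetch forward double loop becomes something like:

#include <stddef.h>
#include <stdint.h>

/* loop_forward_double with the stores converted to increments, so
   the hardware prefetcher sees a stream of load misses to train on */
void loop_forward_double_rmw(size_t length, uint32_t *array) {
    size_t half = length / 2;
    uint32_t *top = array;
    uint32_t *middle = array + half;
    while (half--) {
        *top++ += 1;
        *middle++ += 1;
    }
}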

(5) Some of the poor performance I was seeing (in the discussion at https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/594954#comment-1843639) was attributable to unaligned loads crossing page boundaries.   The compiler was ensuring that my stores did not cross page boundaries, but the nature of the algorithm was such that 7/8 of the loads in the loop crossed page boundaries.

I realize this was what you were guessing at the time, but I wondered if you were able to confirm it.  I actually discovered this issue in the same situation that you did, with unaligned vector writes.  I presumed that the problem was the large cross-page penalty, but when I changed my code to avoid the cross-page writes, the poor performance remained.  Eventually I switched to scalar writes, and then (not shown) to sparse scalar writes, without any change in behavior.  This made me wonder if you were able to confirm your initial diagnosis.

Tim Prince wrote:
I suppose you would need non-temporal streaming stores to avoid performance depending on software prefetch, since hardware prefetch is interrupted at page boundaries.

True, it's good practice, although I think this effect is much smaller than the performance degradation I'm observing.  Also, it wouldn't explain the difference between the single-write and double-write loops.  My actual use case involves overlapping vector writes, so non-temporal stores won't work for me.

--nate

McCalpinJohn
Honored Contributor III

I was able to verify that the compiler was refusing to emit unaligned 256-bit stores without a "#pragma vector unaligned" directive, and that the penalty for adding that pragma was about 125 cycles per page crossing.   The cumulative impact of the 7 loads that crossed each page boundary was about 220 cycles per page (about 30 cycles each).   I figured out how to generate code to do the computation without any page-crossing loads (using a combination of VBLENDPS and VPERM* operations) in ~25 cycles, but did not want to go to the effort of generating all 8 versions of that code (one for each possible relative alignment of the input and output vectors).
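To give the flavor of that approach for just one of the eight cases (a sketch of the idea, not the actual code): with a 16-byte misalignment, each "unaligned" 8-float load can be composed from two aligned loads, so no load ever crosses a page boundary.

#include <immintrin.h>

/* Synthesize the unaligned vector a[4..11] from two aligned 32-byte
   loads.  Selector 0x21 takes the high lane of v0 (a[4..7]) and the
   low lane of v1 (a[8..11]).  The other seven offsets also need
   in-lane blends/shuffles (VBLENDPS, VPERM*), hence the 8 versions. */
__m256 load_offset_4(const float *a) {    /* a is 32-byte aligned */
    __m256 v0 = _mm256_load_ps(a);        /* a[0..7]  */
    __m256 v1 = _mm256_load_ps(a + 8);    /* a[8..15] */
    return _mm256_permute2f128_ps(v0, v1, 0x21);   /* a[4..11] */
}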

I have some codes that have multiple stores per loop, but these run better on Haswell than on Sandy Bridge (almost 10% better than the ratio of the STREAM bandwidths on the two systems), so either Sandy Bridge has the same issue with multiple stores or Haswell improves the read performance by more than it degrades the write performance.

Travis_D_
New Contributor II

Probably no one is waiting for an answer on this four years later, but I just came across this and wanted to note that I observed a similar effect on Skylake hardware (and checked that it does occur on Haswell as well).

The bottom line seems to be that an L1-hit store effectively forms a barrier between the L1-miss stores before and after it, preventing them from running in parallel. For example, three consecutive L1-miss stores will usually run in parallel, and the total time to service them will often be similar to that of a single store miss, since their handling is effectively overlapped. However, if the "middle" store is instead an L1 hit, like this (where M is a miss and H is a hit):

M1 H2 M3

Then M1 and M3 will be prevented from running in parallel.

Hence it is profitable either to prefetch store locations that are likely to miss (any type of prefetch will do; it doesn't need to be prefetchw), or to organize your stores so that L1 hits and misses are grouped. E.g., if you have a loop where one store normally hits and one normally misses, you have the worst-case scenario: you might unroll by, say, 4x and group the hit and miss stores together so you get 4 misses followed by 4 hits. The 4 misses can then run in parallel, as in the sketch below.
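Schematically, something like this (hot[] and cold[] are hypothetical arrays that normally hit and normally miss in L1, respectively):

#include <stddef.h>
#include <stdint.h>

/* 4x unroll with the stores grouped: the four L1-miss stores are
   issued back to back so they can be handled in parallel, instead
   of each one being fenced off by an intervening L1-hit store. */
void grouped_stores(size_t n, uint32_t *cold, uint32_t *hot) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
        cold[i]     = 1;          /* 4 misses, overlapped */
        cold[i + 1] = 1;
        cold[i + 2] = 1;
        cold[i + 3] = 1;
        hot[(i)     & 63] = 1;    /* 4 hits to a small, L1-resident buffer */
        hot[(i + 1) & 63] = 1;
        hot[(i + 2) & 63] = 1;
        hot[(i + 3) & 63] = 1;
    }
}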

The effect doesn't always occur. On my Skylake machine it seems to depend on the microcode. With the newest microcode I find it always occurs, but with the old microcode it occurred only sometimes, apparently related to some hard-to-pin-down system state. I wrote more about it here, and you can find earlier discussion and analysis on this stackoverflow question and at RWT.
