I was running some microbenchmarks on a Xeon E5-2660 v3 system and came across some very odd behavior in the L2 streaming prefetcher....
The test is a simple sum reduction of a single vector with size selected at compile time. The Intel 15 compiler generates completely reasonable code for every array size that I checked -- all vector, all aligned, no extra cruft.
The code is instrumented with calls to RDTSCP and one or more PMCs after each sum reduction. Processor frequency is pinned to 2.60 GHz.
For L1-contained data, the behavior is exactly as expected. Using array sizes of 128 to 1536, I can do a linear least squares fit to the data to see that the overhead is about 77 cycles (about the expected value for RDTSCP + 1 RDPMC) and that the per-element cost is just a hair under the theoretical rate of 4 elements per cycle (limited by the single VADDPD per cycle).
For L2-contained data, the performance is erratic, varying by more than 15%. I tracked this down to O(1) variations in the number of L2 Hardware Prefetches generated.
First, it is a bit weird that there are L2 HW prefetches at all, since there are no L2 misses after the initial iteration loads the data (and I discard the counts from that iteration). But it takes a while to confirm an L2 miss, and the cost of a redundant prefetch is pretty low compared to the cost of a miss, so an aggressive policy is not intrinsically unreasonable.
Second, it is very weird that the number of L2 prefetches varies randomly from zero to about 90% of the cache lines transferred. I have looked at a lot of data and while there appear to be "phases" in the temporal variability of the L2 HW Prefetch rate, I can't come up with any way to make the behavior repeatable. Performance is best (and has the least variability) when the L2 Streaming Prefetcher is disabled. Over the size range from N=6144 (8KiB) to N=24576 (192KiB), the hardware prefetchers degrade the performance by an average of about 6.5%.
My (tentative) interpretation for the performance variability is that the L2 Hardware Prefetches conflict with L1 Data Cache misses that are trying to read from the L2 cache. This causes the L1 Data Cache read misses to be retried at the L2 -- slowing down the execution by somewhere between 0.5 cycles and 1.5 cycles for every retry.
Sandy Bridge shows the same variability in L2 HW Prefetch rates, but without the corresponding variability in execution times. I assume that this is because the peak L2 BW on the Sandy Bridge L2 is 32 Bytes/cycle, leaving a lot more cycles free for L2 cache tag accesses. Looking at the same size range as on Haswell, the hardware prefetchers *improve* the performance on Sandy Bridge by about 6.5%.
I can't think of any parameters that I am not controlling, yet the number of L2 HW prefetches still varies randomly. If it weren't for the strong correlation with increased execution time I would probably assume that the counters were incorrect, but the correlation is way too strong.
Any thoughts on hardware mechanisms that would lead to this sort of consistent inconsistency?
After staring at lots more data, this is starting to look more like a "feature" than a completely weird behavior. :-)
When I plot the data vs time (rather than vs iteration number) I see that every millisecond there is a spike in execution time associated with the Linux kernel timer interrupt. This is not surprising, and I normally discard these elevated numbers.
BUT -- in addition to the spike in timings, there is a strong spike in the number of L2 Hardware Prefetches immediately following the interrupt. The number of L2 HW Prefetches shows a slow decrease in the subsequent iterations until the next interrupt, which causes it to spike again. The behavior is not strictly periodic, but the trend appears significant.
This may be a deliberate design feature. Since kernel code can cause user code to be evicted, it is plausible that making the L2 Hardware Prefetcher more aggressive immediately after a Ring 0 to Ring 3 transition would be of general benefit. The boost to the aggressiveness of the L2 Hardware Prefetcher would presumably degrade over time, eventually returning to a baseline level. In this case the degrading boost does not degrade fast enough, and the L2 HW Prefetches don't reach an asymptotic level before the next kernel timer interrupt occurs.
I have not seen such a mechanism described in the literature, but it seems plausible that it would help performance and it seems likely that the hardware implementation would be straightforward.
Perhaps an easier explanation is suggested by some of the wording in the Intel Optimization Reference Manual in the description of the L2 Hardware Prefetchers in the Sandy Bridge core (section 2.2).
In the description of the L2 "Streamer" prefetcher, there are several comments that point to dynamic prefetch policies. For example:
"If there are not many outstanding requests, the streamer prefetches further ahead."
Presumably an interrupt handler will not generate anywhere near as many concurrent L1 misses as my L2-contained vector summation code, so on the return from an interrupt the L2 streamer prefetcher will start with an aggressive policy. As the code ramps up the number of requests (L1 Data Cache misses), the L2 prefetcher will become a bit less aggressive. This decrease in the number of L2 prefetches actually allows performance for L2-contained data to increase, which may result in further reductions in the aggressiveness of the L2 streamer prefetcher.
The time scale is longer than I would expect for this sort of nonlinear feedback process, but if the coefficients of the state machines are tuned properly, it may be possible to generate this sort of behavior.
If the aggressiveness of the L2 prefetcher also depends on a (backward-looking) running average of the request rate, then this behavior should be relatively easy to create.
In either case, "hiccups" in the timing caused by the infrequent writebacks of timing data from the L1 Data Cache could cause the prefetcher to become more aggressive again (e.g., if the writeback gets in the way of L2 to L1D transfers and therefore either decreases the number of concurrent L2 requests or the rate at which the L2 is receiving read requests). Since my code does not attempt to align its iterations (temporally) with interrupts, the L1D writebacks will occur at essentially random times relative to the changes in L2 Prefetch aggressiveness produced by the timer interrupts.
Some combination of reducing the interrupt rate to 100 Hz and using non-temporal stores for the timing data may provide data useful for evaluating this hypothesis.... I can also modulate the L2 request rate (downward) by inserting dummy work into the code (e.g. repeating the arithmetic on the same addresses or modifying the 8 VADDPD instructions to access fewer than 8 different addresses). If the L2 PF rate is directly related to the L2 request rate, then this should provide a fairly strong signal.
After several years (subjective time) of additional experiments, it appears that the variability I have been seeing only shows up when a code is getting very close to the maximum L2 bandwidth. It still looks like the L2 hardware prefetches fight with the L1 misses for access to the L2 cache tags, causing some of the L1 misses to retry their L2 accesses. These retries add to the execution time.
If I reduce the bandwidth demand from the maximum possible rate (i.e., the core requires at least one cache line per cycle from the L2 to match the core-limited performance) to 2/3 of the maximum rate (two cache lines every three cycles), the performance variability disappears back into the noise (under 0.5%). There are still lots of L2 HW PF requests, and there is still a lot of variability in the number of requests, but this no longer has a significant influence on performance.
So this is weird and confusing, but it is probably only an issue with synthetic microbenchmarks that try to achieve the maximum L2 bandwidth.