Beginner

Haswell L2 cache bandwidth to L1 (64 bytes/cycle)?

Hello,

I'm having problems achieving the 64 bytes/cycle L2 to L1 cache bandwidth on the Haswell core.  I can only achieve 32 bytes/cycle.  Does anyone know if this is possible?  Several other reviewers on the web are having the same problem.  Was this a typo on the Intel slides for Haswell?  My current code is using integer loads.  Is the claimed 64 bytes/cycle dependent on SSE loads?  I'm stumped.

Thanks,

Stephen

 

Tom's hardware:

L1D, L2, and L3 cache bandwidth are all up on our Core i7-4770K compared to my preview piece. However, only the L1D yields the doubling of bandwidth we were expecting. Given 64 bytes/cycle (compared to 32 for Ivy Bridge), this number should still be much higher than it is.

http://www.tomshardware.com/reviews/core-i7-4770k-haswell-review,3521-12.html

www.extremetech.com:

We’re surprised that the L2 cache is just 7% faster on Haswell, but the massive L1 bandwidth is a huge jump.

http://www.extremetech.com/computing/157125-haswell-review-intels-core-i7-4770k-takes-over-the-pole-...

 

34 Replies

Black Belt

Thanks for the clarification -- I wondered if you might be referring to the Skylake client, but did not remember long enough to mention it...

The Haswell L2 HW prefetch behavior is almost certainly just a side effect of the heuristics used.   It looks like the L2 HW prefetcher is aggressive when the L2 access rate is low (or when the L2 miss queue is lightly populated), and backs off when the L2 access rate is high (or the L2 miss queue gets close to full).   This is consistent with the few words that Intel has published on the topic.   It seems plausible to assume that the scheduler interrupt handler creates a low L2 cache access rate, so the HW prefetcher will naturally be most aggressive when it sees new data streams after returning from that interrupt handler.   What I don't understand is why it takes so long for the L2 HW prefetcher to realize that the L2 cache is really busy servicing L1 requests and decide that it should back off.   Reading a 64 KiB array from the L2 cache at 25 Bytes/cycle only takes about 2620 cycles, or about 1 microsecond.  So I can run almost 1000 of these iterations in the time between two ticks of the 1 millisecond scheduler interrupt, and in most cases it takes most of those iterations before the L2 HW prefetcher actually backs off significantly.
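The arithmetic in the paragraph above can be sanity-checked directly (all numbers are taken from the post: a 64 KiB array, the ~25 Bytes/cycle L2 read rate, the 2.6 GHz pinned clock, and a 1 ms scheduler tick):

```python
# Back-of-the-envelope check of the iteration-vs-scheduler-tick claim.
ARRAY_BYTES = 64 * 1024
BYTES_PER_CYCLE = 25        # approximate L2 read bandwidth from the post
FREQ_HZ = 2.6e9             # nominal pinned frequency (TSC == core cycles)
TICK_S = 1e-3               # scheduler interrupt period

cycles_per_iter = ARRAY_BYTES / BYTES_PER_CYCLE      # ~2621 cycles
seconds_per_iter = cycles_per_iter / FREQ_HZ         # ~1.0 microsecond
iters_per_tick = TICK_S / seconds_per_iter           # ~992 iterations

print(round(cycles_per_iter), round(seconds_per_iter * 1e9), round(iters_per_tick))
```

So "almost 1000 iterations between ticks" checks out.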

The attached file shows a series of these L2 read bandwidth results from a Xeon E5-2690 v3 (Haswell EP) system. 

  • The x axis is wall-clock time in seconds.
  • The purple "+" symbols are the "excess" TSC cycles required to complete each iteration.
    • "Excess" is relative to my best measured time of 7540 cycles (=> 26.1 Bytes/cycle).
    • The average "excess TSC" cycles is about 400, corresponding to about a 5% slowdown relative to the best result.
    • The worst "excess TSC" values are in the 750 cycle range, corresponding to a 10% slowdown.
  • The green "x" symbols show the L2_RQSTS.ALL_PF counts (Event 0x24, Umask 0xF8).
    • These show a huge spike every millisecond.
    • The most common pattern shows a decay until the next millisecond "spike", but there are many irregularities.
  • The processor frequency was pinned to the nominal 2.6 GHz frequency, so TSC and core cycle counts are the same.
  • This figure shows 10 milliseconds pulled from a 62 millisecond run -- the remainder shows similar behavior, but makes the plot even harder to read.
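The slowdown percentages in the list above follow directly from the "excess TSC" numbers (7540-cycle best case, ~400-cycle average excess, ~750-cycle worst-case excess):

```python
# Slowdown implied by the "excess TSC" figures relative to the 7540-cycle best case.
BEST_CYCLES = 7540
for excess in (400, 750):
    print(f"+{excess} cycles -> {excess / BEST_CYCLES:.1%} slowdown")
```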

On Sandy Bridge this highly variable behavior of the L2 HW prefetcher is similar (in particular, L2 prefetches are issued for all streams, even if there are never any L2 misses), but since the L2 cache requires 2 cycles to deliver a line to the L1 cache, there is always an alternate cycle in which the prefetcher can query the L2 tags without interfering with accesses from the L1 Data Cache.  So the L2 HW prefetches don't result in noticeable performance variation.  Even on Haswell, you have to be running the L2 interface at very close to 100% utilization before the HW prefetches interfere with L2 hits enough to make a noticeable performance difference.  This is easy to do with microbenchmarks, but seems quite unlikely with a real application.   Stores almost certainly change the picture as well, but it is much harder to create controlled experiments that have a mix of loads and stores, so I have not done much in that direction (for L2-contained data).

New Contributor II

Dr. McCalpin,

Thanks very much for those results.

It's interesting you mention stores. I actually got onto this topic because of very unusual (and poor) performance for forward streaming writes totally contained in L2 (a 64 KiB buffer). As long as every write is in the same cache line as the last, or in the next cache line, things are fast, but as soon as there is any interleaving of writes to different cache lines (including just having writes that always hit L1 mixed with a single stream of L2 writes), performance gets bad quickly (sometimes dropping to only one cache line every 20 cycles or so).

I asked about it over on RWT and on StackOverflow.

 

Black Belt

Linus Torvalds' comment on the RWT thread is probably the closest, though it is hard to tell if he has the details right without being on the processor design team....

Most descriptions of cache coherence protocols skip over a great many nasty details concerning multi-level caches and concerning transient states.  Both are relevant here.

Intel's memory consistency model requires that all Logical Processors in the system observe (ordinary) stores from a single Logical Processor in program order.  This can be implemented in several different ways, each of which embodies different trade-offs with respect to performance for the various ordering cases and complexity of implementation.  In my experience, many straightforward (simple and fast) implementation approaches can't be made to work, because some corner case cannot be caught in time to prevent a violation of the ordering model.  One ugly case with the Intel processors is that the requirement that *all* other Logical Processors observe the stores in the same order includes a core's  sibling (HyperThread) Logical Processor(s).  This means that (at least) the store buffers must be kept separate for the Logical Processors sharing a core, and that sibling thread access to store results must (at least appear to be) via the same cache mechanisms used by other physical cores.  Even though HyperThreading is disabled on your system, it is entirely possible that the implementation has performance characteristics that can be traced back to the need to maintain consistent ordering for the sibling HyperThread.

When a store misses in the L1 Data Cache, that L1 line is marked as being in a transient state, pending completion of the RFO or upgrade operation (depending on whether the data was in the L1 cache originally), and the data for the store is held in a store buffer.   The L1 Data Cache controller allocates a Fill Buffer and transmits either an RFO or upgrade request to the L2.  The store buffers cannot be snooped by external agents (including the sibling HyperThread, as noted above), but they must be snooped by the local Logical Processor to satisfy ordering requirements (a read after a store has to get the most recently stored value!) and an implementation may have many complex rules related to the performance of these "forwarding" scenarios.   A second store to a different cache line will also go into a store buffer.  The store buffer contains metadata defining the program order of the stores, either implicitly via a hardware FIFO structure, or explicitly via some other mechanism.  

One of the primary reasons to have store buffers is to reduce the number of write cycles to the L1 cache.   If the stores are smaller than the natural minimum granularity of the cache access (which may be a full cache line in recent processors), then updating the line in the L1 Data Cache will require a read-modify-write cycle in order to compute the updated error correction bits for the minimum granularity "chunk" that contains the updated bits from the store.  For store buffers to be useful in reducing writes or read-modify-writes to the L1 Data Cache, the data must be held in the store buffer until the remainder of the "chunk" is written.  Holding the data longer will increase the chance of avoiding unnecessary L1 read-modify-write cycles, but it will also require more store buffers and will increase the latency of true producer-consumer operations (since the store buffers are not snooped, a consumer cannot force the data to become visible more quickly).   Based on my experience, I would assume that there are a number of heuristics in the implementation that determine when to attempt to commit the store buffers to the cache, and some of the mechanisms may be adaptive.   In any case, for an L1 Data Cache miss, the store buffer cannot be committed until the L1 Data Cache receives the appropriate completion message(s) for the RFO or upgrade.   In the case of an L2 hit (where the L2 line is in M or E state), the delay should be about the same as an L2 read latency -- something like 14 cycles on a Skylake Xeon processor.

A store following the L1 Miss/L2 Hit store will also put its data in the store buffer.  In your case, the second store is to a fixed location in the L1 Data Cache.   This clearly cannot be committed to the L1 Data Cache before the preceding L1 Miss/L2 Hit, but why are they not pipelined so that the second store simply commits immediately after the first?   One possibility is that this is an unanticipated side effect of writing to the same location in the L1 Data Cache.   As I noted above, writing 8 Bytes to the L1 Data Cache probably requires a full read-modify-write cycle on that line.  Since every second store goes to the same L1 Data Cache line, the added latency is the read-modify-write latency of the L1 Data Cache.  Subsequent L1 Miss/L2 Hit stores may or may not be able to overlap their latency with this read-modify-write-limited store.  This hypothetical model would explain the observed behavior without requiring that the store buffers be flushed.

Of course it is also possible that something in this code forces the store buffers to be flushed, with a resulting serialization.  The reason may or may not be comprehensible given publicly available information on the implementation.

Beginner

McCalpin, John wrote:

One of the primary reasons to have store buffers is to reduce the number of write cycles to the L1 cache.   If the stores are smaller than the natural minimum granularity of the cache access (which may be a full cache line in recent processors), then updating the line in the L1 Data Cache will require a read-modify-write cycle in order to compute the updated error correction bits for the minimum granularity "chunk" that contains the updated bits from the store.

Are you sure / do you have a source for that?  Travis's data does seem to support combining of adjacent stores to the same cache line (and that does make sense for stores that have retired and thus are both / all non-speculative), but read-modify-write would be surprising.

 

According to Paul A. Clayton, Intel's L1D caches use parity, not ECC at all, exactly because L1D has to deal with narrow and unaligned stores.  (L2 / L3, and L1I, probably use ECC because they only have to track whole lines).  I didn't manage to turn up anything for Skylake L1D parity or ECC with a quick search, but parity sounds very plausible.

A read-modify-write to update ECC bits seems unlikely; historically if you want to support single-byte stores in an ECC-protected L1D you'd use byte granularity ECC (at the cost of 50% overhead; this is one reason the designers of Alpha AXP cited for not including byte or 16-bit stores in that ISA), because read-modify-write takes additional hardware and is slow.  (Related: this StackOverflow answer I wrote about byte stores on modern x86.  Parts of it may be wrong; I'm not a CPU designer and some of it is guesswork on my part.)

 For store buffers to be useful in reducing writes or read-modify-writes to the L1 Data Cache, the data must be held in the store buffer until the remainder of the "chunk" is written.

Couldn't the store buffer just look at adjacent stores in the queue and combine them if they're to the same line?  There are multiple cycles to check for this after stores enter the queue but before they're ready to commit (i.e. from when the store uops execute to when they retire).  Checking only adjacent stores makes sense because of x86 memory ordering requirements, and would make sense in hardware if the store buffer entries in the store queue are a circular buffer used in order, so consecutive stores in program order use physically adjacent buffers.  (IDK if that's how it works; that would make sense if the front-end issued stores into store buffers as well as the ROB, but not if store buffers aren't allocated until store address/data uops execute out-of-order.)

This could effectively turn multiple stores to a line into a single full-line masked store, as far as actually committing data to the cache.  (Presumably it can't release either store buffer, though).  Skylake's L1D has to support AVX512BW vmovdqu8 with byte-granularity merge-masking, and previous CPUs had to support AVX vmaskmovps and so on, so HW support for committing non-contiguous data to cache probably already existed.  (AVX masked stores do have perf penalties, but according to Agner Fog's testing, a VMASKMOVPS store is only 1 ALU uop + 1 store-address and store-data uop, and can execute 1 per clock, so it doesn't take any more cache-throughput resources than a normal store.  So either masked and normal stores require a read-modify-write, or neither do.)
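The adjacent-store coalescing idea sketched above can be illustrated with a toy model: an in-order store queue where a new store merges with the immediately preceding entry only if it targets the same 64-byte line. Restricting merging to program-order-adjacent stores is what keeps x86 store ordering intact. (This is purely illustrative, not a claim about the real hardware.)

```python
# Toy model: count L1D write commits for a program-order stream of store addresses,
# coalescing a store into the previous queue entry iff it hits the same 64 B line.
LINE = 64

def commits(addresses):
    """Return the number of L1 write cycles needed for the given store stream."""
    queue = []  # each entry: (line number, set of byte addresses written in that line)
    for addr in addresses:
        line = addr // LINE
        if queue and queue[-1][0] == line:
            queue[-1][1].add(addr)        # coalesce with the adjacent older store
        else:
            queue.append((line, {addr}))  # new entry, ordering preserved
    return len(queue)

# Eight sequential 8-byte stores filling one line: a single commit.
print(commits(range(0, 64, 8)))   # -> 1
# Alternating stores to two different lines: no adjacent pair matches, no coalescing.
print(commits([0, 128] * 4))      # -> 8
```

The second case is exactly the "interleaved lines" pattern discussed earlier in the thread: adjacent-only coalescing gives it no benefit.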

 

One possibility is that this is an unanticipated side effect of writing to the same location in the L1 Data Cache.   As I noted above, writing 8 Bytes to the L1 Data Cache probably requires a full read-modify-write cycle on that line.  Since every second store goes to the same L1 Data Cache line, the added latency is the read-write-modify latency of the L1 Data Cache.

This explanation would imply that a loop like this, alternating stores between two cache lines that stay hot in L1D, should suffer the same slowdown.

.loop: 
    mov  [buf], rbp           ; aligned store
    mov  [buf+128], rax       ; aligned store to another line
    sub  rbp, 1
    jg   .loop

But it doesn't.  The loop runs at ~2.002 cycles per iteration, almost exactly one store per clock. (mov ebp, 100000000 before the loop, timed on Linux with perf to count cycles, on an i7-6700k).  The read-modify-write latency theory can't explain this and the slowdown with L1D misses without some kind of speculative memory ordering stuff that rolls back when another core requests a line, so I think this is strong evidence against it (and further evidence against stores needing to read-modify-write at all).

New Contributor II

Peter Cordes wrote:

This explanation would imply that a loop like this, alternating stores between two cache lines that stay hot in L1D, should suffer the same slowdown.

.loop: 
    mov  [buf], rbp           ; aligned store
    mov  [buf+128], rax       ; aligned store to another line
    sub  rbp, 1
    jg   .loop

But it doesn't.  The loop runs at ~2.002 cycles per iteration, almost exactly one store per clock. (mov ebp, 100000000 before the loop, timed on Linux with perf to count cycles, on an i7-6700k).  The read-modify-write latency theory can't explain this and the slowdown with L1D misses without some kind of speculative memory ordering stuff that rolls back when another core requests a line, so I think this is strong evidence against it (and further evidence against stores needing to read-modify-write at all).

I agree with you that it seems unlikely that a full RMW happens for every write that hits in L1, but I'm not sure that test proves it. As long as the RMWs can be overlapped, which seems likely, the above loop only needs 2 RMWs, hence 2 reads and 2 writes every 2 cycles, which works in both the non-RMW and RMW scenarios. However, if you add some reads in there, you'll find that the loop still runs too fast for the RMW scenario:

.top:
    mov [rsi], rdi
    mov [rsi + 128], rdi
    mov rax, [rsi + 256]
    mov rcx, [rsi + 384]
    mov rcx, [rsi + 512]
    mov rcx, [rsi + 640]
    sub    rdi,1
    jne    .top

That's basically Peter's loop, with 4 reads added, and it still runs at about 2.1 to 2.2 cycles per iteration. So in this case the 2 RMW and the 4 reads all completing in 2 cycles would imply a third read port on L1. Of course, a third read port is certainly possible - but in general I haven't seen any advantage for "completing the cache line" for L1 contained loads: the L1 seems to behave mostly as an ideal cache with 2 read ports and 1 write port.
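The port accounting behind that inference can be spelled out numerically, assuming the commonly reported 2-read-port / 1-write-port L1D configuration (the load/store counts and the ~2.1-2.2 cycle measurement are from the loop above):

```python
# Minimum cycles/iteration implied by L1D port limits, with and without an
# extra line read per store (the hypothetical RMW).
READ_PORTS, WRITE_PORTS = 2, 1   # assumed L1D configuration
LOADS, STORES = 4, 2             # per iteration of the loop above

def min_cycles(extra_reads_per_store):
    reads = LOADS + STORES * extra_reads_per_store
    return max(reads / READ_PORTS, STORES / WRITE_PORTS)

print(min_cycles(0))   # no RMW: 2.0 cycles -- consistent with the ~2.1-2.2 measured
print(min_cycles(1))   # full RMW: 3.0 cycles -- slower than what is observed
```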

So this test doesn't disprove the RMW either, but it at least makes it less likely...

Black Belt

It is very hard to get details on cache implementations.  Parity is common for instruction caches (since the data can never be dirty), but it is problematic for a data cache that can contain dirty data -- a parity error would mean an unrecoverable data loss.  

An example of a description of a L1 Data Cache that requires Read-Modify-Write for partial-cache-line updates is http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438f/BABHFGFF.html ; It is possible that there are alternate error correction schemes that can avoid this, but I don't know of any that can handle arbitrary granularity and alignment.   I found a description of an alternate approach in http://people.ee.duke.edu/~sorin/papers/iccd06_perc.pdf, but have not read it yet.  Some of the references in the last paragraph of section II of that paper may be of interest as well -- they seem more exotic than traditional SEC-DED schemes.

My (no longer recent) experience in working with the low-level details of the implementation of ordering/consistency impressed on me the difficulty of understanding this stuff without having full access to the implementation details and a full list of all of the ordering test cases used for validation of the processor.  There is clearly a behavioral difference between the "alternating stores to L1 and L2" and "alternating stores to different lines in L1", but I am not confident that it is possible to identify the reason for this difference.  For example, it would be very hard to distinguish between "this is the best that can be done because of some specific ordering scenarios" and "this could be made to run faster, but it would be extra work and the reference workloads used by the designers suggested that the engineering effort would be better invested in other optimizations".

New Contributor II

McCalpin, John wrote:

It is very hard to get details on cache implementations.  Parity is common for instruction caches (since the data can never be dirty), but it is problematic for a data cache that can contain dirty data -- a parity error would mean an unrecoverable data loss.  

You are right, it is hard, but at least this thread on RWT seems to indicate that relatively recent Intel designs don't use ECC, only single-bit detection (i.e., something like parity). This is supported by a link to old Xeon documentation, since you could apparently run the L1D in a half-capacity mode that added ECC (presumably using the extra bits to perform the ECC function).

I was pointed there by the comments on this stackoverflow question.

 

Black Belt

I can't force myself to read through the stackoverflow question and responses, but Intel processors all use a store buffer between the core and the L1 Data Cache.   An obvious reason to use a store buffer is to reduce the probability of requiring a read-modify-write cycle at the L1 Data Cache.  (RMW is probably required even for parity?)

Information about the error checking of the various caches is probably implicit in the documentation of the Machine Check Error subsystem for each processor model.   I don't know how hard it is to find such documents....

New Contributor II

McCalpin, John wrote:

I can't force myself to read through the stackoverflow question and responses, but Intel processors all use a store buffer between the core and the L1 Data Cache.   An obvious reason to use a store buffer is to reduce the probability of requiring a read-modify-write cycle at the L1 Data Cache.

I assume you are talking about coalescing adjacent writes, but one can observe that even if you "defeat" this functionality of the store buffer, by sending a long stream of byte writes to different cache lines, you can still achieve roughly one write per cycle even in concert with two reads per cycle. So coalescing may occur (there is some evidence it does), but it is not necessary to hide the extra read implied by ECC spanning multiple bytes.

 

(RMW is probably required even for parity?)

No, I think the idea is that given a parity-like checksum across an entire cache line or some larger-than-a-byte-region, you can recalculate it when modifying a byte without knowing the value of adjacent bytes. So yes, the parity bit (or bits) themselves need to be read and adjusted based on the change in parity of the byte - but that is totally internal to the L1 and easy: it is reading more bytes from the L1, implying use of a read port, that is problematic.

Information about the error checking of the various caches is probably implicit in the documentation of the Machine Check Error subsystem for each processor model.   I don't know how hard it is to find such documents....

 

It could be - but it could also easily be the case that the documents don't go into the specific hardware implementation of the ED/EC check, since you don't really need to know that to respond to errors. It may not even need to distinguish between L1 and L2 errors at the level of the machine check interface (admittedly, I'm not sure about any of this).

Black Belt

Byte-level parity does not require an RMW cycle on systems with stores that are guaranteed to be byte-aligned and integral-byte width.   Parity on wider fields is a potential problem.

If the store buffer can handle stores to multiple cache lines (which seems almost certain), then the example above (#27, 2017-12-21) may not be telling us anything.  Since the updates are to the same two cache lines every time, the store buffer could be eliminating most updates to the L1 Data Cache.  The reads can move ahead of the commit of the writes to global visibility, and the writes have to be ordered, but don't have to be visible at any particular time.

Some results at https://nicknash.me/2018/04/07/speculating-about-store-buffer-capacity/ seem to confirm the Haswell store buffer capacity as 42 stores.  The author notes that Intel store buffers can coalesce multiple writes to the same address(es), but there is no discussion of how many cache lines are used in the tests.  The trace files at the corresponding github project (https://github.com/nicknash/GuessStoreBuffer) contain consecutive 32-bit (4-Byte) stores.   It is not clear to me whether the store buffer ought to have an additional limitation on the number of cache lines for which stores can be buffered, but it adds an extra dimension that needs to be considered....

New Contributor II

McCalpin, John wrote:

Byte-level parity does not require an RMW cycle on systems with stores that are guaranteed to be byte-aligned and integral-byte width.   Parity on wider fields is a potential problem.

I guess one might usefully distinguish between three scenarios:

1) No RMW at all: in order to recalculate any EC/ED codes, only the newly written data is needed - no existing data needs to be read.

2) Limited RMW: in order to recalculate any EC/ED codes, only the new data bytes and the existing values of those same data bytes are needed, even if the EC/ED covers a larger region. This is the case, e.g., with parity.

3) Full RMW: in order to recalculate any EC/ED codes, the new data bytes and the existing or new value of the entire protected region is needed.

The assumption is that type (2) might be much less of a problem than (3), even though they are both somehow "RMW", since only the overwritten bytes are being accessed, so the operation is "local": as part of the write flow, the delta in parity can be output without accessing any larger part of the region.

If the store buffer can handle stores to multiple cache lines (which seems almost certain), then the example above (#27, 2017-12-21) may not be telling us anything.  Since the updates are to the same two cache lines every time, the store buffer could be eliminating most updates to the L1 Data Cache.  The reads can move ahead of the commit of the writes to global visibility, and the writes have to be ordered, but don't have to be visible at any particular time.

I believe the store buffer coalesces adjacent writes to the same cache line, but there is evidence and good reason to believe that it does not do so in general for non-adjacent writes. In particular, it is very hard to maintain store-store ordering in that scenario, since coalescing non-adjacent stores essentially re-orders them. Yes they don't have to become visible "at any particular time", but I'm not sure how that helps you outside of limited special cases: in general you'll have to make a modified cache line visible at some point, and if that line has modifications that you've moved backwards in time (coalesced younger stores to this line across older stores to other lines), you immediately have a problem.

In any case, you can extend the test even more, with writes to any number of lines in whatever pattern: even if you store to 100s of distinct lines in no particular pattern, with no opportunity for coalescing, you will still observe roughly 1 store/cycle throughput if they fit in L1 - indicating that to the extent that any RMW exist, they are generally invisible performance-wise.

 

Some results at https://nicknash.me/2018/04/07/speculating-about-store-buffer-capacity/ seem to confirm the Haswell store buffer capacity as 42 stores.  The author notes that Intel store buffers can coalesce multiple writes to the same address(es), but there is no discussion of how many cache lines are used in the tests.  The trace files at the corresponding github project (https://github.com/nicknash/GuessStoreBuffer) contain consecutive 32-bit (4-Byte) stores.   It is not clear to me whether the store buffer ought to have an additional limitation on the number of cache lines for which stores can be buffered, but it adds an extra dimension that needs to be considered....

Yes, it would be an interesting test. It is made harder by the fact that coalescing is mostly invisible, since the store execution rate (1/cycle) matches the rate that lines can be committed to L1 (1/cycle), so coalescing helps with the latter but doesn't help with the former, and the bottleneck remains at 1/cycle. However, you can observe it in cases where the store buffer would otherwise fill up, or where the L1 commit rate is less than 1/cycle (e.g., because there are misses, or contention with incoming fill traffic into L1). Perhaps one could also observe the store buffer occupancy by timing how long an mfence or locked instruction takes.

Black Belt

It might also be interesting to try a test using byte writes (and hitting enough different addresses to be sure that the store buffer is not enabling elision of cache accesses).  If the parity is computed over blocks larger than bytes, then RMW would be required.

New Contributor II

McCalpin, John wrote:

It might also be interesting to try a test using byte writes (and hitting enough different addresses to be sure that the store buffer is not enabling elision of cache accesses).  If the parity is computed over blocks larger than bytes, then RMW would be required.

 

I may not have been clear enough above, but in the middle part of my last reply, I was saying I have done this test. Pretty much up to the size of the L1, any pattern of stores, even when lines do not repeat for a long time (or never repeat, for a short test), precluding any coalescing, you still achieve 1 store per cycle.

So while there might internally be an RMW implementation, it is entirely or almost entirely invisible performance-wise (i.e., it doesn't steal resources from other operations and doesn't reduce the store throughput).

It seems entirely plausible to me that the EC/ED code can be calculated over a larger region than a byte, but without accessing any general purpose read port of the cache. I don't know if you want to call that an RMW or not.

Black Belt

I don't know how one would implement parity on (for example) 8-byte blocks that could be updated for single-byte writes without reading something, but maybe I am just dense....

For Haswell, it looks like no one has found a pattern of stores that suggests interference with loads, but this could be a model-specific feature.   Note that in Table 2-1 of the Intel Optimization Manual (248966-040), Haswell/Broadwell are reported to have a maximum L1 Data Cache bandwidth of 96 Bytes/cycle and a "sustained" bandwidth of 93 Bytes/cycle (97% of peak), while Skylake Server has a peak L1 BW of 192 Bytes/cycle and a "sustained" bandwidth of 133 Bytes/cycle (69% of peak). 
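The sustained-to-peak ratios quoted from Table 2-1 can be checked in one line each:

```python
# Sustained/peak L1D bandwidth ratios from the Optimization Manual figures above.
for name, peak, sustained in (("Haswell/Broadwell", 96, 93),
                              ("Skylake Server", 192, 133)):
    print(f"{name}: {sustained / peak:.0%} of peak")
```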

For Haswell, my interpretation is that the L1 Data Cache has two 64-Byte read ports (see notes at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...), plus a write port that is at least 32 Bytes wide.   The 2x64-Byte read port configuration is probably enough for Skylake Server as well, though I have not run the same tests of throughput for aligned and unaligned loads of all possible sizes on SKX.   I don't currently have a hypothesis on why SKX shows such a huge drop in L1D throughput when stores are added to the mix, and have not done testing to see how the throughput depends on the SIMD width of the loads and stores.

New Contributor II

McCalpin, John wrote:

I don't know how one would implement parity on (for example) 8-byte blocks that could be updated for single-byte writes without reading something, but maybe I am just dense....

As I mentioned, you have to "read" something, but that something is simply the bytes you are overwriting (strictly speaking you don't even need the byte values, simply their total parity). So in that sense a write doesn't imply a read of other bytes, just a read of the bytes that are being overwritten anyway, and this seems a lot easier: not needing a full read port or anything like that. Conceptually it could be done entirely within the array itself.

Anyways, maybe that clears up the confusion, but if not, it works because xor is commutative and associative:

Given the existing parity P for a bit string, and changing some subset of the bits with new parity N and old parity O, the new parity is simply P ^ N ^ O. So you can update the parity without reading any of the other bits, in contrast to more advanced ECC schemes (as far as I am aware).
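The P ^ N ^ O identity is easy to demonstrate concretely. A quick sketch, using an 8-byte block with a single parity bit over all 64 bits (the block value and byte position are arbitrary examples):

```python
# Incremental parity update: new_parity = P ^ parity(N) ^ parity(O), where only
# the overwritten byte's old and new values are needed -- no other bytes are read.
def parity(x: int) -> int:
    """XOR-reduction of all bits of x."""
    return bin(x).count("1") & 1

old_block = 0xDEADBEEF01234567
P = parity(old_block)

# Overwrite the byte at bit offset 16 (value 0x23 here) with 0xFF:
O = (old_block >> 16) & 0xFF          # old byte (parity of the bits being replaced)
N = 0xFF                              # new byte
incremental = P ^ parity(N) ^ parity(O)

# Cross-check against a full recomputation over the updated block.
new_block = (old_block & ~(0xFF << 16)) | (N << 16)
assert incremental == parity(new_block)
```

This works because XOR is commutative and associative: flipping a subset of bits changes the total parity by exactly the parity of the flipped bits, parity(O ^ N).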

 
