
SKL - strange memory behavior

aric
Beginner

Hi,

I'm using a dual-socket SKL system, configured with NUMA disabled. All other memory interleaving settings are at their defaults.

I wrote a simple test program that allocates a 1 GB buffer and endlessly writes sequential 8-byte (uint64) values to it. The program runs on a single core only.

In parallel, I monitor memory behavior using Intel's PCM. The following is a snapshot of the pcm-memory.x output:

 

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):   254.64 --||-- Mem Ch  0: Reads (MB/s):   254.70 --|
|--            Writes(MB/s):   252.94 --||--            Writes(MB/s):   504.15 --|
|-- Mem Ch  1: Reads (MB/s):   254.76 --||-- Mem Ch  1: Reads (MB/s):   254.28 --|
|--            Writes(MB/s):   253.00 --||--            Writes(MB/s):   503.39 --|
|-- Mem Ch  3: Reads (MB/s):   254.87 --||-- Mem Ch  3: Reads (MB/s):   254.73 --|
|--            Writes(MB/s):   253.12 --||--            Writes(MB/s):   504.18 --|
|-- Mem Ch  4: Reads (MB/s):   254.81 --||-- Mem Ch  4: Reads (MB/s):   254.64 --|
|--            Writes(MB/s):   252.99 --||--            Writes(MB/s):   504.15 --|
|-- NODE 0 Mem Read (MB/s) :  1019.08 --||-- NODE 1 Mem Read (MB/s) :  1018.36 --|
|-- NODE 0 Mem Write(MB/s) :  1012.05 --||-- NODE 1 Mem Write(MB/s) :  2015.87 --|
|-- NODE 0 P. Write (T/s):     101012 --||-- NODE 1 P. Write (T/s):     101069 --|
|-- NODE 0 Memory (MB/s):     2031.13 --||-- NODE 1 Memory (MB/s):     3034.23 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                 System Read Throughput(MB/s):       2037.45                --|
|--                System Write Throughput(MB/s):       3027.91                --|
|--               System Memory Throughput(MB/s):       5065.36                --|
|---------------------------------------||---------------------------------------|

 

Hope you can read through the non aligned text capture:

In the local socket (socket 0), memory reads are equal to memory writes. I assume the reads are simply store misses that require fetching the relevant cache line.

However, on the remote socket (socket 1), the write BW is doubled. In the example above, my code writes a total of 2000 MB/s, which should have been split as 1000 MB/s per socket, but here it is 1000 MB/s + 2000 MB/s.

Any ideas why?

Thanks.

McCalpinJohn
Honored Contributor III

Both of your observations show the expected behavior. 

(1) Unless you use streaming stores, any store that misses in the caches will have to read the cache line from memory before updating it.  The cache transaction generated by a store miss is called "Read For Ownership" (RFO), which combines a data read with an invalidation of the line in the caches of all other cores.

(2) The write traffic generated by reads to the remote socket is required to support the "memory directory" feature that is enabled by default in 2-socket SKX systems.  (I think it was only enabled by default in 4-socket and 8-socket systems previously.)

(I feel sure that this topic has been discussed on the forums before, but I am not able to find the reference right now.)

The reason for the behavior is described in slides 7 and 8 of my April 2018 presentation:

https://www.ixpug.org/documents/1524216121knl_skx_topology_coherence_2018-03-23.pptx

The short explanation is:

  • A "memory directory" is one or more bits per cache line (hidden in the error correction bits) that indicate(s) whether a cache line might have a dirty copy in another socket.
  • For local reads, if the bit shows "clean" (no possibility of a dirty copy in another socket), then there is no need to issue a snoop request to the other socket(s).  If a snoop request was already sent (e.g., under light loading of the UPI interface), there is no need to wait for the response.  
  • For remote reads of data that is not in any cache, the default response is to provide the data in the "Exclusive" state.  This is a protocol optimization that allows a core to write to an unshared cache line without requesting further permission.  Unfortunately, this means that the "memory directory" bit must be set in that cache line in the home node, and the entire cache line must be written back to DRAM so that the updated "memory directory" bit is not lost.
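The two effects described above (an RFO read for every store miss, plus a directory-update write for every remote read) can be turned into a rough traffic estimate. The following is only a sketch: the function name `dram_traffic`, the even split of the buffer across the two sockets, and the assumption that every remote read triggers exactly one full-line directory write are all illustrative assumptions, not something PCM or the hardware documentation states in this form.

```python
def dram_traffic(store_mb_s, remote_fraction=0.5, directory=True):
    """Estimate DRAM reads/writes (MB/s) per socket for a streaming-write loop.

    Assumptions (hypothetical, for illustration only):
      * every stored cache line misses, so it is read once (RFO) and
        later written back once;
      * `remote_fraction` of the lines are homed on the remote socket;
      * with the memory directory enabled, each remote read causes one
        extra full-line write on the home (remote) socket.
    """
    local_share = store_mb_s * (1.0 - remote_fraction)
    remote_share = store_mb_s * remote_fraction

    local = {"reads": local_share, "writes": local_share}
    remote = {"reads": remote_share, "writes": remote_share}

    if directory:
        # Directory update: one line written per remotely read line.
        remote["writes"] += remote["reads"]
    return local, remote


local, remote = dram_traffic(2000.0)
print("local :", local)
print("remote:", remote)
```

With 2000 MB/s of stores and a 50/50 interleave, this gives roughly 1000/1000 MB/s (reads/writes) on the local socket and 1000/2000 MB/s on the remote socket, which is the pattern in the pcm-memory snapshot at the top of the thread.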
HadiBrais
New Contributor III

McCalpin, John (Blackbelt) wrote:

Unfortunately, this means that the "memory directory" bit must be set in that cache line in the home node, and the entire cache line must be written back to DRAM so that the updated "memory directory" bit is not lost.

I don't see why the entire cache line has to be written to memory just to update the memory directory coherence state of the line. Even if the write bandwidth performance counter is incremented by 64 bytes for each directory update, that doesn't necessarily mean that 64 bytes are actually written; it could just be the way the counter works. Does a memory directory lookup (on a directory cache miss) require reading the whole 64-byte cache line?

Slide 6 of the presentation seems to say that the memory directory feature is new on SKX. But I think it has been supported since Sandy Bridge on the E5 and E7 processors (there are a couple of perf events related to the directory mentioned in the uncore manuals). The directory cache was then added in Haswell.

Slide 7 says:

Snoop response time from the other socket is almost always larger than the latency to get data from DRAM

My understanding is that fetching a line from a remote L3 is almost always faster than from the local memory node. So I don't think the snoop response time would be significantly different from fetching a line from a remote L3. But I think the main point here is that not having to wait for the snoop response reduces latency, not that snoop response time is larger or smaller than memory read time. This is one of the benefits of the memory directory.

McCalpinJohn
Honored Contributor III

When the directory bit is modified, the whole cache line has to be written back for two reasons: (1) the ECC code for the line must be updated, which (depending on the implementation) can require many bits to be changed, and (2) the default granularity of writes to DDR4 is 64 Bytes (72 with ECC), so there is no benefit to writing less than the full line.  (The write can be "chopped" to half this length, but there is no performance benefit in doing so.)  On KNC (Xeon Phi x100), the main memory was GDDR5, and the GDDR5 "write data mask" feature was used to enable partial-cacheline updates of the ECC bits, but the implementation of ECC on KNC was unlike anything else you are likely to run across.

My comment about snoop response time from the other socket being higher than local latency was a broad generalization applying to the case of data that ends up coming from local DRAM.   For data that is in a cache in the remote socket, there are many cases to consider.   If the cache line is dirty (or forwardable) in the remote L3, the latency may be lower, but the protocol flow depends on the "home" for the address.  If the address is remotely homed, the snoop will be sent as soon as the local L3 & Snoop Filter confirm a miss.  If the address is locally homed, the remote latency depends on whether the local processor broadcasts the snoop to the other socket in parallel with the local memory read.   This will typically happen if the QPI/UPI load is low, but if the load is high, then the snoop will be deferred until the cache line returns from memory and the directory bit can be examined.   If the data is dirty in an L1 or L2 on the remote socket, the latency will be higher than if it is dirty (or otherwise forwardable) in the L3, since the address must be looked up in the L3 & Snoop Filter first to discover that the line is dirty in an L1 or L2.   The timing will depend on the uncore frequency of both chips and on the core frequency of the core holding the data (for L1 and L2 interventions).  

HadiBrais
New Contributor III

According to the source code of pcm-memory, the first counter of the IMC is programmed to count Event 0x4 and Umask 0x3 and the second counter of the IMC is programmed to count Event 0x4 and Umask 0xC. These events correspond to the number of CAS commands for reads and writes, respectively. The bandwidth calculations seem to assume that each CAS command performs a transfer of 64 bytes (full cache line). This is probably a very reasonable assumption for most applications, but it is not always true. In particular, when the WPQ becomes (nearly) full and all of the pending requests in the queue are still partial writes, the IMC will probably just perform some of the partial writes. The DDRx protocol supports a bus width of 8 bytes, but the burst feature allows accessing 64 bytes using a single command. In addition, each byte has an enable signal.  I'm not sure though how the ECC byte is handled by the protocol (whether it has its own enable signal and who calculates ECC). Also, we don't know whether a directory update requires modifying more than one ECC byte.
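The conversion described above (CAS count times an assumed 64 bytes per CAS) can be sketched as follows. This is not code from pcm-memory; the function name `cas_to_mb_per_s`, the constant names, and the choice of 1 MB = 10^6 bytes are assumptions made for illustration. Only the event and umask values come from the description above.

```python
# IMC event encodings as described above (Event 0x4, Umask 0x3 / 0xC).
CAS_EVENT = 0x04
UMASK_CAS_READS = 0x03
UMASK_CAS_WRITES = 0x0C

CACHE_LINE_BYTES = 64  # bytes assumed to move per CAS command


def cas_to_mb_per_s(cas_count, interval_s, bytes_per_cas=CACHE_LINE_BYTES):
    """Convert a CAS command count over an interval to MB/s.

    Assumes 1 MB = 10**6 bytes (an assumption of this sketch) and that
    every CAS command transfers `bytes_per_cas` bytes.
    """
    return cas_count * bytes_per_cas / 1e6 / interval_s


# One million write CAS commands in one second:
print(cas_to_mb_per_s(1_000_000, 1.0))                   # full 64-byte lines
print(cas_to_mb_per_s(1_000_000, 1.0, bytes_per_cas=8))  # if only 8 bytes moved
```

The second call shows the point being argued: if some CAS commands moved fewer than 64 bytes, the same counter value would correspond to less data than the tool reports.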

If you agree that all of the above is correct, then updating the directory state of a line doesn't necessarily require updating the whole line and it's just an assumption made by pcm-memory. For example, it could just disable all of the enable signals for each of the 8 bytes and just update the ECC byte, if it only needs to update a single ECC byte. But I don't know if this is possible, depending on how exactly the ECC byte is handled. I'm thinking that for the IMC to be able to repurpose the usage of some of the ECC bits to implement the memory directory, the protocol should allow it to control how each of the ECC bits is calculated. According to slide 29 of this Intel presentation, only 2 bits per cache line are required for the memory directory on Haswell, and there is no indication that this has changed on Skylake as far as I know.

McCalpinJohn
Honored Contributor III

According to the JEDEC spec, DDR4 does not support write masks for x4 DRAM configurations, so it can't be ubiquitous.   With x8 DRAMs, write masking is a configuration option, but only one of the three options [Termination data strobe, Write Masking, Data Bus Inversion] can be enabled at the same time.  I don't know when the other two options are useful, but the processor certainly can't count on write masking being available.

I don't see any evidence that Intel server processors can be configured to enable DDR4 write data masks.  In the SKX uncore performance monitoring guide, for example, partial line writes (e.g., from partially filled write-combining buffers) are associated with "underfill reads", which suggests that the merge happens in the memory controller and the full line is written back.

Sadly, no one gave me a DRAM logic analyzer for my birthday, so I can't look at the actual bits and see how all of this is handled at the lowest level.

HadiBrais
New Contributor III

What we know for sure is that a single CAS command is used to update the memory directory, which causes a CAS for write event to occur. The pcm-memory tool multiplies each such event by 64 bytes to measure bandwidth, which is probably a good approximation on average. This is also how the uncore manual says "memory bandwidth" can be measured per channel. But it could work like you said, if write masking is not supported by the DIMM modules or by the memory controller (in which case multiplying the number of CAS events by 64 bytes would be an accurate measurement of bandwidth). But note that it can also just write 8 bytes in that case.

McCalpin, John (Blackbelt) wrote:

Sadly, no one gave me a DRAM logic analyzer for my birthday...

:)

McCalpinJohn
Honored Contributor III

"Bandwidth" can mean a lot of things.   In the case of masked stores (PCIe or GDDR5), all the data is being sent, plus additional data for the masks, but only the enabled bytes are getting written to the target.  Whether that constitutes a reduction in bandwidth depends on exactly where you are counting the traffic.

DDR4 does support "burst chop 4", which allows reading or writing 32 Bytes instead of 64 Bytes, but this only saves a small number of cycles in a few cases (relative to the full 64-byte write), so it is infrequently used.  Most (not all) ECC schemes work on the bits provided by 1-2 bursts, so it should be possible to update 1/2 of a cache line (including ECC) without needing to read the other half.   In the case of updating one or two directory bits, the line has already been read (to send to the other socket), so (depending on exactly how many directory bits are present and where they are located in the cache line's data+ECC memory) it should be possible to write 32 Bytes + ECC.   I have not seen any evidence in the uncore performance monitoring manuals that Intel actually uses burst chop writes, but they certainly don't document everything!
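The transfer sizes mentioned in this thread (64 Bytes per write, 72 with ECC, and half that with burst chop) follow from the bus width and the burst length. A small sketch of that arithmetic, where the function name `burst_bytes` is made up for illustration and the 64-bit data bus plus 8 ECC bits is the usual ECC DIMM layout assumed here:

```python
def burst_bytes(bus_bits, burst_length):
    """Bytes moved by one burst: bus width in bytes times number of beats."""
    return (bus_bits // 8) * burst_length


DATA_BUS_BITS = 64  # data bits per beat
ECC_BUS_BITS = 72   # data + 8 ECC bits per beat (assumed ECC DIMM layout)

print(burst_bytes(DATA_BUS_BITS, 8))  # full burst, data only
print(burst_bytes(ECC_BUS_BITS, 8))   # full burst, data + ECC
print(burst_bytes(DATA_BUS_BITS, 4))  # burst chop 4, data only
print(burst_bytes(ECC_BUS_BITS, 4))   # burst chop 4, data + ECC
```

This is only the payload arithmetic. It says nothing about how many bus cycles a chopped burst actually saves, which is the reason given above for burst chop being rarely worthwhile.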

Cao__Henry
Beginner

McCalpin, John (Blackbelt) wrote:

(1) Unless you use streaming stores, any store that misses in the caches will have to read the cache line from memory before updating it.  The cache transaction generated by a store miss is called "Read For Ownership" (RFO), which combines a data read with an invalidation of the line in the caches of all other cores.

If I have a producer write 200 bytes to a ringbuffer in shared memory, there would be multiple store misses (one per cache line), because the destination is a ringbuffer.  Does the producer get blocked until a) the data is transferred from main memory, b) invalidation of the cache lines in all other cores completes, or c) both a and b?  I am also using SKL Xeon.

Your slide #7 says "Memory latency has been increasing with core count".  It sounds like the producer would be blocked until invalidation of the cache lines in all other cores completes?
 

McCalpin, John (Blackbelt) wrote:

For remote reads of data that is not in any cache, the default response is to provide the data in the "Exclusive" state.  This is a protocol optimization that allows a core to write to an unshared cache line without requesting further permission.  Unfortunately, this means that the "memory directory" bit must be set in that cache line in the home node, and the entire cache line must be written back to DRAM so that the updated "memory directory" bit is not lost.

So if a remote reader reads the 200 bytes of data (assuming the 200 bytes aren't in L1/L2/L3, or at best they are invalid as it is a remote ringbuffer), it will trigger something (what is that?) to write the corresponding cache lines back to DRAM.  If so, will the reader be blocked until the write to DRAM is complete?
 

McCalpinJohn
Honored Contributor III

Cao, Henry wrote:

If I have a producer write 200 bytes to a ringbuffer in shared memory, there would be multiple store misses (one per cache line), because the destination is a ringbuffer.  Does the producer get blocked until a) the data is transferred from main memory, b) invalidation of the cache lines in all other cores completes, or c) both a and b?  I am also using SKL Xeon.

The answer depends on many details of the hardware, the specific user software, and the timing of the transactions.

When I am benchmarking producer/consumer transactions, I repeat the operation many times, so the coherence transactions associated with the first iteration are not important for the overall performance.   My microbenchmarks don't include any other code, so the buffers never get displaced from the caches by unrelated memory accesses.  In this case, the producer starts each iteration with the buffer in cache, but without write permission ("S" state) -- because the consumer read the buffer in the previous iteration.   The producer's store is placed into the core's store buffer, and an "upgrade" transaction is sent to the system.  The "upgrade" transaction requests that the line be invalidated in all other private caches, giving the producer's cache write permission.  From Table 3-1 of the Xeon Scalable Processor Uncore Performance Monitoring Reference Manual (document 336274-001) it looks like Intel calls this an "ItoM" (Request Invalidate Line) transaction on SKX.  The producer core does not "block" on these stores -- the address and data are handed off to the store buffer and (in program order) the store instruction is retired.  Cache coherence transactions are handled autonomously and asynchronously by the cache controllers.   The cache controller(s) do "block" on the transaction -- the cache lines containing the store targets cannot be updated until the corresponding invalidate requests from the upgrade transactions have been acknowledged.
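The distinction between the core (which does not block) and the cache controller (which does) can be illustrated with a toy model. This is a deliberately simplified sketch: the class names `SharedLine` and `Producer`, the two-cache setup, and the `invalidation_acked` flag are all invented for illustration and do not describe the real SKX pipeline or protocol messages.

```python
from collections import deque


class SharedLine:
    """One cache line tracked in two private caches (toy MESI states)."""

    def __init__(self):
        # Both copies start Shared: the consumer read the buffer last iteration.
        self.state = {"producer": "S", "consumer": "S"}
        self.value = 0


class Producer:
    def __init__(self, line):
        self.line = line
        self.store_buffer = deque()
        self.retired_stores = 0
        self.upgrade_pending = False

    def store(self, value):
        """Core side: hand the store to the store buffer and retire at once."""
        self.store_buffer.append(value)
        self.retired_stores += 1
        if self.line.state["producer"] != "M":
            self.upgrade_pending = True  # ask for the other copies to be invalidated

    def drain(self, invalidation_acked):
        """Cache-controller side: commit one buffered store if permitted."""
        if not self.store_buffer:
            return False
        if self.line.state["producer"] != "M":
            if not invalidation_acked:
                return False  # the line cannot be updated yet
            self.line.state["consumer"] = "I"
            self.line.state["producer"] = "M"
            self.upgrade_pending = False
        self.line.value = self.store_buffer.popleft()
        return True


line = SharedLine()
producer = Producer(line)
producer.store(42)
print(producer.retired_stores, line.value)  # store has retired, line not updated yet
producer.drain(invalidation_acked=False)    # controller still waiting
producer.drain(invalidation_acked=True)     # ack arrived, store commits
print(line.state, line.value)
```

In this model the store retires as soon as it is in the store buffer, while the value only reaches the cache line after the invalidation has been acknowledged, which mirrors the behavior described above.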

Cao, Henry wrote:

Your slide #7 says "Memory latency has been increasing with core count".  It sounds like the producer would be blocked until invalidation of the cache lines in all other cores completes?

The producer core is not blocked, but the producer's cache controller cannot update the line in its cache(s) until the invalidations have been acknowledged.  These invalidations can be either "directed" (targeting a single cache) or "broadcast" (sent to all caches in the same chip).  Directed invalidations are clearly appropriate for lines in "M" or "E" state (since those can only be in one cache), but it is also possible to use directed invalidations for lines in "S" state if a directory (such as the "HitME" cache) is able to track specific core(s) whose caches might contain the shared lines.  If such an infrastructure exists, the decision to use directed or broadcast invalidations will be based on the number of sharers, and may also vary dynamically based on the load on the request and response buses. 

Cao, Henry wrote:

So if a remote reader reads the 200 bytes of data (assuming the 200 bytes aren't in L1/L2/L3, or at best they are invalid as it is a remote ringbuffer), it will trigger something (what is that?) to write the corresponding cache lines back to DRAM.  If so, will the reader be blocked until the write to DRAM is complete?

It is not typically necessary to wait for writebacks to DRAM.  Early processors often did this to make the protocol easier, but this case is common enough to have led to protocol optimizations a long time ago.  The "O" ("Owned") state allows a cache holding modified data to forward the data to the requester, while keeping track of the fact that its copy of the data is still "dirty" with respect to the (stale) value in memory.  The O-state line must be written back to memory eventually, but this is not in the critical path of the cache-to-cache transfer.  (https://en.wikipedia.org/wiki/MOESI_protocol)
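The role of the "O" state can be shown with a toy two-cache model. It follows the generic MOESI behavior from the Wikipedia article linked above and is not a model of any specific Intel implementation; the class name `ToyMoesi` and its methods are hypothetical.

```python
class ToyMoesi:
    """Two private caches and one memory location for a single cache line."""

    def __init__(self, memory_value):
        self.memory = memory_value
        # Each cache holds (state, value); both start Invalid.
        self.caches = {"writer": ("I", None), "reader": ("I", None)}
        self.dram_writes = 0

    def write(self, value):
        """Writer modifies the line: its copy becomes M, the other copy I."""
        self.caches["writer"] = ("M", value)
        self.caches["reader"] = ("I", None)

    def read(self):
        """Reader loads the line; a dirty copy is forwarded cache-to-cache."""
        state, value = self.caches["writer"]
        if state in ("M", "O"):
            # Forward dirty data without writing DRAM; writer keeps it as Owned.
            self.caches["writer"] = ("O", value)
            self.caches["reader"] = ("S", value)
            return value
        # No cached copy elsewhere: fetch from memory.
        self.caches["reader"] = ("E", self.memory)
        return self.memory

    def evict_writer(self):
        """Eviction of a dirty (M or O) line finally updates memory."""
        state, value = self.caches["writer"]
        if state in ("M", "O"):
            self.memory = value
            self.dram_writes += 1
        self.caches["writer"] = ("I", None)


system = ToyMoesi(memory_value=0)
system.write(7)
print(system.read(), system.memory, system.dram_writes)  # reader sees new data, memory stale
system.evict_writer()
print(system.memory, system.dram_writes)                 # writeback happens later
```

The reader receives the new value while memory still holds the stale one, and the DRAM write only happens at eviction, off the critical path of the cache-to-cache transfer.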
