Silent Eviction

LenchnerAlla · 07-20-2022

Hello, I have a question regarding the SKL processor. I assume the coherence protocol is MESIF.

I would like to understand what happens exactly in silent eviction.

Let's assume the following scenario:

A clean cache line is in "Exclusive" state on a remote node and it goes through silent eviction. Obviously, the home node's directory entry for this line is not updated.

What happens when the home node tries to read it from the remote node?

Assuming the home node sends requests both to local and remote DRAMs, which latency will be reported for this scenario? Is it likely that the latency will be counted as from the remote DRAM, even though the response arrived eventually from the local one?

Thanks, Alla

McCalpinJohn · 07-20-2022

I think you mean "SKX" processor -- "SKL" processors are single-socket only.

The detailed sequence of events will depend on the timing of the requests and the load on the system.

When the remote node makes the initial request for the line, the home node provides the line, marks the directory bit as "possibly dirty", and re-writes the line to DRAM.

When a local core requests the same line, the CHA will miss in the L3 and Snoop Filter and will send the request to the local IMC. If the UPI interfaces are lightly loaded, the snoop will be sent to the other chip in parallel. If the UPI interfaces are heavily loaded, the snoop will be deferred until the data returns from local memory.

Once the line is returned from local memory, the CHA will see that the "possibly dirty" bit is set, making the snoop required. If the snoop has already been sent, the CHA just waits for the snoop response; otherwise, it sends the (deferred) snoop.

For a cache line that has been silently evicted, the snoop response will report a cache miss, so the data from memory is valid and the CHA can forward the data from memory to the requesting local core. The CHA will then create a Snoop Filter entry indicating E state at the requesting core.
If the cache line has not been evicted, it will still be in E state in the remote cache. Most systems will downgrade the remote copy from E to S and provide the data (from memory) in S state to the local requesting core. The IMC will clear the "possibly dirty" bit in the directory and re-write the cache line to DRAM.
If the cache line was modified and is still in the remote cache, then the snoop response will indicate a dirty hit. There are several different possible flows here (note that these flows might be chosen dynamically by the hardware based on history and utilization information):

Force the dirty data to be written back from the remote node to the home node DRAM, downgrading the line to S or I state. (This will automatically include clearing the "possibly dirty" memory directory bit as part of the write to DRAM.) Then the line can be forwarded to the local requestor.
(AMD processors) Downgrade the line to "O" (Owned) state in the remote cache and forward a copy to the requestor. Memory is not updated, so lines in "O" state cannot be silently dropped (they must be written back to memory upon eviction), and a cache holding a line in "O" state must respond with the data on any snoop request.
Invalidate the dirty copy in the remote node and return the dirty data in M state to the requesting core. This avoids unnecessary updates to memory in the case of cache lines that are moved around a lot. This is often referred to as "migratory" behavior.
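
The three snoop outcomes and the three dirty-hit flows above can be summarized in a small toy model. This is only a sketch of the cases as described in this thread, not a description of the actual hardware logic: the names `Snoop`, `Outcome`, `resolve_local_read`, and the `dirty_flow` knob are all made up for illustration, and it assumes the "possibly dirty" directory bit is already set so that a snoop is required. Where the thread does not say which state the requester ends up in, the model returns None rather than guessing.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Snoop(Enum):
    MISS = "miss"            # remote copy was silently evicted
    CLEAN_HIT = "clean hit"  # remote copy still present in E state
    DIRTY_HIT = "dirty hit"  # remote copy was modified


class Outcome(NamedTuple):
    data_from: str            # where the requester's data comes from
    remote_after: str         # remote cache state once the flow completes
    requester: Optional[str]  # requester's state; None if not stated in the thread


def resolve_local_read(snoop: Snoop, dirty_flow: str = "writeback") -> Outcome:
    """Toy model of a local read to a line whose directory bit is "possibly dirty".

    dirty_flow is a hypothetical knob selecting one of the three dirty-hit
    flows; real hardware may choose among them dynamically.
    """
    if snoop is Snoop.MISS:
        # Silent eviction: memory data is valid, requester is tracked in E.
        return Outcome("memory", "I", "E")
    if snoop is Snoop.CLEAN_HIT:
        # Remote copy is downgraded E -> S, requester gets memory data in S.
        return Outcome("memory", "S", "S")
    # Dirty hit: three possible flows.
    if dirty_flow == "writeback":
        # Dirty data is written back to home DRAM, then forwarded.
        return Outcome("remote writeback", "S or I", None)
    if dirty_flow == "owned":
        # AMD-style: remote keeps the line in O, memory is not updated.
        return Outcome("remote cache", "O", None)
    if dirty_flow == "migratory":
        # Remote copy is invalidated, requester receives the line in M.
        return Outcome("remote cache", "I", "M")
    raise ValueError(f"unknown dirty_flow: {dirty_flow!r}")


for snoop in Snoop:
    print(snoop.value, "->", resolve_local_read(snoop))
```

The silent-eviction case from the original question is the `Snoop.MISS` row: the data comes from local memory, but only after the snoop response confirms that the remote node no longer holds the line.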

It is possible for a snoop response to be returned before the data from memory, but that should be rare. This case does not present any problems -- a clean snoop response simply means that the CHA has to continue to wait for the data from memory.

What about the "reported latency" -- that depends on who is doing the reporting! If you are talking about the latency reported by the PEBS facility, then it will be the latency seen by the core. Any serialization of coherence operations (both those required for correctness and those implemented as snoop bandwidth reduction opportunities) at the CHA will show up before the data is returned to the core, so the core will only see the "net" latency.
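
To make the "net" latency idea concrete, here is a deliberately crude model of the silent-eviction case, where the data comes from local memory but the CHA must still collect the snoop response. It assumes the core sees roughly the later of the two events when the snoop is issued in parallel, and their sum when the snoop is deferred until the memory data returns. The function name `net_latency_ns` and the 80 ns / 120 ns figures are placeholders chosen for illustration, not measurements, and the model ignores queueing, mesh traversal, and any other overhead.

```python
def net_latency_ns(mem_ns: float, snoop_ns: float, snoop_in_parallel: bool) -> float:
    """Toy estimate of the latency the core sees when a snoop is required.

    mem_ns:   hypothetical local-memory access time
    snoop_ns: hypothetical round trip for the snoop to the remote socket
    """
    if snoop_in_parallel:
        # Lightly loaded UPI: snoop and memory read overlap, so the core
        # waits for whichever finishes last.
        return max(mem_ns, snoop_ns)
    # Heavily loaded UPI: snoop is sent only after memory data returns,
    # so the two delays are serialized.
    return mem_ns + snoop_ns


# Placeholder numbers, not measurements.
print(net_latency_ns(80, 120, snoop_in_parallel=True))   # 120
print(net_latency_ns(80, 120, snoop_in_parallel=False))  # 200
```

Under these assumptions, a load whose data is ultimately supplied by local DRAM can still show a latency that looks closer to a remote access, because the core only observes the total time until the CHA releases the data.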