I don't know of any public documentation on the QPI protocol. Most of the information that I have been able to come up with is implicit in the performance counter events described in the uncore performance monitoring manuals for the various processor generations. The information about the QPI protocol is spread across multiple sections of those documents, particularly the QPI Link Layer and the Ring to QPI (R3QPI) interface sections, but there is also a lot of information in the CBo and HA sections, particularly in relation to matching on QPI message class and opcode.
You did not mention what sort of rates of HITM events you are seeing. Are these events happening at anywhere near the rates you are expecting for memory accesses for your benchmark, or could they be due to background OS activity?
I am running Linux with cores isolated, housekeeping threads moved out of the way, and nohz mode. I should have mentioned that, so it is not OS activity. I will confess that the remote HITM events I see are whatever perf c2c considers HITM events. perf c2c does show the line of code where each event occurred, so I am sure it is happening in my code and at the expected places. It just seemed very odd to me that the modified state on one socket would be transferred to the other socket on a read-only request.
I was trying to find some additional resources, and this paper was helpful. Maybe the read request is acting like the RFO case described there, so it moves the modified state to the other socket. With some additional prefetching, I seem to be able to hide most of the latency and the hits that were occurring. I wish there were better documentation around QPI, for sure.
There are lots of choices in how to handle Data Read requests that hit Modified lines.
- Most systems downgrade the modified line from M to S and provide the line to the consumer in S.
- Early protocols triggered a Writeback here, so memory would be consistent.
- Later protocols require either the source or the destination to take the line in a dirty state (typically called "O", or "T" on some IBM processors).
- The O state is dirty but read-only, so all of the cached copies are the same, but not consistent with memory.
- The cache holding the data in the O state must write the data back when the line is chosen as a victim.
- The Wikipedia article on the MOESI protocol says that the O state is writable, but I have never seen such an implementation. Every system that I have worked on requires an upgrade to M on a write (which invalidates all shared/clean copies, rather than "broadcasting" the updated value as suggested in the Wikipedia page).
- If I recall correctly, the IA64 architecture defaulted to M -> M transfers on ordinary reads.
- This provides better performance for "migrating" lines used as communication buffers/semaphores/etc, but worse performance for shared data.
- Many of the newer processors have much more complex behavior.
- At the very least, behavior is different for local and remote NUMA accesses.
- In multi-level caches without forced inclusion, the cache level(s) at which the data is stored can depend on how "busy" each level of the cache happens to be when the request is generated.
- There is some evidence of dynamically adaptive behavior, with different cache state transitions chosen depending on the hardware's prediction of how the line will be used in the future.
- All of this becomes essentially impossible to understand for "closed-source" processors if a transaction happens while a line is in a transient state from a prior transaction.
- It is also nearly impossible to figure out what is happening if demand loads hit prefetches in flight. The required state transitions are typically fairly clear, but we have no idea if or how the hardware performance counters classify these transactions.
- Systems typically do not provide prefetches that are architected to match the actual cache states used. When software prefetches are provided, they are typically defined with "abstract" semantics. The specific implementations only cover a small subset of the allowable transaction types, and the mapping of prefetches to transaction types is not documented. (And the mapping may be dynamically adjusted by the hardware anyway....)
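The first three bullets above can be summarized as a small decision table. This is a hypothetical sketch, not any real processor's implementation: the policy names and the function are mine, and the states follow the generic MOESI naming. It shows, for each policy, what happens when a Data Read from one cache hits a line another cache holds in Modified state.

```python
def read_hits_modified(policy):
    """Hypothetical sketch of three ways a coherence protocol can resolve a
    Data Read that hits a line held in Modified state by another cache.
    Returns (owner_state, requester_state, writeback_to_memory)."""
    if policy == "downgrade_with_writeback":
        # Early protocols: write the dirty data back to memory, then both
        # caches hold clean Shared copies, so memory stays consistent.
        return ("S", "S", True)
    if policy == "owner_keeps_dirty_O":
        # MOESI-style: the owner drops from M to O (dirty, read-only) and the
        # requester gets S; memory stays stale until the O copy is evicted.
        return ("O", "S", False)
    if policy == "migrate_M":
        # IA64-style default: transfer the line dirty, so the requester now
        # owns it in M and the previous owner invalidates its copy.
        return ("I", "M", False)
    raise ValueError(f"unknown policy: {policy}")
```

The "migrate_M" row is why that choice helps migrating communication buffers (the consumer can write immediately, with no upgrade transaction) but hurts genuinely shared read-mostly data (every reader steals the line from the previous holder).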
Thanks, John. Appreciate the insights. It will be interesting to see if the same thing happens on Skylake Xeon; I just got a test box and will try it out. I am running in Early Snoop mode, as the latency profile seems better than Home Snoop with Dir+OSB mode, but I will be re-testing that as well for worst-case performance.