Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

What is the ItoM IDI opcode in the performance monitoring events, and how is it different from RFO?

Lewis_Kelsey
Beginner

I have searched everywhere, verbatim, for what the difference is, and nobody seems to know. Some descriptions say it's PCIe related (so I guess DMA traffic from the IIO), although there is already a PCIeItoM or PCIItoM opcode, suggesting that ItoM actually originates from cores->LLC rather than LLC->cores. This is supported by the description of OFFCORE_REQUESTS.DEMAND_RFO: 'Counts the demand RFO (read for ownership) requests including regular RFOs, locks, ItoM.' What is the difference between an RFO and an ItoM sent by a core? And if it's a matter of partial vs. full cache line, why do they need to be distinguished, and what is the benefit of having separate opcodes?
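
For concreteness, here is a minimal sketch of counting this event with Linux perf_event_open; the raw encoding (event 0xB0, umask 0x04) is the Skylake-era encoding for OFFCORE_REQUESTS.DEMAND_RFO and is an assumption here, so check the event tables for your own model:

    /* Minimal sketch: count OFFCORE_REQUESTS.DEMAND_RFO around a store loop.
     * The raw encoding (event 0xB0, umask 0x04) is the Skylake-era one and is
     * an assumption -- verify it against the event tables for your CPU. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof(attr);
        attr.config = 0x04B0;            /* umask << 8 | event */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        size_t n = (size_t)1 << 24;
        char *buf = malloc(n);

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        memset(buf, 1, n);               /* stores -> demand RFO / ItoM traffic */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("demand RFO-class requests: %llu\n", (unsigned long long)count);
        free(buf);
        return 0;
    }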

3 Replies
McCalpinJohn
Honored Contributor III

In the Scalable Memory Family Uncore Performance Monitoring Reference Guide, ItoM shows up in five sections -- each of which provides a small clue.

  1. Section 2.2.9 CHA Box Derived Events
    • FAST_STR_LLC_[HIT,MISS] is described as being the number of "ItoM (fast string) operations" measured at the LLC (a rep movsb sketch to exercise this appears at the end of this post).
  2. Section 2.2.10 CHA Box Event List
  3. Section 2.5.4 IRP Box Event List
    • Event COHERENT_OPS, Umask PCITOM is "PCIItoM"
  4. Section 3.1.1 Reference for CHA Packet Matching
    • Table 3-1, Opcode 0x248 ItoM "Request Invalidate Line -- Request Exclusive Ownership of cache line"
  5. Section 3.1.2 Reference for UPI LL Packet Matching
    • Table 3-7 InvItoM "Invalidate to M state.  Requests exclusive ownership of a cache line without receiving data and with the intent of performing a writeback soon afterward."

So this event is not like an RFO because it does not request a copy of the cache line.  
It is not exactly like a streaming store because the requesting agent is going to retain the data in a cache (either a processor cache or the specialized IO Directory Cache).

This looks a lot like what any other protocol would use for "upgrade" requests.  E.g., a data cache line was loaded in S state and now you want to write to it.  You don't need to read the data again, but you do need to invalidate the line in any other caches and make sure that any directories (e.g., Snoop Filter, Memory Directory) track the line as M state.  
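
To make the upgrade scenario concrete, here is a hypothetical two-thread sketch (my construction, not from the manual): both threads read the same line so each L1 holds it in S state, then one thread stores to it.  The store requires invalidating the other copy, but not fetching the data again -- exactly the ItoM/InvItoM semantics quoted above.  Whether the core actually emits ItoM rather than a full RFO here is one of the hypotheses to test with the CHA opcode-match events.

    /* Hypothetical S->M upgrade demo: the reader thread and the main thread
     * both load the line (S state in two L1s), then main stores to it.  The
     * store needs the other copy invalidated, but no new data.
     * Compile with -pthread. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static volatile long line[8] __attribute__((aligned(64)));  /* one 64B line */
    static atomic_int phase;

    static void *reader(void *arg)
    {
        (void)arg;
        long v = line[0];                    /* pull the line into this core: S */
        atomic_store(&phase, 1);
        while (atomic_load(&phase) != 2)     /* keep our copy resident */
            ;
        return (void *)(uintptr_t)v;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, reader, NULL);
        while (atomic_load(&phase) != 1)     /* wait until both caches share it */
            ;
        long v = line[0];                    /* our copy is (at best) in S state */
        line[0] = v + 1;                     /* store: S->M upgrade, peer invalidated */
        atomic_store(&phase, 2);
        pthread_join(t, NULL);
        printf("%ld\n", line[0]);
        return 0;
    }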

I don't see any other transactions in Table 3-1 that look like upgrades, but it would be relatively easy to misunderstand what is being presented.   Testing these hypotheses is mostly straightforward, but tedious....
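
The fast-string hypothesis, at least, is easy to start poking at: drive a large rep movsb copy and watch FAST_STR_LLC_[HIT,MISS].  A minimal sketch (the inline asm and the 64 MiB size are my choices; whether the fast-string path is actually taken depends on the implementation and on the fast-strings enable in IA32_MISC_ENABLE):

    /* Sketch: a large rep movsb copy -- the "fast string" operation that the
     * CHA derived events attribute to ItoM.  Whether the destination lines
     * are actually written without being read first is what the uncore
     * counters would confirm or refute. */
    #include <stdlib.h>
    #include <stdint.h>

    static void rep_movsb(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

    int main(void)
    {
        size_t n = (size_t)64 << 20;        /* 64 MiB: well past the LLC */
        char *src = aligned_alloc(64, n);
        char *dst = aligned_alloc(64, n);
        for (size_t i = 0; i < n; i++)
            src[i] = (char)i;

        rep_movsb(dst, src, n);             /* destination lines: ItoM candidates */
        return (uint8_t)dst[n - 1];         /* keep the copy from being optimized out */
    }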

 

Lewis_Kelsey
Beginner

Oh, I see now: it's assumed that the full line is going to be written, so the core doesn't need a copy of the data already in the line, and it already has the data if the line is in any other state (S, E, M). A theoretical StoM (or EtoM) would be the same thing as an RFO from those states; the two differ only starting from I, where the LLC doesn't need to send the data to the core for an ItoM. The name emphasises only the state change.

How the core knows the whole line is going to be written by stores, I don't know. Maybe the L1d can squash a run of sequential senior stores in the MOB all at once while it allocates an LFB, since I thought the RFO is sent immediately upon allocation (and the stores all retire once the data arrives). I guess there is some further time for stores to arrive in the LFB (during the L2 lookup) before the opcode has to be generated.
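
To illustrate the two store patterns I mean (a hypothetical sketch, not something I've measured): writing every byte of each line back-to-back at least gives the fill buffer a chance to see a full-line overwrite, whereas touching one byte per line forces the old contents to be read and merged:

    /* Hypothetical contrast: full-line sequential writes, where the LFB could
     * in principle observe that every byte of the line is overwritten, vs.
     * sparse writes, which must read the old line to preserve the other 63
     * bytes and therefore need a normal RFO. */
    #include <stdlib.h>

    enum { LINE = 64 };

    void full_line_writes(char *buf, size_t bytes)   /* full-line / ItoM candidate */
    {
        for (size_t i = 0; i < bytes; i++)
            buf[i] = (char)i;                        /* every byte written */
    }

    void sparse_writes(char *buf, size_t bytes)      /* must RFO each line */
    {
        for (size_t i = 0; i < bytes; i += LINE)
            buf[i] = (char)i;                        /* 1 of 64 bytes written */
    }

    int main(void)
    {
        size_t bytes = (size_t)64 << 20;
        char *buf = aligned_alloc(LINE, bytes);
        full_line_writes(buf, bytes);
        sparse_writes(buf, bytes);
        return buf[0];
    }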

McCalpinJohn
Honored Contributor III

There are several cases in which the processor knows that the full line is going to be overwritten, but this can depend a lot on the implementation.

  1. 512-bit (aligned) stores write a full cache line, so (unless masking is used) there is never a need to read the previous values in the line.  It may be valuable to keep the newly modified line in some level of the cache hierarchy, so streaming stores are only sometimes the right answer.  (A sketch contrasting the two follows this list.)
  2. Other stores can aggregate to a full cache line before the L1 Data Cache miss is processed.  Such timing-based optimizations will be opportunistic (while the optimization for the 512-bit aligned store can be deterministic), so it would be most likely for 256-bit aligned stores and least likely for 8-bit stores.
  3. The "fast string" operations mentioned in the CHA discussion have the string length available to the HW, so it can optimize for full-line writes.
  4. ARM processors can perform this optimization automagically, using history-based predictors to activate "dynamic read allocate mode".  (ARM also supports non-temporal/streaming stores, but this dynamic mechanism gives most of the advantages of streaming stores with few of the drawbacks.)  ARM processors support many modes of operation, so it can be a bit difficult to find the appropriate documentation.  A good description of this mode is at https://developer.arm.com/documentation/ddi0489/f/memory-system/l1-caches/dynamic-read-allocate-mode
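
As a concrete instance of case 1 (a sketch; assumes AVX-512 hardware and compilation with -mavx512f): a 64-byte-aligned 512-bit store covers exactly one cache line, and the streaming variant differs only in whether the line is retained in the cache afterward.

    /* Case 1 sketch: a 64-byte aligned 512-bit store overwrites a whole cache
     * line, so the old contents never need to be read.  The regular store
     * keeps the line cached; the streaming store does not.  Compile with
     * -mavx512f; AVX-512 hardware assumed. */
    #include <immintrin.h>
    #include <stdlib.h>

    enum { LINE = 64 };

    int main(void)
    {
        size_t bytes = (size_t)64 << 20;
        char *buf = aligned_alloc(LINE, bytes);
        __m512i v = _mm512_set1_epi32(0x5a5a5a5a);

        for (size_t i = 0; i < bytes; i += LINE)
            _mm512_store_si512((__m512i *)(buf + i), v);   /* full line, cached */

        for (size_t i = 0; i < bytes; i += LINE)
            _mm512_stream_si512((__m512i *)(buf + i), v);  /* full line, non-temporal */

        _mm_sfence();    /* order the streaming stores before any later use */
        return buf[0];
    }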