Can someone please explain the difference between:
PCIeWiLF - PCIe Write transfer (non-allocating) (full cache line)
PCIeItoM - PCIe Write transfer (allocating) (full cache line)
Or point me to relevant documentation.
Thanks in advance,
I have also been unable to find documentation on these two sub-events, but I suspect that they are related to Intel's "Data Direct IO" functionality first introduced in the Xeon E5 processor series. With this feature, IO DMA traffic is written directly to the LLC instead of being written to memory, so when a core is interrupted to handle the IO, the data is available at much lower latency and higher bandwidth. This makes the most sense for network interface traffic, where the packets are small enough for memory latency to be a non-trivial overhead and also small enough that you don't need to worry about overflowing the LLC. The "Data Direct IO" feature is enabled by default on Xeon E5 processors.
With this background, the "PCIe Write transfer (allocating) (full cache line)" event seems like a reasonable description of a DMA write with Data Direct IO operational -- it "allocates" into the LLC.
It is less clear how to interpret the other event, "PCIe Write transfer (non-allocating) (full cache line)". I have two ideas:
Of course this is all just speculation, and these events might have nothing to do with DDIO.
It would be delightful if Intel decided to document these features in more detail.
It is unlikely that this issue would be directly addressed by the PCIe specification -- it is a processor uncore counter that says something about the PCIe implementation on the Xeon E5 processor. The PCIe spec says very little about caches or cache lines. There is a "hint" field that can be programmed to "steer" a system-memory transaction toward a processor or cache, but these "hints" are optional, and there is no requirement that a system interpret them in any particular way.
There is more overlap between the PCIe spec and the uncore counters with regard to the "no snoop required" bit, but again the PCIe spec does not require any particular behavior -- a processor is allowed to snoop requests with the "no snoop required" bit set, and a processor is allowed to refrain from snooping requests with the "no snoop required" bit cleared if it can be guaranteed (e.g., by an MTRR attribute) that the transaction is to an address that cannot be cached.
Looking at the PCM code, both are write events: PCI devices writing to memory - application reads from disk/network/PCIe device. I'm guessing that PCIeWiLF results in a write to memory but not a transfer to the CPU's cache. The 'iLF' part of the name is curious... perhaps it means that the data is copied to a LineFill buffer and not copied to cache (if that is possible).
And I'm guessing that PCIeItoM results in a write to memory and copying the line to cache. This is along the lines of Dr. McCalpin's answer above.
I'll see if I can find someone who knows more definitively. Is your question just because you are curious or is it related to a problem you are trying to solve?
I was thinking that PCIeWiLF could represent, for example, a write to non-cacheable memory, perhaps by a display driver. AFAIK, primitive 3D data such as vertices that will be used only once will not be cached.
First of all thanks for the answers.
I probably need to tell more about my setup...
I have two servers connected through IB (Mellanox HCAs).
I've tried to test two different scenarios:
1) RDMA write to RAM --> causes PCIeItoM counter increase
2) RDMA write directly to Prefetchable BAR on other PCIe device --> causes PCIeWiLF counter increase
So your explanations seem in line with my results. A write to RAM goes to cache, and a write to another PCIe device skips the cache (and goes to a LineFill buffer???)
I had not considered peer-to-peer operations when I was thinking about how to interpret these two events.
Since these are LLC CBo events, I think it still makes sense to interpret the word "allocating" to mean that the PCIe write was placed in the cache. Intel's DDIO documentation says that by default all DMA writes to memory will be written to the L3 cache, so your first results (RDMA writes to DRAM) increment the "allocating" counter as expected.
The second experiment (RDMA write to prefetchable BAR on another PCIe device) is not writing to system memory, so it should not be allocated in the L3 cache, and your results show that this case increments the "non-allocating" counter.
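For anyone repeating these two experiments, here is a minimal sketch of how the counter deltas might be turned into bandwidth and an allocating share. It assumes each event counts one full 64-byte cache line (as the "full cache line" wording suggests, but I have not seen this confirmed), and the function name and dictionary keys are made up for illustration, not part of PCM.

```python
CACHE_LINE_BYTES = 64  # assumption: one event == one full 64-byte cache line


def summarize_pcie_writes(itom_delta, wilf_delta, seconds):
    """Convert PCIeItoM / PCIeWiLF counter deltas into bytes/s and an allocating share.

    itom_delta: increase in the PCIeItoM (allocating) count over the interval
    wilf_delta: increase in the PCIeWiLF (non-allocating) count over the interval
    seconds:    length of the measurement interval
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    total = itom_delta + wilf_delta
    return {
        "allocating_bytes_per_s": itom_delta * CACHE_LINE_BYTES / seconds,
        "non_allocating_bytes_per_s": wilf_delta * CACHE_LINE_BYTES / seconds,
        # Share of full-line PCIe writes that were allocated into the LLC.
        "allocating_fraction": itom_delta / total if total else 0.0,
    }
```

Under the interpretation above, an RDMA write to DRAM should give an allocating fraction near 1, and a peer-to-peer write to a BAR should give a fraction near 0.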
It is not immediately obvious why the L3 CBo should even take note of peer-to-peer PCIe write transactions. These transactions clearly pass by on the ring, but in general the PCIe BAR address ranges cannot be cached, so there will be no need to invalidate any lines in the L3 on writes to those address ranges.
Hypothesis #1: Maybe the CBo counts these events just because it can. The DDIO functionality means that it has to be able to cache PCIe DMA writes to system memory, so being able to count PCIe DMA writes that it does not have to cache is an obvious extension. Otherwise you would have to count PCIe DMA writes at the R2PCIe agent and subtract off the allocating writes from the CBo to get a count of the non-allocating writes, and that seems fairly inconvenient.
Hypothesis #2: Maybe the CBo counts these events because there are some circumstances in which the hardware can support (limited) caching of PCIe BAR address ranges (probably not with the Write-Back memory type, but Write-Through and Write-Protect seem plausible), so it might as well count transactions that could (if the MTRRs were different) require L3 tag access to invalidate cached copies of those lines.
If your setup contains multiple sockets, it would be interesting to see if the behavior is different when doing PCIe RDMA writes to DRAM buffers allocated on the socket with the IB card attached versus RDMA writes to DRAM buffers allocated on the other socket. It would also be interesting to see how the counts change when the PCIe peer-to-peer DMA is same-socket vs cross-socket.
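To label such runs as same-socket or cross-socket, one option on Linux is to read the NUMA node that the kernel reports for the HCA from sysfs. The sketch below assumes one NUMA node per socket (no sub-NUMA clustering) and that you already know which node the DMA buffer was allocated on; the helper names are hypothetical.

```python
from pathlib import Path


def pci_numa_node(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node Linux reports for a PCI device, or None if unknown.

    bdf is the device address as shown by lspci -D, e.g. "0000:03:00.0".
    The kernel writes -1 to numa_node when it has no affinity information.
    """
    text = (Path(sysfs_root) / bdf / "numa_node").read_text().strip()
    node = int(text)
    return None if node < 0 else node


def placement(device_node, buffer_node):
    """Classify a DMA target buffer relative to the device's node."""
    if device_node is None:
        return "unknown"
    # Assumption: one NUMA node per socket.
    return "same-socket" if device_node == buffer_node else "cross-socket"
```

The buffer's node is whatever the test program requested (for example via numactl or libnuma); this sketch does not try to discover it.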
Regarding Dr. Bandwidth's quote below, has anybody figured out which one is true, the former or the latter?
McCalpin, John (Blackbelt) wrote:
The Intel documentation (both the reference above and other docs) says "Currently, Intel DDIO affects only local sockets". This might mean that the data is always put in the LLC of the socket to which the IO device is attached, or it might mean that the data is put in the LLC of the socket to which the device is attached *if* that socket is the "home" for the addresses being used. In the latter case, PCIe DMA writes to addresses "homed" on the remote socket would be written to memory (on the remote socket) and not put in any LLC.
The "PCIeWiLF" transaction name is of the same form as the "WCiLF" transaction in Table 3-1 of the Xeon Scalable Memory Family Uncore Performance Monitoring Reference Manual (document 336274). The "WCiLF" transaction is parsed as:
This is the most common transaction generated by streaming stores -- it writes the full cache line to memory, and invalidates any copies of that line in any caches in the system (even if a cache has the line in a modified/dirty state).
If the "PCIeWiLF" name has the same meaning, then this is a full-cacheline write to the target address, with the side effect of invalidating the target address in all caches. The invalidation part of the transaction may or may not actually do anything. IO writes to addresses that might be cacheable need to be globally invalidated. To determine the potential for caching of addresses targeted by IO, the agent can only look at the MTRRs. If the address is mapped to an MTRR type of UC, then it knows the address cannot be cached, and no global invalidate is required. On the other hand, if the address is mapped to WT or WP, then caching is allowed. MTRR type WB is not allowed for MMIO ranges. This leaves type WC, which I think is the most common memory type for "prefetchable MMIO" memory. Type WC does not allow caching of reads, but it does allow speculative reads. It probably does not require global invalidates, but an implementation might have some special characteristics (e.g., the "streaming loads" provided by the MOVNTDQA instruction) that make broadcast invalidations necessary.
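The reasoning above can be written down as a small lookup. This is only a restatement of my guesses, not documented hardware behavior; the function name is made up, and the WB entry reflects ordinary cacheable system memory (WB is not allowed for MMIO ranges).

```python
def needs_global_invalidate(mtrr_type):
    """Guess whether an IO write to a range of this MTRR type needs a global invalidate.

    Returns True, False, or None when the answer is implementation-dependent.
    This encodes the speculation in the post, not a documented rule.
    """
    table = {
        "UC": False,  # cannot be cached, so nothing to invalidate
        "WT": True,   # caching allowed, so stale copies may exist
        "WP": True,   # caching allowed, so stale copies may exist
        "WB": True,   # ordinary cacheable system memory (not allowed for MMIO)
        "WC": None,   # probably not required, but an implementation may differ
    }
    key = mtrr_type.upper()
    if key not in table:
        raise ValueError(f"unknown MTRR type: {mtrr_type!r}")
    return table[key]
```

The None entry for WC is deliberate: it marks the case where I cannot tell whether the invalidation is needed.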
I think the memory ranges used for device DMA are cacheable, and are used by DDIO.
I still don't understand one thing: when an IO device DMAs to or from the physical memory of a NUMA node remote to the device, where is the DMA terminated (for reads and for writes)? Is it in the LLC of the node local to the IO device, or in the physical memory of the remote node? I assume that an IO device can't DMA directly to the LLC on the remote node.
The documentation does not seem clear to me, but it should be easy to test? It might take some guessing to figure out which paths are taken for DMA writes to a target buffer in system memory on a non-local socket, but with uncore performance counter coverage on both sockets it should be straightforward to figure this out.
The comments above on measurement with two nodes connected by InfiniBand don't say anything about multi-socket NUMA.
According to my observations, the device's so-called remote reads are served from the remote memory: the remote memory controller shows the same read throughput as the device receives. But one thing is strange.
If the buffer is in the remote LLC (the CPU just wrote it) and the device now does a remote read (from remote memory) of this buffer, then the buffer first has to be written back from the LLC to memory so that the device can read the most recent version from memory? Otherwise we would have a coherency problem, right?
So the strange thing is that I only see reads, no writes, and yet the data is in sync. How the remote LLC is synced with memory is not clear to me. What am I missing here?
Dirty data does not have to be written to memory on a remote access. IO reads from cacheable space can use the "Read Current" transaction to read the cache line without changing its state. If you are seeing DRAM reads at the same time, these are probably speculative reads launched before the LLC has been probed (to keep latency down for the common case of missing in the LLC). If the line is found in a dirty state in the LLC (M or F), the data from DRAM is silently discarded. It is less clear what transactions will be used if a core initiates a read from a cacheable PCIe BAR.
The "final" disposition of these lines can be tricky. Recall that DMA writes will overwrite M state lines (without evicting them), so a buffer could be updated by IO and then read by IO repeatedly without ever needing to write to memory.
There are more special cases here than I can think up on short notice -- cacheable MMIO requires WP or WT memory types, neither of which allow dirty data in the processor caches (L1+L2). Allowing such lines to be dirty in the LLC is possible, but the details of the resulting protocol might not be easy to guess in advance.
McCalpin, John (Blackbelt) wrote:
DMA writes will overwrite M state lines (without evicting them), so a buffer could be updated by IO and then read by IO repeatedly without ever needing to write to memory.
I think this happens only if the DMA write is local to the IO device. When a DMA write is destined for remote memory, I see writes on the remote memory controller. In the case of a single TCP stream, the memory controller bandwidth is 3x the NIC bandwidth. I suspect these are (1) the DMA write to memory, (2) the kernel's read, which brings the buffer into the local LLC, and (3) the copy-to-user, which is eventually written back to memory. Does that make sense?
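As a sanity check on the 3x figure, here is the arithmetic behind my guess as a small sketch. It assumes every received byte crosses the memory controller exactly once for each of the three suspected components, which is my hypothesis and not something I have confirmed; the names are made up.

```python
# Suspected components of DRAM traffic for one received TCP stream (hypothesis).
SUSPECTED_COMPONENTS = (
    "dma_write_to_memory",     # (1) NIC DMA write lands in DRAM
    "kernel_read",             # (2) kernel reads the buffer into the local LLC
    "copy_to_user_writeback",  # (3) user copy is eventually written back
)


def expected_dram_traffic(nic_bytes_per_s, components=SUSPECTED_COMPONENTS):
    """Return per-component and total DRAM bytes/s, one pass per component."""
    per_component = {name: nic_bytes_per_s for name in components}
    return per_component, sum(per_component.values())


# Example: a 10 Gb/s stream is 1.25e9 bytes/s on the wire.
breakdown, total = expected_dram_traffic(1.25e9)
ratio = total / 1.25e9  # memory controller traffic relative to NIC bandwidth
```

If the measured ratio is clearly different from 3, then one of the three components is wrong or some of the traffic is being served from the LLC.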
Would it be fair to say that Intel's statement "DDIO doesn't work remotely" actually means that remote writes do go to remote memory, and that remote reads require memory reads, which makes them less efficient?
Your data certainly suggests that IO DMA writes to system addresses on the other socket will write to memory and not to either the remote or local LLC. This is a reasonable implementation choice -- if you really care about latency, you will want the core handling the interrupts to be in the socket where the IO device is physically attached.
I am not sure about the expected bulk DRAM traffic...