PCIe performance counters

Roman_V_ · ‎06-25-2014

Hi All,

Can someone please explain the difference between:

PCIeWiLF - PCIe Write transfer (non-allocating) (full cache line)
PCIeItoM - PCIe Write transfer (allocating) (full cache line)

Or point me to relevant documentation.

Thanks in advance,

Roman

McCalpinJohn · ‎06-25-2014

I have also been unable to find documentation on these two sub-events, but I suspect that they are related to Intel's "Data Direct IO" functionality first introduced in the Xeon E5 processor series. With this feature, IO DMA traffic is written directly to the LLC instead of being written to memory, so when a core is interrupted to handle the IO, the data is available at much lower latency and higher bandwidth. This makes the most sense for network interface traffic, where the packets are small enough for memory latency to be a non-trivial overhead and also small enough that you don't need to worry about overflowing the LLC. The "Data Direct IO" feature is enabled by default on Xeon E5 processors.

Reference: http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf

With this background, the "PCIe Write transfer (allocating) (full cache line)" event seems like a reasonable description of a DMA write with Data Direct IO operational -- it "allocates" into the LLC.

It is less clear how to interpret the other event, "PCIe Write transfer (non-allocating) (full cache line)". I have two ideas:

The Intel documentation (both the reference above and other docs) says "Currently, Intel DDIO affects only local sockets". This might mean that the data is always put in the LLC of the socket to which the IO device is attached, or it might mean that the data is put in the LLC of the socket to which the device is attached *if* that socket is the "home" for the addresses being used. In the latter case, PCIe DMA writes to addresses "homed" on the remote socket would be written to memory (on the remote socket) and not put in any LLC.
Alternatively, it is possible that there is a mechanism that could be used to disable DDIO for certain PCIe devices. One could imagine that extremely large block IO (for a Lustre filesystem operating over InfiniBand, for example) might displace too much useful data from the LLC to make DDIO appropriate for those transactions. If such a mechanism exists (and I have seen no documentation on this topic), the corresponding PCIe DMA write transactions would fit in the "non-allocating" category.

Of course this is all just speculation, and these events might have nothing to do with DDIO.

It would be delightful if Intel decided to document these features in more detail.

shewan · ‎06-19-2021

Hi McCalpinJohn

One thing occurred to me: is it possible to disable DDIO for certain ranges of memory by setting the memory type of those ranges to uncacheable (using, for example, the MTRRs)?

Bernard · ‎06-26-2014

Maybe the PCIe specification can be helpful in your case?

http://komposter.com.ua/documents/PCI_Express_Base_Specification_Revision_3.0.pdf

McCalpinJohn · ‎06-26-2014

It is unlikely that this issue would be directly addressed by the PCIe specification -- it is a processor uncore counter that says something about the PCIe implementation on the Xeon E5 processor. The PCIe spec says very little about caches or cache lines, though there is a "hint" field that can be programmed to "steer" a system-memory transaction toward a processor or cache, but these "hints" are optional, and there is no requirement that a system interpret them in any particular way.

There is more overlap between the PCIe spec and the uncore counters with regard to the "no snoop required" bit, but again the PCIe spec does not require any particular behavior -- a processor is allowed to snoop requests with the "no snoop required" bit set, and a processor is allowed to refrain from snooping requests with the "no snoop required" bit cleared if it can be guaranteed (e.g., by an MTRR attribute) that the transaction is to an address that cannot be cached.

Bernard · ‎06-26-2014

You are right. This question will not be addressed by the PCIe specification. I simply did not pay enough attention to the thread title.

Patrick_F_Intel1 · ‎06-26-2014

Hello Roman,

Looking at the PCM code, both are write events: PCI devices writing to memory - application reads from disk/network/PCIe device. I'm guessing that PCIeWiLF results in a write to memory but not a transfer to the CPU's cache. The 'iLF' part of the name is curious... perhaps it means that the data is copied to a LineFill buffer and not copied to cache (if that is possible).

And I'm guessing that PCIeItoM results in a write to memory and copying the line to cache. This is along the lines of Dr. McCalpin's answer above.

I'll see if I can find someone who knows more definitively. Is your question just because you are curious or is it related to a problem you are trying to solve?

Pat

Bernard · ‎06-27-2014

I was thinking that PCIeWiLF could represent, for example, a write to non-cacheable memory, perhaps by a display driver. AFAIK, primitive 3D data such as vertices that will be used only once will not be cached.

Roman_V_ · ‎06-29-2014

Hi All,

First of all thanks for the answers.

I probably need to tell you more about my setup.

I have two servers connected through IB (Mellanox HCAs).

I've tried to test two different scenarios:

1) RDMA write to RAM --> causes PCIeItoM counter increase

2) RDMA write directly to Prefetchable BAR on other PCIe device --> causes PCIeWiLF counter increase

So your explanations seem in line with my results: a write to RAM goes to the cache, while a write to another PCIe device skips the cache (and goes to a LineFill buffer???).

Thanks again,

Roman

McCalpinJohn · ‎06-30-2014

I had not considered peer-to-peer operations when I was thinking about how to interpret these two events.

Since these are LLC CBo events, I think it still makes sense to interpret the word "allocating" to mean that the PCIe write was placed in the cache. Intel's DDIO documentation says that by default all DMA writes to memory will be written to the L3 cache, so your first results (RDMA writes to DRAM) increment the "allocating" counter as expected.

The second experiment (RDMA write to prefetchable BAR on another PCIe device) is not writing to system memory, so it should not be allocated in the L3 cache, and your results show that this case increments the "non-allocating" counter.

It is not immediately obvious why the L3 CBo should even take note of peer-to-peer PCIe write transactions. These transactions clearly pass by on the ring, but in general the PCIe BAR address ranges cannot be cached, so there will be no need to invalidate any lines in the L3 on writes to those address ranges.

Hypothesis #1: Maybe the CBo counts these events just because it can. The DDIO functionality means that it has to be able to cache PCIe DMA writes to system memory, so being able to count PCIe DMA writes that it does not have to cache is an obvious extension. Otherwise you would have to count PCIe DMA writes at the R2PCIe agent and subtract off the allocating writes from the CBo to get a count of the non-allocating writes, and that seems fairly inconvenient.

Hypothesis #2: Maybe the CBo counts these events because there are some circumstances in which the hardware can support (limited) caching of PCIe BAR address ranges (probably not with the Write-Back memory type, but Write-Through and Write-Protect seem plausible), so it might as well count transactions that could (if the MTRRs were different) require L3 tag access to invalidate cached copies of those lines.

If your setup contains multiple sockets, it would be interesting to see if the behavior is different when doing PCIe RDMA writes to DRAM buffers allocated on the socket with the IB card attached versus RDMA writes to DRAM buffers allocated on the other socket. It would also be interesting to see how the counts change when the PCIe peer-to-peer DMA is same-socket vs cross-socket.
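For the same-socket versus cross-socket comparison suggested above, the first step is knowing which socket the IB card is attached to. The following is a minimal sketch, assuming a Linux system that exposes the `numa_node` attribute for PCI devices in sysfs; the function name `pci_numa_node` and the device address in the usage note are hypothetical.

```python
from pathlib import Path


def pci_numa_node(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node a PCI device is attached to, or None if unknown.

    bdf is the device address in domain:bus:device.function form.
    sysfs_root is a parameter only so the function can be tested
    against a fake directory tree.
    """
    path = Path(sysfs_root) / bdf / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # attribute missing or unreadable
    # The kernel reports -1 when it has no NUMA information for the device.
    return node if node >= 0 else None
```

With the node known (for example from `pci_numa_node("0000:81:00.0")`), the target buffer can be placed on the same node or on the other node (for instance with `numactl --membind`), and the PCIeItoM and PCIeWiLF counts compared between the two runs.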

nlnnfn__Alex · ‎08-14-2019

Hello,

Regarding Dr. Bandwidth's quote below, has anybody figured out which one is true, the former or the latter?

Thanks

McCalpin, John (Blackbelt) wrote:
The Intel documentation (both the reference above and other docs) says "Currently, Intel DDIO affects only local sockets". This might mean that the data is always put in the LLC of the socket to which the IO device is attached, or it might mean that the data is put in the LLC of the socket to which the device is attached *if* that socket is the "home" for the addresses being used. In the latter case, PCIe DMA writes to addresses "homed" on the remote socket would be written to memory (on the remote socket) and not put in any LLC.

McCalpinJohn · ‎08-15-2019

The "PCIeWiLF" transaction name is of the same form as the "WCiLF" transaction in Table 3-1 of the Xeon Scalable Memory Family Uncore Performance Monitoring Reference Manual (document 336274). The "WCiLF" transaction is parsed as:

"WC" = Write Combining transaction
"i" = the final state of the target address in system caches is "invalid"
"L" = this is an operation on a cache line
"F" = this is an operation on a full cache line

This is the most common transaction generated by streaming stores -- it writes the full cache line to memory, and invalidates any copies of that line in any caches in the system (even if a cache has the line in a modified/dirty state).
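The naming scheme above can be expressed as a small decoder. This is only a sketch of the parsing described in this post: the suffix table is copied from the "WCiLF" breakdown, and applying the same rule to "PCIeWiLF" is an assumption, not something the documentation confirms. The function name `parse_opcode` is hypothetical.

```python
# Meanings of the three suffix letters, as given for "WCiLF" above.
SUFFIX_MEANING = {
    "i": "final state of the target address in system caches is invalid",
    "L": "operation on a cache line",
    "F": "operation on a full cache line",
}


def parse_opcode(name, suffix_len=3):
    """Split a transaction name into (prefix, list of suffix meanings).

    Letters that are not in SUFFIX_MEANING are reported as "unknown"
    rather than guessed.
    """
    if len(name) <= suffix_len:
        raise ValueError("name too short to carry a suffix")
    prefix, suffix = name[:-suffix_len], name[-suffix_len:]
    meanings = [SUFFIX_MEANING.get(ch, "unknown") for ch in suffix]
    return prefix, meanings
```

Under this reading, `parse_opcode("WCiLF")` yields the prefix "WC" and `parse_opcode("PCIeWiLF")` yields the prefix "PCIeW", both with the same three suffix meanings.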

If the "PCIeWiLF" name has the same meaning, then this is a full-cacheline write to the target address, with the side effect of invalidating the target address in all caches. The invalidation part of the transaction may or may not actually do anything. IO writes to addresses that might be cacheable require a global invalidation. To determine the potential for caching of addresses targeted by IO, the agent can only look at the MTRRs. If the address is mapped to an MTRR type of UC, then it knows the address cannot be cached, and no global invalidate is required. On the other hand, if the address is mapped to WT or WP, then caching is allowed. MTRR type WB is not allowed for MMIO ranges. This leaves type WC, which I think is the most common memory type for "prefetchable MMIO" memory. Type WC does not allow caching of reads, but it does allow speculative reads. It probably does not require global invalidates, but an implementation might have some special characteristics (e.g., the "streaming loads" provided by the MOVNTDQA instruction) that make broadcast invalidations necessary.
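The MTRR reasoning above can be summarized as a lookup. This is a sketch of the argument in this post, not of any documented hardware behavior; the function name is hypothetical, and the `None` result for WC reflects that the post leaves that case open.

```python
def mmio_write_needs_global_invalidate(mtrr_type):
    """Does an IO write to an MMIO address of this MTRR type need a
    global invalidate, following the reasoning in the post?

    Returns True, False, or None when the post leaves the answer open.
    """
    if mtrr_type == "WB":
        # The post states that WB is not allowed for MMIO ranges.
        raise ValueError("WB is not a valid memory type for MMIO ranges")
    table = {
        "UC": False,  # cannot be cached, so nothing to invalidate
        "WT": True,   # caching allowed, cached copies must be invalidated
        "WP": True,   # caching allowed, cached copies must be invalidated
        "WC": None,   # probably not required, but implementation dependent
    }
    if mtrr_type not in table:
        raise ValueError(f"unknown MTRR type: {mtrr_type}")
    return table[mtrr_type]
```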

nlnnfn__Alex · ‎08-15-2019

I think the memory ranges used for device DMA are cacheable, and are used by DDIO.

I still don't understand one thing: when an IO device does DMA to or from physical memory on the NUMA node remote to the device, where is the DMA terminated (for reads and for writes)? Is it in the LLC of the node local to the IO device, or in the physical memory of the remote node? I assume that an IO device can't DMA directly to the LLC on the remote node.

McCalpinJohn · ‎08-16-2019

The documentation does not seem clear to me, but it should be easy to test? It might take some guessing to figure out which paths are taken for DMA writes to a target buffer in system memory on a non-local socket, but with uncore performance counter coverage on both sockets it should be straightforward to figure this out.
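One way such a test could be evaluated is sketched below. Everything here is an assumption for illustration: the function name, the inputs (a known number of DMA bytes, the PCIeItoM count from the CBos on the device's socket, and the DRAM write bytes from the memory controller on the other socket), and the 20% tolerance. The only hardware fact used is the 64-byte cache line.

```python
CACHE_LINE_BYTES = 64  # cache line size on these processors


def dma_write_destination(dma_bytes, local_itom_count,
                          remote_dram_write_bytes, tolerance=0.2):
    """Guess where DMA writes landed from counter deltas.

    Compares the bytes accounted for by allocating writes on the local
    socket and by DRAM writes on the remote socket against the number
    of bytes the device is known to have written.
    """
    if dma_bytes <= 0:
        raise ValueError("dma_bytes must be positive")
    llc_frac = local_itom_count * CACHE_LINE_BYTES / dma_bytes
    dram_frac = remote_dram_write_bytes / dma_bytes
    if llc_frac >= 1 - tolerance and dram_frac < tolerance:
        return "local LLC"
    if dram_frac >= 1 - tolerance and llc_frac < tolerance:
        return "remote DRAM"
    return "inconclusive"
```

Background traffic on either socket will inflate both counters, so the deltas should be taken on an otherwise idle system and the tolerance treated as a tunable guess.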

The comments above on measurement with two nodes connected by InfiniBand don't say anything about multi-socket NUMA.

nlnnfn__Alex · ‎08-16-2019

According to my observations, the device's remote reads are served from the remote memory: the remote memory controller shows the same read throughput as the device receives. But one thing is strange.

If the buffer is in the remote LLC (the CPU just wrote it) and the device now does a remote read of this buffer (from remote memory), then the buffer first has to be written from the LLC to memory so that the device can read the most recent version from memory. Otherwise we would have a coherency problem, right?

So the strange thing is that I see only reads, no writes, and yet the data is in sync. How the remote LLC is synced with memory is not clear to me. What am I missing here?

McCalpinJohn · ‎08-19-2019

Dirty data does not have to be written to memory on a remote access. IO reads from cacheable space can use the "Read Current" transaction to read the cache line without changing its state. If you are seeing DRAM reads at the same time, these are probably speculative reads launched before the LLC has been probed (to keep latency down for the common case of missing in the LLC). If the line is found in a dirty state in the LLC (M or F), the data from DRAM is silently discarded. It is less clear what transactions will be used if a core initiates a read from a cacheable PCIe BAR.

The "final" disposition of these lines can be tricky. Recall that DMA writes will overwrite M state lines (without evicting them), so a buffer could be updated by IO and then read by IO repeatedly without ever needing to write to memory.

There are more special cases here than I can think up on short notice -- cacheable MMIO requires WP or WT memory types, neither of which allow dirty data in the processor caches (L1+L2). Allowing such lines to be dirty in the LLC is possible, but the details of the resulting protocol might not be easy to guess in advance.

nlnnfn__Alex · ‎08-20-2019

McCalpin, John (Blackbelt) wrote:
DMA writes will overwrite M state lines (without evicting them), so a buffer could be updated by IO and then read by IO repeatedly without ever needing to write to memory.

I think this happens only if the DMA write is local to the IO device. When the DMA write is destined for remote memory, I see writes on the remote memory controller. In the case of a single TCP stream, the memory controller bandwidth is 3x the NIC bandwidth. I suspect these are (1) the DMA write to memory, (2) the kernel's read, which brings the buffer into the local LLC, and (3) the copy-to-user, which is eventually written back to memory. Does that make sense?

Would it be fair to say that Intel's statement "DDIO doesn't work remotely" actually means that remote writes do go to remote memory, and that remote reads require memory reads, which makes remote reads less efficient?

McCalpinJohn · ‎08-20-2019

Your data certainly suggests that IO DMA writes to system addresses on the other socket will write to memory and not to either the remote or local LLC. This is a reasonable implementation choice -- if you really care about latency, you will want the core handling the interrupts to be in the socket where the IO device is physically attached.

I am not sure about the expected bulk DRAM traffic...

The IO DMA should write the data to DRAM (without requiring a read).
The kernel read will have to read the data from DRAM.
The kernel has several versions of "copy_to_user()", with different behaviors....
- If the user has a buffer that is already in cache, the kernel copy-to-user would simply update the cached copy.
  - If the user buffer is re-used (for multiple kernel copy_to_user() calls), it could be updated multiple times before being written back to memory.
  - If it is large or only used once, the buffer would then eventually be written back to memory, giving a total of 2 DRAM writes and 1 DRAM read for each data element.
- If the kernel uses a non-temporal version of "copy_to_user()", this would write directly to DRAM, but then the user process would have to read back from DRAM, giving a total of 4 DRAM accesses (2 Writes + 2 Reads).
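The accounting in the list above can be written out as a short tally. This is a sketch of the arithmetic only; the function name and the three case labels are hypothetical, and the cached cases assume (as the list does) that the user buffer is already in cache, so no extra DRAM read is needed to fetch it.

```python
def dram_accesses_per_element(copy_kind):
    """Return (writes, reads) to DRAM per data element for a DMA write
    to remote memory followed by a kernel copy to user space.

    copy_kind is one of:
      "cached_reused"  user buffer stays in cache across many copies
      "cached_once"    user buffer is large or used once, then written back
      "non_temporal"   kernel copy uses non-temporal (streaming) stores
    """
    writes, reads = 1, 1  # IO DMA write to DRAM, then kernel read from DRAM
    if copy_kind == "cached_reused":
        pass              # copy updates the cached buffer, no DRAM traffic
    elif copy_kind == "cached_once":
        writes += 1       # dirty user buffer is eventually written back
    elif copy_kind == "non_temporal":
        writes += 1       # streaming store goes directly to DRAM
        reads += 1        # user process must read the data back from DRAM
    else:
        raise ValueError(f"unknown copy_kind: {copy_kind}")
    return writes, reads
```

The "cached_once" case gives 2 writes plus 1 read, three DRAM accesses per element, which is consistent with the memory controller bandwidth of about three times the NIC bandwidth reported earlier in the thread.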