
Monitoring PCIe Data for Xeon E5-2600

funksoulbr
Beginner

Hello all,

I am monitoring PCIe data on a Xeon E5-2600 machine according to Table 2-15 of the Uncore Performance Monitoring Guide.

I wonder what kinds of transactions this actually monitors. I ran some tests in which I generated a lot of MMIO transactions to a PCIe device (on Linux: ioremap the device's BAR and access it with readl/writel), but they do NOT show up in the scenario described above.
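For reference, the access pattern I am generating looks roughly like the following (a minimal kernel-space sketch; pdev stands for the already-probed pci_dev of my device, and BAR 0 plus the register offset are just placeholders):

/* Minimal sketch of the MMIO traffic I am generating (kernel driver context).
 * "pdev" is the already-probed struct pci_dev; BAR 0 and register offset 0
 * are placeholders -- in a real test, pick a safe scratch register. */
#include <linux/pci.h>
#include <linux/io.h>

static void generate_mmio_traffic(struct pci_dev *pdev)
{
    resource_size_t bar_start = pci_resource_start(pdev, 0);
    resource_size_t bar_len   = pci_resource_len(pdev, 0);
    void __iomem *regs = ioremap(bar_start, bar_len);  /* default UC mapping */
    u32 val = 0;
    int i;

    if (!regs)
        return;

    for (i = 0; i < 100000; i++) {
        writel(0x12345678, regs);   /* MMIO write to the device BAR */
        val = readl(regs);          /* MMIO read from the device BAR */
    }
    (void)val;
    iounmap(regs);
}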

I can see these accesses with the OFFCORE_RESPONSE counters, but why there and not when filtering for PCIe packets as in the scenario above? This feels a bit unintuitive.

Cheers!

McCalpinJohn
Honored Contributor III

I have not had a chance to do any experiments with this yet, but I think that you are not seeing the MMIO transactions with the CBo performance monitors because your transactions are not of a type that the filter is set to count.

I am assuming that you are monitoring PCIE_DATA_BYTES from Table 2-15.  This is defined as
               (TOR_INSERTS.OPCODE with Cn_MSR_PMON_BOX_FILTER.opc=0x194
              + TOR_INSERTS.OPCODE with Cn_MSR_PMON_BOX_FILTER.opc=0x19C) * 64

Going back to Table 2-13, we see that filter opcode 0x194 corresponds to "PCIe Write (non-allocating)" while 0x19C corresponds to "PCIe Write (allocating)". Looking over the rest of the items in Table 2-13, note that this metric does not include any PCIe read operations, nor does it include the PCIe Non-Snoop Write operations.

As a potentially important matter of definitions: it is not exactly clear how an MMIO read initiated by the processor is interpreted here -- it does retrieve data from the PCIe device, but that is not really the same as a write initiated by the PCIe device. Similarly, an MMIO write initiated by the processor does transfer data to the PCIe device, but it is not really the same as a read initiated by the PCIe device. It is possible that you need the PCIe device to *initiate* these transactions to get them counted by any of the counters labelled PCIe. Or not. Comments from Intel (or anyone else with low-level PCIe IO experience) would be welcome here!

Back to the main story:  The PCIe MMIO space pointed to by the BARs corresponds to physical addresses in the "IO memory hole" of the system, which has a default mapping of Uncached -- either via an explicit MTRR, or more commonly by setting the default mapping for regions not mapped by an MTRR to uncached.   Some sub-regions of the IO hole address range will have an MTRR of WC (Write Combining), but that is also an uncached memory type. 

So when you map a BAR and perform reads and writes, you are almost certainly generating transactions that are known by the hardware to be uncacheable -- so snoops are not required, and the transactions (if counted at all) are counted by slightly different filter events. You may be able to see your MMIO transactions if you switch the filter opcodes to 0x1E4 ("PCIe Non-Snoop Read") for reads and 0x1E5 ("PCIe Non-Snoop Write (partial)") for writes. I don't think that you will be able to generate any 0x1E6 transactions ("PCIe Non-Snoop Write (full)") with this approach, but it is easy enough to check.
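If you want to try that check by hand, the following is a rough user-space sketch of programming one CBo counter through the Linux msr driver (modprobe msr, run as root). The MSR addresses and bit layouts -- CBo 0 counter control at 0xD10, box filter at 0xD14, counter at 0xD16, opcode in filter bits 31:23, TOR_INSERTS = event 0x35 with the OPCODE umask 0x1 -- are my reading of the uncore guide and the Linux uncore driver, so double-check them against the tables before trusting the numbers:

/* Sketch: count TOR_INSERTS.OPCODE on CBo 0 with a PCIe opcode filter, using
 * the Linux msr driver.  All MSR addresses and bit positions below are my
 * reading of the E5-2600 uncore guide -- verify before relying on them.
 * Depending on prior state you may also need to un-freeze the box via its
 * box control MSR (0xD04 for CBo 0). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define C0_PMON_CTL0    0xD10   /* assumed: CBo 0 counter-0 control */
#define C0_PMON_FILTER  0xD14   /* assumed: CBo 0 box filter        */
#define C0_PMON_CTR0    0xD16   /* assumed: CBo 0 counter-0 value   */
#define OPC             0x1E5   /* "PCIe Non-Snoop Write (partial)"; try 0x194, 0x19C, 0x1E4, ... */

static void wrmsr(int fd, uint32_t msr, uint64_t val)
{
    if (pwrite(fd, &val, sizeof(val), msr) != sizeof(val))
        perror("wrmsr");
}

static uint64_t rdmsr(int fd, uint32_t msr)
{
    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), msr) != sizeof(val))
        perror("rdmsr");
    return val;
}

int main(void)
{
    /* CPU 0 is assumed to sit on the socket whose CBos we want to program. */
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* Filter: opcode in bits 31:23 (assumed field position). */
    wrmsr(fd, C0_PMON_FILTER, (uint64_t)OPC << 23);

    /* Control: event 0x35 (TOR_INSERTS), umask 0x1 (OPCODE), enable (bit 22). */
    wrmsr(fd, C0_PMON_CTL0, 0x35ULL | (0x1ULL << 8) | (1ULL << 22));

    uint64_t before = rdmsr(fd, C0_PMON_CTR0);
    sleep(10);                       /* run the MMIO / netperf test here */
    uint64_t after  = rdmsr(fd, C0_PMON_CTR0);

    printf("TOR_INSERTS.OPCODE(0x%x) on CBo 0: %llu\n",
           OPC, (unsigned long long)(after - before));
    close(fd);
    return 0;
}

Repeat for each CBo (the next box's MSRs are 0x20 higher, if I remember the spacing correctly), sum the counts, and multiply by 64 to convert cache lines to bytes.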

If neither of these approaches works, then you will need to measure a test case in which you know that the PCIe device is initiating transactions. A simple test would be measuring the events before and after reading a large file. To avoid file-system caching effects you can either manually drop the caches before the test, or simply choose a file that you know has not been read since the system booted.
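If it is more convenient than dropping the caches, reading the file with O_DIRECT also keeps the page cache out of the picture, so the data has to come across PCIe from the device on every run (a rough sketch; the file path is just a placeholder):

/* Sketch: read a large file with O_DIRECT so the data must be fetched from
 * the device (DMA) rather than served from the page cache.
 * The file path is a placeholder. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t bufsize = 1 << 20;   /* 1 MiB per read, a multiple of 4 KiB */
    void *buf;
    ssize_t n;
    long long total = 0;

    /* O_DIRECT requires an aligned buffer (and aligned transfer sizes). */
    if (posix_memalign(&buf, 4096, bufsize)) return 1;

    int fd = open("/data/large_test_file", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* start the uncore counters here */
    while ((n = read(fd, buf, bufsize)) > 0)
        total += n;
    /* stop the uncore counters here */

    printf("read %lld bytes via O_DIRECT\n", total);
    close(fd);
    free(buf);
    return 0;
}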

Advanced Notes: 
(1) Caching is controlled by both the MTRRs and the Page Attribute Table (PAT) entries in the Page Tables. A PAT entry can be used to upgrade one of the Uncached memory types to Write Combining, but cannot "upgrade" an MTRR-based Uncached region to Cached (see the mapping sketch after these notes). This is good because only the cores can see the PATs (since only the cores know the original virtual address), and this policy ensures that the MTRRs (which are visible to the uncore) can be used by the hardware in the uncore to determine, based on the physical address, whether snooping is required.
(2) As I mentioned in another note on this forum (topic 401498), I have not yet figured out whether the PCIe "no snoop required" bit is used by the Intel hardware or ignored (leaving the snoop vs. no-snoop decision up to the MTRR settings). If the "no snoop required" bit is set and the hardware snoops anyway, you will always get the correct data even if the driver was incorrect in its assumption that setting the "no snoop required" bit was safe. In contrast, basing snoop vs. no-snoop on the MTRRs is always safe (assuming only that the MTRRs were correctly configured to be identical on all cores). If you know how to program a PCIe DMA engine to do transfers with and without this bit set, then you should be able to test the hypothesis using the CBo counters and/or the QPI counters (to see whether snoops associated with the PCIe transaction are being propagated to the other socket).
(3) There are no more advanced notes today.  :-)
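Regarding note (1): on Linux, the PAT-based upgrade from Uncached to Write Combining is what a driver asks for when it maps a BAR with ioremap_wc() instead of plain ioremap(). A minimal sketch (the device and BAR index are placeholders):

/* Sketch: mapping a BAR uncached (UC, the default) vs. write-combining
 * (WC, via a PAT entry).  The device and BAR index are placeholders;
 * kernel driver context is assumed. */
#include <linux/pci.h>
#include <linux/io.h>

static void map_bar_uc_vs_wc(struct pci_dev *pdev)
{
    resource_size_t start = pci_resource_start(pdev, 0);
    resource_size_t len   = pci_resource_len(pdev, 0);

    /* Default UC mapping: every readl/writel is an individual uncached access. */
    void __iomem *uc = ioremap(start, len);

    /* PAT-based WC mapping: stores may be combined into larger bursts, while
     * reads are still not cached.  This can only upgrade UC to WC -- it cannot
     * make an MTRR-Uncached region cacheable. */
    void __iomem *wc = ioremap_wc(start, len);

    if (uc && wc) {
        /* ... use the mappings ... */
    }
    if (wc) iounmap(wc);
    if (uc) iounmap(uc);
}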

 

funksoulbr
Beginner

First of all, thank you for your detailed answer. I really appreciate it!

John D. McCalpin wrote:

I have not had a chance to do any experiments with this yet, but I think that you are not seeing the MMIO transactions with the CBo performance monitors because your transactions are not of a type that the filter is set to count.

I am assuming that you are monitoring PCIE_DATA_BYTES from Table 2-15.  This is defined as
               (TOR_INSERTS.OPCODE with Cn_MSR_PMON_BOX_FILTER.opc=0x194
              + TOR_INSERTS.OPCODE with Cn_MSR_PMON_BOX_FILTER.opc=0x19C) * 64

I tested the CBo counters with the following simple netperf setup:

  • Xeon initiates netperf to remote machine: "PCIe read current" and "PCIe non-snoop read" events are counted.
  • Remote machine initiates netperf to Xeon: "PCIe Write (non-allocating)" events are counted.
  • Remaining filter opcodes that have PCIe in their name: do not count at all for a netperf or MMIO test.

MMIO transactions from CPU to PCIe BAR don't get counted by any of these filters.

John D. McCalpin wrote:

Going back to Table 2-13, we see that filter opcode 0x194 corresponds to "PCIe Write (non-allocating)" while 0x19C corresponds to "PCIe Write (allocating)". Looking over the rest of the items in Table 2-13, note that this metric does not include any PCIe read operations, nor does it include the PCIe Non-Snoop Write operations.

As a potentially important matter of definitions: it is not exactly clear how an MMIO read initiated by the processor is interpreted here -- it does retrieve data from the PCIe device, but that is not really the same as a write initiated by the PCIe device. Similarly, an MMIO write initiated by the processor does transfer data to the PCIe device, but it is not really the same as a read initiated by the PCIe device. It is possible that you need the PCIe device to *initiate* these transactions to get them counted by any of the counters labelled PCIe. Or not. Comments from Intel (or anyone else with low-level PCIe IO experience) would be welcome here!

Until now, I never thought about it that way. But it may very well be the case that the CBo counters only count read/write transactions initiated by the PCIe device! Unfortunately I cannot test this because I do not have a programmable DMA engine at hand. Comments from Intel on what is really counted here would be of great help!

John D. McCalpin wrote:

Back to the main story:  The PCIe MMIO space pointed to by the BARs corresponds to physical addresses in the "IO memory hole" of the system, which has a default mapping of Uncached -- either via an explicit MTRR, or more commonly by setting the default mapping for regions not mapped by an MTRR to uncached.   Some sub-regions of the IO hole address range will have an MTRR of WC (Write Combining), but that is also an uncached memory type. 

So when you map a BAR and perform reads and writes, you are almost certainly generating transactions that are known by the hardware to be uncacheable -- so snoops are not required, and the transactions (if counted at all) are counted by slightly different filter events. You may be able to see your MMIO transactions if you switch the filter opcodes to 0x1E4 ("PCIe Non-Snoop Read") for reads and 0x1E5 ("PCIe Non-Snoop Write (partial)") for writes. I don't think that you will be able to generate any 0x1E6 transactions ("PCIe Non-Snoop Write (full)") with this approach, but it is easy enough to check.

See above. MMIO is not counted at all by the CBo boxes and events from Table 2-15.

However, I was able to count MMIO with OFFCORE_RESPONSE:OTHER:NON_DRAM. The Intel SDM does say that the NON_DRAM response type counts MMIO accesses.
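For completeness, this is roughly how I set that event up (a sketch using perf_event_open with a raw OFFCORE_RESPONSE_0 event; the request/response bit positions in the auxiliary MSR value are my reading of the SDM and may well need checking):

/* Sketch: count OFFCORE_RESPONSE_0 with request=OTHER, response=NON_DRAM via
 * perf_event_open.  The bit positions below (OTHER = bit 15, NON_DRAM = bit 37)
 * are my reading of the SDM's offcore-response MSR layout -- please verify.
 * Counting kernel-mode accesses needs perf_event_paranoid <= 1 or root. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define OFFCORE_REQ_OTHER     (1ULL << 15)   /* assumed bit position */
#define OFFCORE_RSP_NON_DRAM  (1ULL << 37)   /* assumed bit position */

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x01B7;          /* event 0xB7, umask 0x01 = OFFCORE_RESPONSE_0 */
    attr.config1 = OFFCORE_REQ_OTHER | OFFCORE_RSP_NON_DRAM;  /* MSR_OFFCORE_RSP_0 value */
    attr.disabled = 1;
    /* Driver MMIO accesses happen in kernel mode, so do not exclude the kernel. */
    attr.exclude_kernel = 0;

    /* Count on the calling process, any CPU. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the MMIO test here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) perror("read");
    printf("OFFCORE_RESPONSE:OTHER:NON_DRAM = %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}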

BUT, this event also counts lots and lots of other transactions (I don't know exactly which ones), so the small amount of PCIe MMIO traffic gets lost in the noise on a busy system.

My original intention was to monitor the traffic that a core generates with read/write transactions to PCIe MMIO regions. So far I have not found a solution that counts ONLY these.

How much easier my life would be if there were counters for transactions to a programmable range of physical addresses :(

John D. McCalpin wrote:

If neither of these approaches works, then you will need to measure a test case in which you know that the PCIe device is initiating transactions. A simple test would be measuring the events before and after reading a large file. To avoid file-system caching effects you can either manually drop the caches before the test, or simply choose a file that you know has not been read since the system booted.

Advanced Notes: 
(1) Caching is controlled by both the MTRRs and the Page Attribute Table (PAT) entries in the Page Tables. A PAT entry can be used to upgrade one of the Uncached memory types to Write Combining, but cannot "upgrade" an MTRR-based Uncached region to Cached. This is good because only the cores can see the PATs (since only the cores know the original virtual address), and this policy ensures that the MTRRs (which are visible to the uncore) can be used by the hardware in the uncore to determine, based on the physical address, whether snooping is required.
(2) As I mentioned in another note on this forum (topic 401498), I have not yet figured out whether the PCIe "no snoop required" bit is used by the Intel hardware or ignored (leaving the snoop vs. no-snoop decision up to the MTRR settings). If the "no snoop required" bit is set and the hardware snoops anyway, you will always get the correct data even if the driver was incorrect in its assumption that setting the "no snoop required" bit was safe. In contrast, basing snoop vs. no-snoop on the MTRRs is always safe (assuming only that the MTRRs were correctly configured to be identical on all cores). If you know how to program a PCIe DMA engine to do transfers with and without this bit set, then you should be able to test the hypothesis using the CBo counters and/or the QPI counters (to see whether snoops associated with the PCIe transaction are being propagated to the other socket).
(3) There are no more advanced notes today.  :-)

 
