Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Memory Controller (iMC) Performance Monitoring

James_M_3
Beginner

Referencing "Intel Xeon Processor E5 v2 and E7 v2 Product Families Uncore Performance Monitoring Reference Manual", Reference number 329468-02, Feb 2014.

Section 2.5.7, iMC Box Common Metrics (Derived Events): I see the formulas here for MEM_BW_READS and MEM_BW_WRITES (CAS_COUNT.RD * 64 and CAS_COUNT.WR * 64 bytes, respectively).
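Just to sanity-check my reading of the formula, this is the sort of calculation I have in mind (Python, with made-up counter deltas standing in for real readings):

# Hypothetical CAS_COUNT deltas over a 1-second sample; real values would
# come from the iMC counters discussed below.
cas_rd_delta = 150_000_000        # CAS_COUNT.RD delta (made up)
cas_wr_delta = 50_000_000         # CAS_COUNT.WR delta (made up)
interval = 1.0                    # seconds

mem_bw_reads = cas_rd_delta * 64      # bytes read (64 bytes per cache line)
mem_bw_writes = cas_wr_delta * 64     # bytes written

print(f"read  {mem_bw_reads / interval / 1e9:.2f} GB/s")
print(f"write {mem_bw_writes / interval / 1e9:.2f} GB/s")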

Section 2.5.6, iMC Box Events Ordered By Code: I see the event code 0x04 for CAS_COUNT, and Table 2-79 on page 85 shows me the umask values for selecting reads, writes, etc. All good.

As I understand things, there are 4 memory channels per socket (in my case, I'm looking at a 2-socket Ivy Bridge box). Table 2-71, iMC Performance Monitoring MSRs (page 78), lists (among other things) 4 control registers, one for each of 4 counter registers. It appears that these register sets exist for each of the memory channels.

Are there 4 sets of control and counter registers for each channel, or is control/counter set 0 for channel 0, control/counter set 1 for channel 1, etc.? I believe there is a set of 4 registers per channel, which brings me to my second question: how do I determine the register address of each set of registers for each channel? As I have a 2-socket system, I would expect that I need to program the registers and read the counters for 8 channels, 4 per socket, so 8 sets of registers. What I am missing (sorry) is what the addresses are for each set of registers.

Here's an analogy. I just completed a simple Python script to track PCI throughput. The documentation contains a large table listing the addresses for each of the CBo register sets (C0 - C14, although the actual number will vary based on the system you're looking at). I expected a similar table listing the iMC MSR register addresses for each of the channels in each of N possible sockets.

Perhaps I need to add a base/offset value to move through each of the channels in each of the sockets?

Appreciate the guidance here.

Thanks

Jim M

1 Solution
McCalpinJohn
Honored Contributor III

I have not spent a lot of time with the Xeon E5 v2 uncore performance monitoring guide because we only have one fairly small system with these processors, but I noticed that some of the wording is confusing because they are trying to cover both the E5 and E7 in one document, and it is not always clear how to interpret the text.

IF you are only using Xeon E5 v2 processors and not Xeon E7 v2 processors, then I think you can ignore references to "MC1" -- all four DRAM channels are connected to MC0.     If your OS and BIOS talk to each other nicely, you will be able to see the corresponding PCI configuration space areas in the output of the "lspci" command.   Unfortunately the BIOS and OS do not talk nicely to each other on my only Xeon E5 v2 system, so none of these PCI configuration space devices are visible.  (They can still be accessed directly using memory-mapped IO, but this is a very labor-intensive approach and I try to avoid it....)

The output of lspci should look very similar to what you get on a Xeon E5 v1 system, e.g., containing lines like:

3f:10.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0 (rev 07)
3f:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)

...

3f:10.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2 (rev 07)
3f:10.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3 (rev 07)

...

7f:10.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0 (rev 07)
7f:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)

...

7f:10.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2 (rev 07)
7f:10.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3 (rev 07)

The first 2 characters are the bus number -- bus 3f corresponds to socket 0 and bus 7f corresponds to socket 1.

Device 10 (implicitly hex) corresponds to what the Uncore Performance Monitoring Reference Manual refers to as "Device 16".

Functions 0, 1, 4, 5 contain the performance monitoring registers for the four channels.   You may notice that the mapping reported in the Uncore Performance Monitoring Guide differs between the Xeon E5 v1 (table 2-59 in the Xeon E5 v1 guide: 327043-001) and the Xeon E5 v2 (table 2-71 in the Xeon E5 v2 guide: 329468-002).   I don't know if the v2 guide is correct in this case -- there are certainly cases in the v3 guide where the uncore performance monitoring guide's description does not match the hardware.    The specific mapping of devices to channels probably does not make a lot of difference -- traffic is usually spread pretty evenly, and if you really need to know which device connects to which channel you can pretty easily generate test patterns that will hit only one channel at a time.  (Stride 256 accesses, starting at offsets 0B, 64B, 128B, and 192B should access only channels 0, 1, 2, and 3, respectively.  Disable the HW prefetchers to prevent spurious accesses.  If you really want to know how this maps to the physical DIMMs on the motherboard, you can boot the system with some DIMMs removed and see which channels no longer get counts.)
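For illustration only, here is a rough user-space approximation of that test pattern in Python; the 1 GiB buffer size is an arbitrary choice (just much larger than the LLC), and interpreter overhead adds unrelated traffic, so a small C loop with the prefetchers disabled would give much cleaner channel isolation:

# Touch one 64-byte line in every 256-byte block of a large buffer,
# starting at a chosen offset, so that (per the mapping described above)
# the cache misses should land mostly on a single DRAM channel.
buf = bytearray(1 << 30)                  # 1 GiB, far larger than the LLC

def touch(offset):
    total = 0
    for addr in range(offset, len(buf), 256):
        total += buf[addr]                # one read per 256-byte block
    return total

for offset in (0, 64, 128, 192):          # channels 0..3 respectively (assumed)
    touch(offset)                         # watch CAS_COUNT per channel while this runs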

So you can see from the lspci output above that there are 4 PCI configuration space devices for each socket.  Each of these corresponds to one DRAM channel and supports four performance counters for that channel.   Within each PCI configuration space device, the four counters are programmed by 32-bit fields at offsets D8, DC, E0, and E4, and the counts appear in two consecutive 32-bit fields starting at offsets A0, A8, B0, and B8.
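To make that layout concrete, here is a rough, untested Python sketch of programming and reading one counter pair per channel through the Linux sysfs PCI config files (run as root). The bus numbers are the ones from the lspci output above, the CAS_COUNT event/umask encodings and the enable-bit position are my assumptions from the manual, and a box-level unfreeze may also be needed on some systems; verify everything against 329468-002.

import struct, time

SOCKET_BUSES  = ["3f", "7f"]                      # from lspci above; system-specific
CHANNEL_DEVFN = ["10.0", "10.1", "10.4", "10.5"]  # device 0x10, functions 0/1/4/5
CTL_OFFSETS   = [0xD8, 0xDC, 0xE0, 0xE4]          # per-counter control registers
CTR_OFFSETS   = [0xA0, 0xA8, 0xB0, 0xB8]          # counters, two 32-bit fields each

def pmon_ctl(event, umask):
    # Assumed uncore PMON control layout: event [7:0], umask [15:8], enable bit 22.
    return event | (umask << 8) | (1 << 22)

EVENTS = [pmon_ctl(0x04, 0x03),    # CAS_COUNT.RD (encoding assumed; see Table 2-79)
          pmon_ctl(0x04, 0x0C)]    # CAS_COUNT.WR (encoding assumed)

def cfg(bus, devfn):
    return f"/sys/bus/pci/devices/0000:{bus}:{devfn}/config"

def program(bus, devfn):
    with open(cfg(bus, devfn), "r+b") as f:
        for off, ctl in zip(CTL_OFFSETS, EVENTS):
            f.seek(off)
            f.write(struct.pack("<I", ctl))

def read_counters(bus, devfn):
    with open(cfg(bus, devfn), "rb") as f:
        out = []
        for off in CTR_OFFSETS[:len(EVENTS)]:
            f.seek(off)
            out.append(struct.unpack("<Q", f.read(8))[0])   # 48-bit count, little-endian
        return out

for bus in SOCKET_BUSES:
    for devfn in CHANNEL_DEVFN:
        program(bus, devfn)
before = {(b, d): read_counters(b, d) for b in SOCKET_BUSES for d in CHANNEL_DEVFN}
time.sleep(1.0)
for (b, d), old in before.items():
    rd, wr = [new - o for new, o in zip(read_counters(b, d), old)]
    print(f"{b}:{d}  read {rd * 64 / 1e9:6.2f} GB/s   write {wr * 64 / 1e9:6.2f} GB/s")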

On our production systems we program the following four events for each channel:

  1. CAS_COUNT.RD
  2. CAS_COUNT.WR
  3. PRE_COUNT.PAGE_MISS
  4. ACT_COUNT

This is enough to compute the read and write bandwidths, and the page hit/page empty/page miss rates.  (What Intel calls "page empty" is usually called a "page miss" and what Intel calls "page miss" is usually called a "page conflict", but there is enough information in the document to figure out what they mean.)
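As an illustration of that arithmetic (not our actual tooling), with made-up counter deltas and formulas that follow my reading of the derived events in section 2.5.7:

# Made-up per-channel counter deltas for one sample interval.
cas_rd   = 150_000_000     # CAS_COUNT.RD
cas_wr   =  50_000_000     # CAS_COUNT.WR
act      =  80_000_000     # ACT_COUNT
pre_miss =  30_000_000     # PRE_COUNT.PAGE_MISS

requests = cas_rd + cas_wr                   # every CAS is one read or write request
page_hit      = (requests - act) / requests  # no activate needed: row already open
page_empty    = (act - pre_miss) / requests  # activate into an idle (precharged) bank
page_conflict = pre_miss / requests          # precharge forced by a different open row

read_bw_bytes  = cas_rd * 64                 # same bandwidth formula as above
write_bw_bytes = cas_wr * 64

print(f"hit {page_hit:.1%}  empty {page_empty:.1%}  conflict {page_conflict:.1%}")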

2 Replies
James_M_3
Beginner

I am still getting through this, attempting to determine how I can construct the correct register addresses from the PCI configuration space in which the iMC registers live for each of the sockets and channels. The best information I have found thus far is from a reply from Dr. McCalpin:

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/535023
