Community
Ben_O_
Beginner

How to Record/Calculate Channel Stalls


I am trying to determine how much or how often application performance is constrained by the available channel bandwidth. Is there a way to determine when the memory system stalls due to congestion on the memory channels?


Accepted Solutions
McCalpinJohn
Black Belt

The concept of a "stall" seems straightforward, but in modern hardware it is actually extremely difficult to define precisely.

Lots of details are discussed at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...

Another technique is to vary the CPU core frequency (and DRAM channel frequency, if your BIOS supports that) and see how the execution time changes.  This is discussed briefly at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...

 


Replies
TimP
Black Belt
68 Views
You may get better answers by checking references cited on the vtune amplifier or software tuning forum and following up with questions there.

Ben_O_
Beginner

Those links are extremely helpful, John. Reading through them, and considering the specific project I'm working on, it seems that an estimate of the LLC miss latency is sufficient for my purposes. Watching for patterns in how that latency changes, with a sizable threshold, should be enough to tell whether a particular channel is being overworked.

I'll dive further into reading these counters first (using your reply at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring..., and then return if I've got any more specific questions.

Thanks for your help!

Ben_O_
Beginner

One quick question: one of Patrick Fay's comments says that I can use the UNC_CLOCKTICKS and UNC_ARB_TRK_REQUESTS.ALL events to calculate the effective latency, but I can't find anything that says which MSR address UNC_CLOCKTICKS corresponds to; moreover, searching for that event online yields no other results. I've also searched through the entire Architecture Developer's Manual, which doesn't explain how to measure uncore clockticks either. Can you point me in the right direction on how to measure these events? The Intel PCM sources (types.h and cpucounters.cpp) don't seem to help, either.

McCalpinJohn
Black Belt

The uncore clock can be measured using a variety of "boxes" in the uncore.  This is documented in Intel document 331051, "Intel Xeon Processor E5 and E7 v3 Family Uncore Performance Monitoring Reference Manual", revision 002, June 2015.  (The corresponding document for the Xeon E5 v2 (Ivy Bridge) platforms is 329469, and the corresponding document for the Xeon E5 v1 (Sandy Bridge) is 327043.)

There are several different clock frequencies in use in the uncore, so if you use a counter in a different box than the one you are interested in, you need to make sure the two boxes are running at the same frequency.  See detailed notes below...

Most of the boxes in the uncore run at the "Uncore Clock Frequency", and the easiest Uncore Clock counter to use is probably the UBox fixed-function counter, described in Section 2.2.2 of the Xeon E5 v3 uncore guide.  This one is convenient to use because it is accessed through an MSR (rather than PCI configuration space) and since it is a fixed-function counter it can't be used for anything else.

 

More details:

Most of the boxes in the uncore run at the "Uncore Clock Frequency":

  • Caching Agent (Cbo): Uncore Clock Frequency
  • Home Agent (HA): Uncore Clock Frequency
  • Ring to PCIe (R2PCIe): Uncore Clock Frequency
  • Ring to QPI (R3QPI): Uncore Clock Frequency
  • Ring Transfer (SBox): Uncore Clock Frequency
  • UBox: Uncore Clock Frequency

But a few of the units count at a different frequency:

  • Memory Controller (IMC): DRAM channel frequency (fixed)
  • Power Control Unit (PCU): 800 MHz (fixed)
  • QPI Link Layer: QPI base frequency (fixed)

The Uncore Clock Frequency can vary over time.

  • In many configurations it will track the CPU core frequency. 
    • I usually set the Uncore Clock Frequency to "maximum" in the BIOS to get the lowest latency and the best overall performance (at the cost of slightly increased power consumption).  
    • I still have to monitor the frequency because it will be throttled just like the CPU cores if the package hits a power limit or thermal limit. 
  • The UBox fixed function cycle counter does not appear to count when the chip is in a "package C-state".
    • I usually put a "spinner" process on one core of a chip if I don't want it to go into a package C-state.
  • I have not checked the cycle counters in the other uncore boxes to see if they count while the chip is in a package C-state.
    • The PCU looks like the best bet in this case, since the PCU controls the package C states.
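To make the UBox fixed-function counter concrete, here is a minimal sketch of sampling it through the Linux msr driver. The MSR addresses (0x703 control, 0x704 counter) and the 48-bit counter width are my reading of Section 2.2.2 of document 331051, and the script needs root plus "modprobe msr" - verify all of this against the manual before trusting it:

```python
# Hypothetical sketch: sample the Xeon E5 v3 UBox fixed-function uncore
# clock counter via the Linux "msr" driver.  Addresses assumed from
# Intel document 331051, Section 2.2.2 -- verify before relying on them:
#   U_MSR_PMON_UCLK_FIXED_CTL = 0x703  (bit 22 enables counting)
#   U_MSR_PMON_UCLK_FIXED_CTR = 0x704  (48-bit counter, assumed width)
import struct
import time

UCLK_FIXED_CTL = 0x703
UCLK_FIXED_CTR = 0x704
CTR_WIDTH_BITS = 48

def rdmsr(cpu, reg):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (requires root)."""
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]

def wrmsr(cpu, reg, value):
    """Write one 64-bit MSR through /dev/cpu/<cpu>/msr (requires root)."""
    with open(f"/dev/cpu/{cpu}/msr", "wb") as f:
        f.seek(reg)
        f.write(struct.pack("<Q", value))

def counter_delta(before, after, width=CTR_WIDTH_BITS):
    """Difference of two counter reads, tolerating one wraparound."""
    return (after - before) % (1 << width)

if __name__ == "__main__":
    try:
        cpu = 0                              # any core on the socket of interest
        wrmsr(cpu, UCLK_FIXED_CTL, 1 << 22)  # set the enable bit
        t0 = rdmsr(cpu, UCLK_FIXED_CTR)
        time.sleep(1.0)
        t1 = rdmsr(cpu, UCLK_FIXED_CTR)
        print(f"uncore clock ~ {counter_delta(t0, t1) / 1e9:.2f} GHz")
    except OSError:
        print("msr driver not available; run as root after 'modprobe msr'")
```

Note that, per the caveats above, the reading is only meaningful if the chip stays out of package C-states for the whole interval (e.g. with a spinner process pinned to the socket).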
Ben_O_
Beginner

This is extremely useful, thank you.

Today, after referring to the manual that you mentioned, I was able to read clockticks using one of the CBo counters.  Thus, I'm well on my way to getting the information that I need!

Thanks again.

Ben_O_
Beginner

I've been trying to get it to work all day, but to no avail. In Patrick's post that you linked earlier, he said:

More events you can use: UNC_ARB_TRK_OCCUPANCY.ALL/UNC_CLOCKTICKS which will tell you the average number of memory requests outstanding per uncore clocktick. This gives you an idea of how many requests are simultaneously outstanding.

Also UNC_ARB_TRK_OCCUPANCY.ALL/UNC_ARB_TRK_REQUESTS.ALL tells you average uncore clockticks a memory request is allocated in LLC. This is usually referred to as the LLC latency (Last Level Cache miss latency in uncore clockticks per LLC miss). It doesn't include the time to fetch the cache line from LLC to L1.

The effective latency is UNC_CLOCKTICKS / UNC_ARB_TRK_REQUESTS.ALL (in uncore clockticks / LLC miss)
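For reference, those three ratios can be sketched as plain arithmetic over the deltas of the raw counters across a sampling interval. The counter values in the example are invented for illustration, not measurements:

```python
# Derived metrics from Patrick Fay's ratios, computed over the deltas
# of the raw counters across one sampling interval.

def arb_metrics(occupancy, requests, clockticks):
    """occupancy  = delta of UNC_ARB_TRK_OCCUPANCY.ALL
       requests   = delta of UNC_ARB_TRK_REQUESTS.ALL
       clockticks = delta of UNC_CLOCKTICKS (uncore cycles)"""
    return {
        # average memory requests outstanding per uncore clock
        "avg_outstanding": occupancy / clockticks,
        # average uncore clocks a request stays allocated (LLC miss latency)
        "llc_latency_clk": occupancy / requests,
        # effective latency: uncore clocks per LLC miss
        "effective_latency_clk": clockticks / requests,
    }

# Made-up sample: 1M occupancy ticks, 20k requests, 2.5M uncore clocks
m = arb_metrics(occupancy=1_000_000, requests=20_000, clockticks=2_500_000)
print(m)  # avg_outstanding=0.4, llc_latency_clk=50.0, effective_latency_clk=125.0
```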

This makes sense to me, so I set out to gather those two events (UNC_CLOCKTICKS and UNC_ARB_TRK_REQUESTS.ALL) to see whether I got a reasonable value for the LLC latency. I did successfully acquire the clockticks using your advice; however, I'm unable to get the other value, UNC_ARB_TRK_REQUESTS.ALL. Looking through the manual, it appears that this event is only available on i7, i5, and i3 Sandy Bridge processors and on all Haswell processors, whereas the machines I'm working on have Sandy Bridge Xeons.

Do you know of an equivalent event that my processor supports?

McCalpinJohn
Black Belt

This is described in Section 2.3 of document 327043 ("Intel Xeon Processor E5-2600 Product Family Uncore Performance Monitoring Guide").

The main data structure that tracks transactions in the L3 is the "TOR" (Table Of Requests).  Only Counter 0 in each CBo can accumulate occupancy, but Counters 2 and 3 can track "companion events" that can be useful (such as counting cycles in which there are no entries in the TOR).

Table 2-15 describes how to count and compute average latency in the TOR for read hits and misses.  I have not tried these, so I don't know if the results agree with expectations.
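As a starting point, the register layout for programming CBo counter 0 can be sketched as below. Everything here is an assumption from my reading of document 327043 (and of how tools like Intel PCM address the Sandy Bridge EP CBos): the base addresses, the 0x20 per-CBo stride, and the TOR event/umask encodings should all be checked against Section 2.3 and Table 2-15 before use:

```python
# Hypothetical sketch of addressing CBo counter 0 for TOR occupancy on a
# Sandy Bridge EP (Xeon E5-2600).  Assumed from document 327043 -- verify:
#   C0_MSR_PMON_BOX_CTL = 0xD04, C0_MSR_PMON_CTL0 = 0xD10,
#   C0_MSR_PMON_CTR0 = 0xD16, per-CBo stride 0x20
#   TOR_OCCUPANCY event 0x36, TOR_INSERTS event 0x35
CBO_BASE_CTL = 0xD04
CBO_CTL0_OFF = 0x0C      # 0xD10 - 0xD04
CBO_CTR0_OFF = 0x12      # 0xD16 - 0xD04
CBO_STRIDE   = 0x20

def cbo_event(event, umask):
    """Encode a CBo PMON control register: event in bits [7:0],
    umask in bits [15:8], enable in bit 22 (standard uncore PMON layout)."""
    return (event & 0xFF) | ((umask & 0xFF) << 8) | (1 << 22)

def cbo_regs(cbo):
    """MSR addresses for one CBo, derived from the assumed base + stride."""
    base = CBO_BASE_CTL + cbo * CBO_STRIDE
    return {"box_ctl": base,
            "ctl0": base + CBO_CTL0_OFF,
            "ctr0": base + CBO_CTR0_OFF}

# Average TOR miss latency per Table 2-15: occupancy / inserts, both
# restricted to misses.  The 0x0A umask (MISS_ALL) is an assumption.
TOR_OCCUPANCY_MISS = cbo_event(0x36, 0x0A)
TOR_INSERTS_MISS   = cbo_event(0x35, 0x0A)

print(hex(cbo_regs(1)["ctl0"]))  # 0xd30 for CBo 1
```

Remember that only counter 0 in each CBo can accumulate the occupancy event, so the occupancy/inserts pair has to be split across counters within a CBo.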