Solved: uncore event counter reading

yang_s_ · ‎03-19-2015

Hi everybody,

I want to read some uncore event counters on an Intel Xeon E5620. I recently read about PCM but found it really hard to follow the program logic, and it does not include the uncore events I want, such as UNC_GQ_ALLOC.READ_TRACKER. I took pcm-tsx.cpp from PCM and replaced its events with mine, just to see whether this would work. I do get some values, but I am not sure this is the right thing to do.

I also want to know if there are simpler tools to read these events.

Thanks.

Yang

McCalpinJohn · ‎03-19-2015

There is little related to the use of performance counters that can be considered "simple" :-(

If you are running a Linux operating system and have root access, then it is relatively easy to program these counters directly since they are exposed through the core performance counter interface which can be accessed through the /dev/cpu/*/msr device drivers. Reading the counters can be done through the same interface, or (if your kernel allows it) by inline use of the RDPMC instruction.

If the code does not need to be portable it can be a lot shorter than Intel's PCM. The information is all provided in section 18.2.2.2 of Volume 3 of the Intel Architecture SW Developer's Manual (document 325384-053).

There is a learning curve to get this stuff working. I generally use the "rdmsr" and "wrmsr" routines from "msrtools-1.2" to read and write the MSRs from the shell level. When I need to read/write the MSRs in C code, I just copy the relevant lines of code from "rdmsr.c" and/or "wrmsr.c", which use "pread()" or "pwrite()" calls to read/write the MSRs via the /dev/cpu/*/msr device drivers.
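A minimal sketch of that pread()/pwrite() approach, assuming the Linux msr driver is loaded and the program runs as root. The helper names (msr_path, read_msr, write_msr) are made up for illustration and are not taken from msr-tools; the MSR number is simply used as the file offset.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build "/dev/cpu/<cpu>/msr". Returns 0 on success. */
static int msr_path(char *buf, size_t len, int cpu)
{
    int n = snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read one 64-bit MSR on one logical processor. Returns 0 on success. */
static int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    if (msr_path(path, sizeof path, cpu) != 0)
        return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;          /* driver not loaded, no such CPU, or no permission */
    ssize_t got = pread(fd, value, sizeof *value, (off_t)msr);
    close(fd);
    return got == (ssize_t)sizeof *value ? 0 : -1;
}

/* Write one 64-bit MSR on one logical processor. Returns 0 on success. */
static int write_msr(int cpu, uint32_t msr, uint64_t value)
{
    char path[64];
    if (msr_path(path, sizeof path, cpu) != 0)
        return -1;
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t put = pwrite(fd, &value, sizeof value, (off_t)msr);
    close(fd);
    return put == (ssize_t)sizeof value ? 0 : -1;
}
```

Opening the device on every access keeps the sketch short; code that reads counters around tight loops would more likely open each file once during setup and reuse the descriptor.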

Once you are familiar with the access mechanisms, then the procedure is:

1. Enable the counters by writing bits 3:0 of MSR IA32_PERF_GLOBAL_CTRL on each logical processor.
2. Program the counters by writing the MSRs IA32_PERFEVTSEL0 through IA32_PERFEVTSEL3 on each logical processor.
3. Read the counters by either using MSR IA32_PMC0 through IA32_PMC3 on each logical processor, or by using an inline assembler function to execute the RDPMC instruction with the desired counter argument (in register ECX), and concatenate the EDX (high-order 32 bits) and EAX (low-order 32 bits) to get the full 48-bit counter value.

Notes:

1. The uncore counters described in Table 19-16 of Vol 3 of the SW Developer's Manual only need to be read once on each chip, but there is no harm in reading them using each core; you will just get multiple copies of the same result.
2. Since you are handling all of these accesses manually, you will have to monitor for overflow/wraparound of the counters and compensate when this occurs.
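The register values involved in these steps can be sketched as below. The bit positions follow the architectural IA32_PERFEVTSELx layout (event select in bits 7:0, unit mask in bits 15:8, USR bit 16, OS bit 17, enable bit 22); the macro and function names are illustrative assumptions, and the RDPMC wrapper only compiles on x86 with GCC-style inline assembly and will fault in user mode unless the kernel permits RDPMC.

```c
#include <stdint.h>

/* Architectural MSR numbers for the core counters. */
#define IA32_PMC0              0x0C1u
#define IA32_PERFEVTSEL0       0x186u
#define IA32_PERF_GLOBAL_CTRL  0x38Fu

/* Bits 3:0 of IA32_PERF_GLOBAL_CTRL enable PMC0..PMC3. */
#define GLOBAL_CTRL_PMC0_3     0xFULL

/* Selected IA32_PERFEVTSELx control bits. */
#define EVTSEL_USR (1ULL << 16)   /* count in user mode */
#define EVTSEL_OS  (1ULL << 17)   /* count in kernel mode */
#define EVTSEL_EN  (1ULL << 22)   /* enable this counter */

/* Hypothetical helper: assemble an event-select value. */
static uint64_t make_evtsel(uint8_t event, uint8_t umask, uint64_t flags)
{
    return (uint64_t)event | ((uint64_t)umask << 8) | flags;
}

/* Concatenate EDX (high 32 bits) and EAX (low 32 bits). */
static uint64_t combine_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Execute RDPMC with the counter index in ECX. */
static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t eax, edx;
    __asm__ volatile("rdpmc" : "=a"(eax), "=d"(edx) : "c"(counter));
    return combine_edx_eax(edx, eax);
}
#endif
```

For example, make_evtsel(0x03, 0x20, EVTSEL_USR | EVTSEL_OS | EVTSEL_EN) yields 0x432003.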

yang_s_ · ‎03-25-2015

That's really helpful, John. I tried msr-tools as you suggested and read some events on Linux. The shell script looks like this:

#!/bin/bash

for (( i=0; i<4; i++ ))
do
    # set IA32_PERF_GLOBAL_CTRL on logical processor $i
    wrmsr -p$i 0x38f 0x70000000f

    # set IA32_PERFEVTSEL0-3 on logical processor $i
    wrmsr -p$i 0x186 0x130201
    wrmsr -p$i 0x187 0x130301
    wrmsr -p$i 0x188 0x130302
    wrmsr -p$i 0x189 0x130320
done

# read IA32_PMC0-3
while true
do
    for (( i=0; i<4; i++ ))
    do
        r1=$(rdmsr -p$i 0xc1)
        r2=$(rdmsr -p$i 0xc2)
        r3=$(rdmsr -p$i 0xc3)
        r4=$(rdmsr -p$i 0xc4)
        echo "cpu $i: $r1 $r2 $r3 $r4"
    done
done

The values I got did not change. Is this right? Am I doing the right thing?

Also, can the counters be reset? I mean, if overflow/wraparound occurs, should I reset the counters? And if I want to read another set of events, can I just change IA32_PERFEVTSEL0-3?

McCalpinJohn · ‎03-25-2015

I think you are missing the "enable" bit in the PERF_EVT_SEL registers -- bit 22. They should look like 0x00430301 (for your first example).

You can clear the registers by writing 0 to the IA32_PMC* registers. I don't usually bother -- I make sure that I read the counters often enough that they cannot wrap around more than one time (2^48 increments), then I simply add 2^48 to the result if the ending counter value is smaller than the initial counter value.
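That wraparound correction can be sketched as follows, assuming the counter wrapped at most once between the two reads. The function name and the width parameter are illustrative; 48 is the width mentioned above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of increments between two raw counter reads,
 * assuming at most one wraparound of a width_bits-wide counter.
 */
static uint64_t counter_delta(uint64_t start, uint64_t end, unsigned width_bits)
{
    /* A 64-bit counter wraps naturally in unsigned arithmetic. */
    if (width_bits >= 64 || end >= start)
        return end - start;
    /* The end value is smaller than the start value: add 2^width. */
    return end + (1ULL << width_bits) - start;
}
```

With a 48-bit counter, a start value of 0xFFFFFFFFFFF0 and an end value of 0x10 give a delta of 0x20.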

yang_s_ · ‎03-29-2015

Hi John,

I tried to read the events UNC_GQ_ALLOC.WRITE_TRACKER (03H 20H) and UNC_GQ_ALLOC.PEER_PROBE_TRACKER (03H 40H) from Table 19-16 in Volume 3 of the Intel developer manual, but they seem to be unreadable.

wrmsr -p$j 0x188 0x432003   # UNC_GQ_ALLOC.WRITE_TRACKER
wrmsr -p$j 0x189 0x434003   # UNC_GQ_ALLOC.PEER_PROBE_TRACKER

I also want to read UNC_GQ_OCCUPANCY.WRITE_TRACKER and UNC_GQ_OCCUPANCY.PEER_PROBE_TRACKER, which are not listed in that manual. They are mentioned in the Intel® 64 and IA-32 Architectures Optimization Reference Manual 2014, Order Number: 248966-030 (Appendix B), but Table 19-16 in Volume 3 of the developer manual has only UNC_GQ_OCCUPANCY.READ_TRACKER. How should I get the other two events?

Thank you.

McCalpinJohn · ‎03-29-2015

It took me a while, but I think I figured this out....

According to Section 18.7.2 of Volume 3 of the Intel SWDM (revision 053), the Nehalem processors have a different set of 8 performance monitor registers for the uncore. The functionality is described in Section 18.7.2, while the MSR numbers for these uncore performance monitoring registers are shown in Table 35-12 in section 35.5.1. Westmere-based systems inherit these MSRs from Nehalem, and add the MSRs in section 35.6, Table 35-14. Only one of these is directly relevant to performance counters -- MSR 1A7H, which is the control register for the second "offcore response event" (core performance counter event 0xBB).

So instead of programming the Core performance counters (MSRs 0x186 to 0x189) with the events in Table 19-16, you need to be programming MSRs 0x3C0 to 0x3C7 (along with the associated "global" control registers). Then you will read the counts from the uncore performance monitoring MSRs (0x3B0 to 0x3B7) instead of from the core performance counter MSRs (0xC1 to 0xC4).
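As a small sketch of that register mapping, the helpers below translate an uncore counter index (0 to 7) into the event-select and counter MSR numbers quoted above. The names are hypothetical, and the uncore "global" control registers are left out because their numbers are not given here.

```c
#include <stdint.h>

/* MSR ranges as quoted above for the Nehalem/Westmere uncore. */
#define UNC_PERFEVTSEL_BASE 0x3C0u   /* event selects: 0x3C0..0x3C7 */
#define UNC_PMC_BASE        0x3B0u   /* counters:      0x3B0..0x3B7 */
#define UNC_NUM_COUNTERS    8u

/* Hypothetical helper: event-select MSR for uncore counter idx. */
static int unc_evtsel_msr(unsigned idx, uint32_t *msr)
{
    if (idx >= UNC_NUM_COUNTERS)
        return -1;
    *msr = UNC_PERFEVTSEL_BASE + idx;
    return 0;
}

/* Hypothetical helper: counter MSR for uncore counter idx. */
static int unc_pmc_msr(unsigned idx, uint32_t *msr)
{
    if (idx >= UNC_NUM_COUNTERS)
        return -1;
    *msr = UNC_PMC_BASE + idx;
    return 0;
}
```

The exact bit layout of the uncore event-select registers should be checked in Section 18.7.2 rather than assumed to match the core IA32_PERFEVTSELx layout.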

I think I knew about this at one time, but forgot about it when the Sandy Bridge systems came out with a completely different approach to uncore performance monitoring.

yang_s_ · ‎03-30-2015

Yeah, this really helps me. Thank you very much. If I could meet you, I would give you a real hug!

yang_s_ · ‎04-08-2015

Hi John, I have some more questions.

I read QPI events from the counters using the MSRs. I got the values of UNC_GQ_OCCUPANCY.READ_TRACKER and UNC_GQ_ALLOC.READ_TRACKER, and I calculated the read latency as UNC_GQ_TRACKER_OCCUP.RT/UNC_GQ_ALLOC.RT. When I run a SPEC2006 benchmark, the latency is large at first and gradually becomes smaller, then seems to stay steady. Even after the benchmark is over, the latency does not drop immediately. It stays steady for a while and then returns to the value it had before the application ran. Why is this happening? I didn't handle counter overflow; is that related?

McCalpinJohn · ‎04-08-2015

I think it would be best to start with some carefully designed tests to develop an understanding of these events. (I have not used the uncore counters on the Nehalem/Westmere systems, so I don't have any immediate insights.)

I would probably start by running simple tests like STREAM and "lat_mem_rd" from lmbench. Since you have root access to program the MSRs, you can put the MSR programming in-line in the program under test to get counts across specific loops or functions. I simply copied the bits of code from "rdmsr.c" and "wrmsr.c" that open the /dev/cpu/*/msr files and put it in the setup of my program, then I copied the code that reads/writes the registers into a simple library that I can call before and after loops and/or functions.

You will need to check the documentation in Chapters 18 and 19 of Volume 3 of the SWDM to be sure about the width of these registers, then compute the minimum amount of time it takes for a full cycle. (This is almost never a problem for 48-bit counters, but I don't know if these counters are that wide. I seem to recall seeing counters in Intel processors that are 32 bits wide, 40 bits wide, 44 bits wide, 48 bits wide, or 64 bits wide.)
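The "time for a full cycle" estimate is simple arithmetic: 2^width divided by the maximum increment rate. A sketch, with a hypothetical function name and an assumed rate of one increment per cycle at an illustrative 3 GHz clock (events such as occupancy can increment by more than one per cycle, which shortens the interval):

```c
#include <stdint.h>

/*
 * Hypothetical helper: seconds until a width_bits-wide counter wraps
 * when it advances by increments_per_second.
 */
static double seconds_to_wrap(unsigned width_bits, double increments_per_second)
{
    /* 2^64 does not fit in uint64_t, so that case uses a literal. */
    double range = (width_bits >= 64)
        ? 18446744073709551616.0
        : (double)(1ULL << width_bits);
    return range / increments_per_second;
}
```

Under those assumptions a 48-bit counter takes roughly 26 hours to wrap, while a 32-bit counter wraps in under two seconds, which is why the width matters.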

With the hardware prefetcher disabled, you should be able to compare the latency estimates from "lat_mem_rd" against the uncore GQ tracker average latency. These should be strongly correlated as you change the location of the data (local vs remote) or change the CPU frequency (using acpi-cpufreq and the "userspace" governor). The average number of queue entries should be 1 for "lat_mem_rd" with the prefetchers disabled.

The relationship "latency * bandwidth = concurrency" is typically used to compute any of the three values when the other two are known, but if all three values are known, the equation can be used as a "sanity" check on the values. When you are looking at a queue-based system, the equation should be using "occupancy" instead of "latency", but in a well-designed system these values are typically close to each other.
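These relationships can be written down as a few helpers for sanity checking. This is only a sketch: the function names are invented, the latency formula is the occupancy/allocation ratio used earlier in the thread, and the average-entries formula assumes the occupancy event accumulates the number of valid queue entries once per uncore cycle.

```c
#include <stdint.h>

/* Average latency in cycles: accumulated occupancy per allocated entry. */
static double avg_latency_cycles(uint64_t occupancy_delta, uint64_t alloc_delta)
{
    return alloc_delta ? (double)occupancy_delta / (double)alloc_delta : 0.0;
}

/* Average number of queue entries in use over the measured interval. */
static double avg_queue_entries(uint64_t occupancy_delta, uint64_t cycles)
{
    return cycles ? (double)occupancy_delta / (double)cycles : 0.0;
}

/* latency * bandwidth = concurrency, in consistent units. */
static double concurrency(double latency, double bandwidth)
{
    return latency * bandwidth;
}
```

For a pointer-chasing test with prefetchers disabled, avg_queue_entries would be expected to come out near 1, and concurrency computed from independently measured latency and bandwidth should land close to the same value.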

The configuration bits for the UNC_GQ_OCCUPANCY.WRITE_TRACKER and UNC_GQ_OCCUPANCY.PEER_PROBE_TRACKER are included in the Intel Amplifier XE ("VTune") database files "corei7_unc_db.txt" and "corei7jf_unc_db.txt", along with a variety of additional related events that can be used to compute average occupancy & latency. I am not sure why Table 19-16 in V3 of the SWDM does not include the Umask values for the Write Tracker and Peer Probe Tracker for Event 0x02 (UNCORE_GQ_OCCUPANCY), but the values are the same as those used in Event 0x03 (UNCORE_GQ_ALLOC).

yang_s_ · ‎08-03-2015

Hi, it's me again.

Are the uncore events and monitoring counters on the Ivy Bridge microarchitecture the same as those on Westmere?

I looked at the Intel Software Developer's Manual (order number: 325462-052US) and found that chapter 18 does not describe performance monitoring on Ivy Bridge, and the uncore events in chapter 19 appear only in the description of Westmere.

McCalpinJohn · ‎08-04-2015

Performance monitoring on Ivy Bridge is the same as on the corresponding Sandy Bridge processor. There are three different uncores -- one for the "client" part, one for the standard server part, and one for the high-end Xeon E7 part. These all have completely different uncore performance monitors, but the Ivy Bridge versions of each should be the same as the Sandy Bridge versions.

yang_s_ · ‎08-06-2015

Thanks John. I read about performance monitoring on the Sandy Bridge processor as you suggested. I found that I only need to monitor the GQ events (as in the posts above). Should I still use MSRs like IA32_PERF_GLOBAL_CTRL and IA32_PERFEVTSELx to do this on Ivy Bridge? Are the operations the same as on Westmere?

I do not have a machine to try this on right now. Can you give me a hint about these questions? Thanks again.

McCalpinJohn · ‎08-06-2015

The core performance counter infrastructure on SandyBridge/IvyBridge is nearly identical to the infrastructure on Westmere, so most software can be easily re-used, but many of the specific events and umasks have changed, so some changes are required.

The uncore performance counter infrastructure in SandyBridge/IvyBridge is completely different than in Nehalem/Westmere, and it is almost completely different across the three different uncores used in SandyBridge/IvyBridge systems.

So the first thing you will need to figure out is which of the platforms you are likely to get access to. The best documented platform is the Xeon E5-2xxx, which is a 2-socket server platform similar to your Xeon E5620. The uncore counters are documented in the Xeon E5 Uncore Performance monitoring guides -- document 327043 for Sandy Bridge Xeon E5 processors and document 329468 for the Ivy Bridge Xeon E5 v2 processors.

For all of the Xeon E5 uncores, performance monitoring is done via either MSRs (for some uncore units) or via PCI configuration space (for the other uncore units). I think that the events analogous to the GQ_TRACKER events that you were looking at on the Westmere platform are the TOR (Table of Requests) events in the uncore CBo (Coherence Box), described in Section 2.3 of document 329468 (for Xeon E5 v2/Ivy Bridge). These are accessed by MSRs, so at least the software infrastructure should be similar.