Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How are MSR counters collected?

SB17
Beginner
2,180 Views

Hi all,

Please help me clear up one uncertainty: how are MSR/PCM counters collected?

As I understand it, on Linux I can collect counters either through perf or through the msr driver, which allows reading and writing the MSR registers.

Do the collected counters show the number of events for each thread (counters bound to a thread), or the total number of events occurring in the device without binding to a thread (for example, the total number of load or store events across all threads of one processor/core)?

If the MSR counters are bound to a thread, does that mean that on a context switch the OS (or hardware) saves the MSR registers to some buffer, and then loads them back when the context is restored?

In theory, this method should add overhead to profiling.

Where am I wrong?

Sorry for my English.

0 Kudos
14 Replies
Patrick_F_Intel1
Employee

Hello Black,

Sorry to delay answering your question.

I'm not sure exactly what you are asking: whether about PCM specifically, or about perfmon MSRs in general. But here is an attempt at an answer:

1) Do counters show events per thread or events per device? It depends on the counter. The 'scope' of the counter tells you what area of the chip the counter can access. Some counters (like the general cpu counters) count events at the cpu level (or they can count at the core level). So to read these counters you would need to bind to the particular cpu whose counter you want to read. Other counters have 'processor' scope. That is, the value in the counter is the same no matter which cpu you read it from. Other counters might have uncore scope, or ring scope, etc.

2) when you switch context on a cpu, do you have to save the MSRs and load the MSRs back when you switch back in? This is an area with lots of discussion. Generally the OS doesn't save off the perfmon (hw performance counter related) MSRs and so the OS doesn't reload them. In general, if the perfmon counters are in use, they just stay running regardless of which thread gets swapped in/out of the cpu. You can tell the counters to only count ring0 events or ring3 events or both but you can't tell the counter to "only count for my thread".

Hope this helps,

Pat

SB17
Beginner

Thanks Pat

I assumed as much, but wanted to hear a professional opinion.

Bernard
Valued Contributor I

>>>they just stay running regardless of which thread gets swapped in/out of the cpu.>>>

On a heavily loaded system this could skew the results. I was thinking about boosting the priority of the currently executing thread (the one being profiled) to real-time priority in order to keep it pinned to the core until the measurement is over.

Patrick_F_Intel1
Employee

iliyapolak wrote:

>>>they just stay running regardless of which thread gets swapped in/out of the cpu.>>>

On a heavily loaded system this could skew the results. I was thinking about boosting the priority of the currently executing thread (the one being profiled) to real-time priority in order to keep it pinned to the core until the measurement is over.

Hello iliyapolak,

When you say 'it could skew the results', if by "it" you mean "the running counters" then this is not something to worry about. I don't think anyone has been able to show/measure any extra overhead from having the counters running. Any extra overhead comes from reading the counters. But the overhead of utilities like PCM is pretty low (probably 1-20 milliseconds per iteration... but it has been a while since I checked it).

For utilities like PCM which run in 'counting mode' (where you just read the counters after sleeping for 1 second or so), if the system is heavily loaded then usually the worst that happens is that PCM won't run exactly when you want it to... so you don't get exactly 1 second intervals for instance.

Utilities like VTune, which run in 'sampling mode' (where you take a performance monitoring interrupt (PMI) each time a counter overflows), can induce a lot of overhead if you sample too frequently. Usually 1000 PMI/second causes very small perturbation of a system. Usually when I run something like VTune (or perf in sampling mode) I measure the performance of my app with and without sampling to make sure I'm not modifying the performance of my app more than I intend.

Pat

Bernard
Valued Contributor I

No, I was not talking about the overhead of the measurement. I meant that the result will not be accurate because the counters are not pinned to any particular thread. When a thread is swapped out, the counter state is not saved by the OS, so the next ready thread will keep incrementing the counter.

Patrick_F_Intel1
Employee

Yes, if one doesn't keep track of from which cpu one is reading the counters, then one can get garbage results.

PCM and other utilities handle this by pinning to a specific cpu before they read the counters. This way we know, if we need to, say, get the difference of the current and previous value of the counter, that we are subtracting the correct cpu's counter.

Bernard
Valued Contributor I

Probably done by calling SetProcessAffinityMask on Windows.

SB17
Beginner

Sorry again for the delay in answering.

The question arose when I started thinking about whether profiling accounts for the noise of the operating system, device drivers, and system applications.

Thanks for the interesting answers.

Bernard
Valued Contributor I

Hi Black S

What do you mean by "noise of the operating system"?

SB17
Beginner

If we accumulate the number of loads and stores to memory, or the number of double-precision operations, then the operating system and system services should also contribute to the total number of events. It is clear that the percentage is very small.

McCalpinJohn
Honored Contributor III

One does have to be careful with using MSRs to access performance-related information because the overheads can be relatively large and the standard access mechanism (at least in Linux: /dev/cpu/*/msr/) has no API for reading lists of target registers with a single call to the driver. 

It is easy enough to run a case multiple times with different (known) amounts of "work" and subtract the counts to estimate the overheads, but it would be a lot of work to get a solid understanding of the intrinsic variability of the overhead in terms of all of the performance events that you might want to measure. PCI configuration space accesses and general MMIO accesses are possibly even worse than MSR accesses in terms of overhead, but I have been afraid to measure these. One result that I recall is an average overhead of something like 7 microseconds to read an MSR on the same chip where my process is running (using code based on "rdmsr.c" from msrtools-1.2) and 10 microseconds to read an MSR on the other chip in a two-socket system.

Note that each Xeon E5-2600 family processor chip has 83 performance counters defined in the uncore (if I added the numbers in Table 1-1 of the Xeon E5-2600 series Uncore Performance Monitoring guide correctly), with Table 1-2 showing that 41 of these are in MSR space and Table 1-3 showing that the remaining 42 are in PCI configuration space. Not all of the counters are likely to be useful in any single measurement scenario, but it is very easy to imagine wanting to read all 32 CBo counters and all 16 of the programmable IMC counters at once. With the existing kernel interface it would probably take O(1000 microseconds) per chip -- corresponding to about 25 million aggregate core cycles (8 cores * 3.1 GHz). That is an unpleasant amount of overhead for any methodology except whole-program monitoring.

Even building a dedicated kernel module to retrieve all of these counters in a single call would not provide a mechanism that anyone could reasonably call "lightweight" (though I will probably have to do it just to find out how bad it is).

So instead of being able to install in-line instrumentation in my codes when I need to access uncore counters, I have to build a specialized test code that I hope does the same thing as the application, but does it a programmable number of times so that I can apply whole-program monitoring to a set of extended executions. Obviously this requires a lot of work and is only practical if I already understand what the target code is doing. It would be much nicer if the uncore performance counters could be mapped to core performance counters and then read in user-space. My measurements of RDPMC overhead are in the 10's of cycles in user space -- much more practical than the 10's of thousands of cycles for driver calls to get MSR or PCI configuration space values.

Bernard
Valued Contributor I

>>> the operating system, system services should also contribute to the total number of events. It is clear that the percentage is very small>>>

Yes that is true.

Singh__Nikhilesh
Beginner

I have a doubt here, @John. From within the kernel, what is the most lightweight way to read a performance counter?

McCalpinJohn
Honored Contributor III

The lowest overhead for reading a performance counter (a few 10's of cycles) is RDPMC -- but it has to be run on the logical processor for which you want the counts.  If you are running general (unbound) kernel code, this is usually done using an interprocessor interrupt, which will have an overhead of thousands of cycles.   If you are running in a kernel thread that is already bound to the target logical processor, then the RDPMC instruction can be used directly.

For performance counters in the "uncore" that are read by MSRs, any core on the same die can execute the RDMSR instruction.  If you know that you are in a kernel thread bound to any of the cores on the target die, you can execute RDMSR directly -- otherwise you need to use an interprocessor interrupt.   Latency for executing the RDMSR instruction varies by the MSR number, but is typically a few hundred cycles for reading MSRs that are external to the core.

For performance counters in the "uncore" that are located in PCI Configuration Space, any core can execute an uncached, aligned, 32-bit load from the corresponding memory-mapped address.   These typically average a few hundred cycles for uncore devices on the same die.   I have not measured cross-socket latency explicitly in these cases.
