How to find the individual core L1 and L2 cache hits/misses in a multicore environment

Hemanth_K_
Beginner
2,313 Views

Scenario: Two processes are executing on two different cores of a processor. How can I measure the L1 and L2 cache hits and misses for each individual core, assuming Hyper-Threading is disabled? I believe the performance monitoring counters do not give me a per-core breakdown. Is there any way to measure the L1 and L2 cache hits and misses of each core individually?

0 Kudos
17 Replies
TimP
Honored Contributor III

Does filtering by thread help?

Bernard
Valued Contributor I

Did you try to use VTune?

Ks_Hemanth
Beginner

@Tim Prince: Filtering by thread might help, but I am not able to get any per-thread data. Can you please explain how to get the data by thread?

@iliyapolak: I am unable to use VTune on my system, as my platform is neither Windows nor Linux. But I can write a sample C program to read the PMC counters on the platform I am working on. Can you please suggest any other ways to obtain the cache data per core or per thread?

TimP
Honored Contributor III

As Ilya hinted, the filter by thread is a useful feature of Intel VTune.  oprofile should support --separate=thread, segregating the counts into separate files by threads.

McCalpinJohn
Honored Contributor III

The hardware performance counters in each processor core measure only events for that core.   The software might combine the counts, but the hardware counters do not (and cannot) combine cache hits across separate cores.

So the answer depends on what interface you are using to get to the performance counters, and on whether your two processes are "bound" to separate cores.

Hemanth_K_
Beginner

@John: Yes, the processes I am running are bound to separate cores: Process A runs only on Core 0 and Process B runs only on Core 1. Hyper-Threading is not enabled.

Coming to your suggestion: if I read the performance monitoring counters from the main thread of Process A, will I get the L1 and L2 cache hits/misses of that core only? Since Process A is bound to Core 0, its main thread is also bound to Core 0. Is my understanding correct?

@Tim: I am unable to integrate/use VTune on my platform. Is the VTune code available as open source, so that I can go through the code and adapt the per-thread PMC calculation for my functions?

Hemanth_K_
Beginner

@all: I want to monitor the following events through the PMCs, per core:

L1D_REPL

L2_RQSTS_LD_MISS

McCalpinJohn
Honored Contributor III

We need some clue about what software you are running (both the OS and the software interface to the performance counters that you are currently using) to offer any additional advice.

 

TimP
Honored Contributor III

Hemanth K. wrote:

@Tim: I am unable to integrate/use VTune on my platform. Is the VTune code available as open source, so that I can go through the code and adapt the per-thread PMC calculation for my functions?

oprofile is open source and might be usable on an OS that supports facilities equivalent to those of Linux.

I suppose PAPI might be interesting.

Bernard
Valued Contributor I

VTune is not an open source project. I think the only way to go is to access the PMC counters by writing your own routines.

Btw, what is your OS?

Bernard
Valued Contributor I

@Hemanth

I have a weird problem with the IDZ website: I cannot paste quoted text. I am referring to your post no. 7.

A thread can be scheduled to run on different cores; that is done by the OS scheduler unless affinity was set to a specific core. On Windows a thread can run for up to a few milliseconds in the optimal case. A process is not bound to any core, because a process is a kind of container; what is being executed is the thread.

Bernard
Valued Contributor I

A small correction to my sentence about processes on Windows. As I wrote in my previous post, a process is not directly executed by the CPU, but a process has an internal structure representation which is, of course, manipulated by the CPU.

Hemanth_K_
Beginner

@John D. McCalpin @iliyapolak: I am unable to disclose details of the operating system. Basically it is a real-time operating system that can restrict a process to execute on one particular core, and it has a separate scheduler for each core. In my case Hyper-Threading is disabled. So if I write the service routines in the main thread of Process A, which runs on Core 0, and configure and read the PMCs from this thread only, will I get the L1 and L2 cache miss/hit data for that particular core?

Baseline question: are performance monitoring counters available per logical core in any Intel processor after the Nehalem architecture?

I can write my own service routines for the process to read the PMCs directly.

@John D. McCalpin @iliyapolak @Tim Prince: thank you for taking the time to help me understand my problem and resolve it.

Patrick_F_Intel1
Employee

Hello Hemanth,

Just to be precise... We usually don't talk about 'logical cores'. The cores are physical. We do talk about 'logical threads' or sometimes 'logical hardware (HW) threads' (adding the hw to differentiate between software and hw threads) ... in particular if HT is enabled then there are 2 logical hw threads per physical core.

In any case, to answer your question, yes, performance monitor counters are available on each core on any Intel processor after nehalem. But you probably want to know more than that...

On recent (past year or two) Intel(r) Atom(tm) based CPUs, the L2 is shared between cores. According to my notes, on all "big core" CPUs (nehalem, sandy bridge, ivy bridge, haswell, etc) each core has its own L2 so you should be able to get 'per core' L1, L2 stats. If HT is enabled, you would need to know which 2 logical HW threads share a core and then sum the stats from each logical hw thread.

Now to get to what you really want to know... can you get L1 and L2 hit/miss stats... you can certainly get L1 and L2 events. Whether the events count exactly what you want is another matter. There are all kind of interactions between the L1, L2 and L3. Prefetchers can change hit/miss stats. You would need to look at the particular events for the type of CPU you have (say ivy bridge desktop cpu). If you have the same CPU type running windows or linux, then you can install VTune on that box and see what events are available for that CPU. Or you can get the same sort of event-availability info from linux perf (assuming you are running linux perf on the same target cpu). With this info you should be able to see the event code and umask you need to program into the PMC general counters. You need to be able to read/write MSRs (hopefully with some API that doesn't crash the OS if you read/write an invalid MSR or try to set a reserved bit in a valid MSR). And you need to know the MSRs to read and write.
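As a rough sketch of the "event code and umask" step described above: in the architectural performance monitoring interface, the event code occupies bits 0-7 of IA32_PERFEVTSELx, the umask occupies bits 8-15, and bits 16, 17 and 22 are the USR, OS and enable flags. The helper name pmc_evtsel is hypothetical. The example values (event 0x2E, umask 0x41) are the architectural LLC-miss event and are used only as a stand-in; the codes for model-specific events such as L1D_REPL or L2_RQSTS_LD_MISS have to be looked up for the specific CPU.

```c
#include <stdint.h>

/* Architectural MSR addresses: general counter 0 and its event-select register. */
#define IA32_PMC0        0x0C1u
#define IA32_PERFEVTSEL0 0x186u

/* Flag bits of IA32_PERFEVTSELx. */
#define EVTSEL_USR (1ull << 16)  /* count while running in user mode */
#define EVTSEL_OS  (1ull << 17)  /* count while running in kernel mode */
#define EVTSEL_EN  (1ull << 22)  /* enable the counter */

/* Hypothetical helper: pack an event code and unit mask into a
 * PERFEVTSEL value with the enable bit set. */
static uint64_t pmc_evtsel(uint8_t event, uint8_t umask,
                           int count_user, int count_os)
{
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;
    if (count_user)
        v |= EVTSEL_USR;
    if (count_os)
        v |= EVTSEL_OS;
    return v;
}
```

Under these assumptions, pmc_evtsel(0x2E, 0x41, 1, 1) yields 0x43412E. That value would be written to IA32_PERFEVTSEL0 on the core being measured, and the count would then accumulate in IA32_PMC0. On processors with architectural performance monitoring version 2 or later, the counter also has to be enabled in IA32_PERF_GLOBAL_CTRL.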

Hope this helps.

Pat

McCalpinJohn
Honored Contributor III

The core performance counters are "private" to each logical processor.  That is to say that each logical processor has its own set of PERFEVTSELx MSRs that control what events increment that logical processor's corresponding IA32_PMCx MSR.   These hardware-level features are described in great detail in Chapters 18, 19, and 35 of Volume 3 of the Intel 64 and IA32 Software Developer's Manual (Intel document 325384, revision 053, January 2015).

As Patrick Fay notes, this does not mean that the events that cause a counter to increment are specific to the logical processor doing the counting.  Most of the time they are, but there are exceptions.   The most obvious exception is if the system is running with HyperThreading enabled and bit 21 of a PERFEVTSEL MSR is set.   In this case you are explicitly requesting that the event be incremented for any thread sharing the same physical core.  (This can be useful or not, but it is nice to have the option available.)

Even when the events are specific to the logical processor doing the counting, that does not mean that the events are *independent* of the activity of other logical processors.  With HyperThreading enabled, the logical processors sharing a physical core will interfere with each other in many ways, causing many of the performance counts to change.  Even without HyperThreading, activity on other processors will affect cache hit/miss rates in shared caches, and may affect memory latency, available memory bandwidth, etc.

The hardware-centric notes above are the easy part.  The more difficult part is why I keep asking about the software.  The hardware instructions to read and write MSRs can only be executed in kernel mode, so if you are trying to monitor a user-space program the "service routines" used to program the counters must transition into the kernel.  While in the kernel, whatever driver code you are using has to ensure that the kernel code that reads or writes each of the relevant MSRs is running on the correct processor.   With Linux, there is a supported facility to do this through a set of device drivers named "/dev/cpu/*/msr" (where "*" is the logical processor number).   Other systems may not support such functionality, or may support the same functionality using entirely different mechanisms.   On systems that "support" performance counters, the driver software will make many assumptions about how the performance counters are "supposed to be" used, and it may be very difficult to use the hardware in any other way.  On Linux systems, for example, the default behavior is for the kernel to "virtualize" the counters by process.  This "virtualization" extends the 48-bit hardware counters to "virtual" 64-bit counters, but also (by default) only allows the counters to be incremented while the specific process being monitored is executing.  The virtualization also creates a massive increase in the overhead of reading performance counters -- from ~30-40 cycles at the hardware level to (typically) a few thousand cycles through the virtualized OS interface.   Note that this is not required by the hardware -- it is the consequence of decisions made by the authors of the software about how the counters are "supposed to be" used.
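For the Linux case mentioned above, a minimal sketch of reading one MSR through /dev/cpu/*/msr might look like the following. It assumes the msr driver is loaded and the caller has sufficient privileges; the function names msr_dev_path and msr_read are hypothetical. With this driver the MSR address is used as the file offset, and each read returns the 8-byte register value for the logical processor named in the path.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pread() */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Build the device path for one logical processor, e.g. "/dev/cpu/1/msr".
 * Returns 0 on success, -1 if the buffer is too small. */
static int msr_dev_path(char *buf, size_t len, unsigned cpu)
{
    int n = snprintf(buf, len, "/dev/cpu/%u/msr", cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read one 64-bit MSR from the given device file.
 * The MSR address doubles as the file offset. Returns 0 on success. */
static int msr_read(const char *path, uint32_t reg, uint64_t *value)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t got = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);
    return got == (ssize_t)sizeof *value ? 0 : -1;
}
```

Because the path selects the logical processor, reading register 0xC1 (IA32_PMC0) through the path built for CPU 1 returns the counter of CPU 1 regardless of where the calling process happens to be running. Writes would follow the same pattern with pwrite on a file opened for writing.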

If you have the ability to control the full software stack being used (including the kernel driver code), then you will have much more flexibility in how you use the hardware counters.   I typically avoid activating the Linux "perf" subsystem, and directly control the counters using the MSR interface.  At the application level I can avoid the majority of the overhead by binding processes to specific cores and using the RDPMC instruction to read the hardware counters directly.  This eliminates almost all the overhead and gives me direct access to everything that the hardware has counted on the specific logical processor where I am running.  Because I don't have a "virtualization" layer, it is my responsibility to ensure that the counters are read often enough to unambiguously detect and correct for wraparound of the 48-bit counters.
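The RDPMC read and the wraparound handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code actually used by the poster: it assumes GCC/Clang inline assembly on x86, that the OS has set CR4.PCE so RDPMC is permitted in user mode, and that the counters are 48 bits wide as stated above (CPUID leaf 0AH reports the actual width). The names pmc_read, pmc_delta and pmc_accum are hypothetical.

```c
#include <stdint.h>

#define PMC_WIDTH_BITS 48   /* assumed counter width, as quoted above */
#define PMC_MASK ((1ull << PMC_WIDTH_BITS) - 1)

#if defined(__x86_64__) || defined(__i386__)
/* Read general-purpose counter `index` with RDPMC
 * (ECX = counter index, result returned in EDX:EAX).
 * Faults in user mode unless CR4.PCE is set. */
static inline uint64_t pmc_read(uint32_t index)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return (((uint64_t)hi << 32) | lo) & PMC_MASK;
}
#endif

/* Events counted between two raw reads; correct as long as the
 * counter wrapped at most once between them. */
static uint64_t pmc_delta(uint64_t before, uint64_t after)
{
    return (after - before) & PMC_MASK;
}

/* Software accumulator that extends the 48-bit counter to 64 bits. */
struct pmc_accum {
    uint64_t last_raw;  /* previous raw counter value */
    uint64_t total;     /* running 64-bit event count */
};

static void pmc_accum_update(struct pmc_accum *a, uint64_t raw)
{
    a->total += pmc_delta(a->last_raw, raw);
    a->last_raw = raw;
}
```

A thread bound to one core would call pmc_read(0) periodically and feed the result to pmc_accum_update. The masked subtraction stays correct only if the counter wraps at most once between two consecutive reads, which is why the reads have to happen often enough.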

Bernard
Valued Contributor I

As Pat wrote, there are no logical cores. Put simply, a single core with HT has two sets of front-end state, each comprising the architectural registers, APIC and counters. The execution engine stacks (SIMD AVX/SSE integer and FP, and x87) are shared by the two hardware threads.

Hemanth_K_
Beginner

 

Thanks John D. McCalpin, Patrick Fay (Intel) and iliyapolak for the comprehensive description of the performance monitoring counters.

Now I am able to see the PMC counter values per process scheduled on each core. I used direct MSR reading from the execution threads to get the numbers. 
