Sakthivel_S_
Beginner

PMU for multi threaded environment


I have a CentOS system with Linux kernel 3.10.0-327.22.2.el7.x86_64, powered by an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz.

To optimize my large application, I am planning to use the PMU to measure L1, L2, and L3 cache misses and branch prediction misses. I have read the related Intel documents, but I am unsure about the scenarios below. Could someone please clarify?

I just reset all the counters and configured the needed fields, then I do the following:

    /* Start the counters via a custom MSR driver (IOCTL_MSR_CMDS). */
    if (ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_start) == -1) {
        perror("ioctl msr_start failed");
        exit(1);
    }
    my_program();                 /* code region being measured */
    ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_stop);
    printf("L2 Hit:     %7lld\n", msr_stop[2].value);
    printf("L2 Hit all: %7lld\n", msr_stop[3].value);

Since this is general purpose multi threaded OS,

1. What will happen if my process is scheduled out while my_program() is running and is rescheduled onto another core?

2. What will happen if the process is scheduled out and later scheduled back onto the same core, but in the meantime some other process has reset the PMU counters?

Thanks

Sakthivel S OM


 


SergeyKostrov
Valued Contributor II
>> 1. What will happen if my process is scheduled out when my_program() is running, and scheduled to another core?
>> 2. What will happen if my process is scheduled out and scheduled back to the same core again, but meanwhile some other process has reset the PMU counters?

In general, in both cases, when processing migrates from Logical CPU 1 to Logical CPU 2 there will be a performance penalty, since all data and code currently fetched will be evicted from the L3, L2, and L1 caches. That performance hit can be significant, and processing times are always affected: 25% slower, 50% slower, or even more.

I'd like to warn you that what you're going to do is not an optimization but rather tuning, and I don't think your objectives can be achieved since, as you've mentioned, your application is a huge one. I think you need to follow a different path, a three-stage process:

- Clean up your code
- Optimize algorithms, code blocks, etc.
- Tune algorithms, code blocks, etc. (use VTune to see what is going on with the L3, L2, and L1 caches, etc.)
Sakthivel_S_
Beginner

Hi Sergey,

Yes, it is a kind of tuning task for the application. I have to measure L1, L2, and L3 cache misses and branch mispredictions using the PMU, and we will take further action based on the counter values (to minimize the misses if possible).

I understand the performance penalty when processing migrates between cores. But what will happen to my PMU counters, which are programmed for logical core X, if the process migrates to core Y, or is scheduled back to core X after some time due to the scheduling algorithm?

How can I make sure that we are reading correct values from the PMU counters?

Thanks

Sakthivel_S_
Beginner

Looking forward to your answer on this. Please clarify the requested details.

McCalpinJohn
Black Belt

If you want to use the performance counters without pinning your threads to specific cores, then you will need to use a software infrastructure that virtualizes the performance counters by process.   The "perf events" subsystem in Linux does this by default when you use the "perf stat" or "perf record" commands.

For inline code instrumentation you should be able to use the "perf events" API, but the documentation is minimal.  Some resources are available at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html

I do inline code instrumentation all the time using direct access to the MSRs, but always with the application threads bound to specific logical processors.  (You can use "pread()" on the /dev/cpu/*/msr device files to read the MSRs -- this may be a bit easier to read than IOCTL-based code.  The codes "rdmsr.c" and "wrmsr.c" from "msr-tools-1.3" provide excellent examples.)

jimdempseyatthecove
Black Belt

>>But what will happen on my PMUs which is programmed for logical core-X and if process migrated to core-Y or scheduled back to core-X after some times due to scheduling algorithms.

You will have to ask on the VTune forum for confirmation, but I believe, when you have VTune set up to sample a single process (as opposed to the whole system), that the sample counters follow the process as it is migrated from place to place. That said, as Sergey pointed out, migration will adversely affect L1/L2 cache hit/miss ratios, as will other processes affecting the L3 and the L2/L1, depending on whether the other process is running in a hardware thread that shares the L2 and/or L1 (HT siblings share the L1, and on some processor designs two cores may share an L2).

What I suggest you do (to improve usefulness of performance counters) is to affinity pin your process threads to a subset of the available logical processors. Take ones at the high end of the available logical cpus. And do your sampling runs when your system is lightly loaded. You may also need to adjust your workload size such that its working set (cache used per cache available) approximates that while under full use. This will help you investigate your algorithms under favorable conditions. BTW if other users are doing the same thing, then you will have to cooperate with them as to when and which system logical processors you will be using for testing purposes.

Jim Dempsey

SergeyKostrov
Valued Contributor II
>> ...How to make sure that we are accepting the correct values from PMU counters?

I would create a test case that simulates your real processing (at least some parts!) to make sure that the PMU counters do what you need. Take into account that in your huge application there is a possibility of getting the false impression that everything is correct, just by analysing the counters, because the data set being processed is small and fits into all Lx caches.
Dharmaray_K_
Beginner

John McCalpin wrote:

If you want to use the performance counters without pinning your threads to specific cores, then you will need to use a software infrastructure that virtualizes the performance counters by process.   The "perf events" subsystem in Linux does this by default when you use the "perf stat" or "perf record" commands.

For inline code instrumentation you should be able to use the "perf events" API, but the documentation is minimal.  Some resources are available at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html

I do inline code instrumentation all the time using direct access to the MSRs, but always with the application threads bound to specific logical processors.  (You can use "pread()" on the /dev/cpu/*/msr device files to read the MSRs -- this may be a bit easier to read than IOCTL-based code.  The codes "rdmsr.c" and "wrmsr.c" from "msr-tools-1.3" provide excellent examples.)

Thanks John. I also intend to do inline code instrumentation using direct reads/writes of the MSRs, and I also plan to pin my code to specific logical processors. But the question of pre-emption still remains: my process gets scheduled out and another process comes in.

Case A: the second process does not measure any counters. Will my counter measurements still hold good?

Case B: the second process also instruments and measures counters; let's say it programs IA32_PERFEVTSELx. Will my instrumentation be good enough?

In both cases, the counters should not count on behalf of the second process, and my counter values should be retained when my process comes back.

jimdempseyatthecove
Black Belt

>> I intend to inline code instrumentation using direct read/write to MSRs...The second process that comes in also instruments and measures counters

This is why you should consider using a driver-based management tool, such as VTune, which can work in conjunction with the O/S thread scheduler. Someone on the VTune development team should answer the question as to whether VTune counters are managed across context switches (and optionally indicate the number and duration of preemptions).

Also, if you do intend to manage the registers yourself (from within the application), then consider placing the counter clear and read operations around sections of code that are expected to run for less than the typical O/S time-slice quantum. Then sample statistically, building up a histogram (or record each set of samples into a buffer). At the end of the run, if you are not interested in incorporating the adverse interaction of other processes, you can disregard the samples that you deem to have been preempted.

Jim Dempsey

McCalpinJohn
Black Belt

If you use counters inline with a pinned process, then you will normally get counts for all activity on that logical processor during the interval.  This will include your pinned process plus any OS activity that happens to be scheduled on that logical processor plus any other processes that the OS may run on that logical processor during the interval.   I consider this to be a "good thing", because it provides extra information about the extent to which my process may have been perturbed by OS activity.

If the OS or another job scheduled by the OS on your logical processor programs the performance counter control registers or resets the counts, then your final values won't mean very much.    In our "tacc_stats" monitoring system, we read the performance counter control MSRs as well as the counts at each sample, so we can detect whether a user job has modified the performance counter programming.  (User jobs have permission to use the performance counters via the "perf events" subsystem, but I don't think it is possible for "perf events" to modify the counter programming and then restore it to the original "tacc_stats" values.)

There have been a number of approaches to reserving and sharing performance counters, including both software-only and combined hardware+software approaches, but at this point there is not a "standard" approach.  (It looks like Intel has a hardware-based approach using MSR 0x392 IA32_PERF_GLOBAL_INUSE, but I don't know what platforms support it.)
 


Sakthivel_S_
Beginner

Thanks Jim. I understand that we can use "perf events" for a whole process (by PID). But as John said, here I am trying inline instrumentation using direct access to the MSRs. I would like to know whether there is a provision to measure the performance of the application code block by block (inline) using VTune? Or please explain how VTune differs from, or improves on, other measurement tools (e.g., perf) in this respect.

jimdempseyatthecove
Black Belt

What I do know of VTune is that, when using event-based sampling, you can collect events on a process-by-process basis (as well as thread-by-thread within a process, with the newer versions), or on a system-wide basis. VTune uses a driver to control the sampling.

My assumption (educated guess) is that, in order to provide per-process sampling, the driver must be aware of thread context switches, summing accordingly on context-out and zeroing on context-in. A VTune expert could weigh in on this assumption.

Your application's self-monitoring cannot do this, other than by sampling statistically and discarding samples that appear to have been preempted. Example: if a particular section of (your) monitored code takes 1 ms to complete, and you periodically observe samples in the tens or hundreds of ms, then you might (with qualification) assume that those runs of the code were preempted. You might also want to disregard the first pass after an observed preemption, as it may experience inordinate/non-typical cache misses (this is something you cannot do with VTune).

Keep John's comment/advice in mind: you should have an interest in the performance of the program both without interference from the O/S and/or other processes, and under typical system load. An application best tuned for the former is not necessarily best tuned for the latter.

Jim Dempsey

 

McCalpinJohn
Black Belt

I could be wrong, but my understanding is that VTune always uses a sampling-based approach to performance monitoring, which is fundamentally inconsistent with the interval-based approach.   VTune has an API for "instrumentation" (https://software.intel.com/en-us/node/544199), but I think this is to provide additional ways to categorize/classify samples, not to provide a way to get interval measurements.

The "perf events" API (discussed at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html) provides the ability to do interval measurements while simultaneously supporting per-process virtualization and virtual 64-bit counters. The primary downside of this approach (from my admittedly narrow perspective) is that this virtualization means that my low-overhead user-mode RDPMC instructions no longer provide the full performance counter information. A kernel call (costing >>1000 cycles) is required to get the full 64-bit virtualized result, and that level of overhead limits my ability to test small pieces of code.

Sakthivel_S_
Beginner

Hi Team,

Thanks for your valuable earlier inputs; they helped me a lot in going ahead with my PMU programming. Now I need a clarification about cache misses and hits.

I am running my process in an environment that ensures thread pinning (it will always run on a particular logical core and never lose it).

Now I am trying to program the 4 available programmable hardware counters for L1 instruction hits, L1 instruction misses, L2 instruction hits, and L2 instruction misses, respectively. I expect the results to satisfy:

L1 misses + L1 prefetch requests = L2 misses + L2 hits

Since we don't know the number of L1 prefetch requests (or is there a way to know them?), can I at least expect L1 misses < L2 hits + L2 misses?

But I am always getting an L1 miss count much higher than L2 misses + L2 hits. Following is some sample output:

        L1_HIT:   9025863325
        L1_MISS:   391165758
        L2_HIT:    271365176
        L2_MISS:    32361408

My architecture type is Intel 06_2CH.

The values configured in the PMU control registers are 0x01410180, 0x01410280, 0x01411024, and 0x01412024 (PMC0, PMC1, PMC2, and PMC3).

Can you please help me get a correct understanding of this?

 

-Sakthivel S

McCalpinJohn
Black Belt

I don't have any specific experience with the instruction cache events (or a solid understanding of the instruction cache microarchitecture), but from the information at https://download.01.org/perfmon/IVB/IvyBridge_core_V18.json I see that the description for the ICACHE.MISSES event (Event 0x80, Umask 0x02) includes "instruction cache, streaming buffer, and victim cache misses".    The description of the L2_RQSTS.CODE_RD_* events only mention "instruction fetches".   Maybe this is intended to include "streaming buffer" and "victim cache" misses, but maybe it is not...   So one possible explanation is that the ICACHE.MISSES event includes hardware prefetches of instructions while the L2_RQSTS.CODE_RD_* events only count "demand" instruction fetches and not hardware prefetches of instructions.

It is probably possible to make some assumptions about how the L1 instruction cache prefetcher works and build some microbenchmarks that should (or should not) experience L1 instruction cache re-use and/or L1 instruction cache prefetches, and compare those results to what the counters report.  This would be a lot of work unless you are already fluent with assembly language coding.

Sakthivel_S_
Beginner

Thanks, Dr. McCalpin. I will try to explore it and get back if any clarification is needed. From my understanding, if a request misses in the instruction cache, it is counted by ICACHE.MISSES and then looks up the victim cache or streaming buffer; even if the request is resolved by those lookups (without reaching the L2 instruction lookup), ICACHE.MISSES still counts it as an L1I miss.

"the ICACHE.MISSES event (Event 0x80, Umask 0x02) includes 'instruction cache, streaming buffer, and victim cache misses'"

From the above sentence, can I say that every miss in the instruction cache is counted by ICACHE.MISSES, the request then goes to the streaming buffer and victim cache, and if it misses there as well, it is counted by ICACHE.MISSES again before it reaches the L2 cache? (That is, counted multiple times for the same request as it fails over multiple stages.) I am asking this because my L1 miss value is huge compared to L2 hit + L2 miss.

-Sakthivel S

McCalpinJohn
Black Belt

From the values above, the ICACHE.MISSES value is only about 29% higher than the sum of the two L2_RQSTS.CODE_READ_* events.   This is clearly not close, but it is not an unusually large difference either.

There are a number of possible explanations:

  • The ICACHE.MISSES event may increment every time the ICACHE makes a request to the L2 (even if that request is rejected by the L2), while the L2_RQSTS.CODE_READ_* events only increment when a request is accepted by the L2.   A 29% reject rate seems high, but it is not implausible.
  • The ICACHE.MISSES event may increment for "streaming buffer" or "victim cache" misses that don't increment the L2_RQSTS.CODE_READ_* events.   This may be a bug or a feature, depending on the (unknown) designer's (unknown) intent.
  • One or more of the events may simply be broken.  An event may increment when it should not increment or it may fail to increment when it should.  This can be either systematic or random, and can be due to errors in the core's instruction fetch mechanism, the "streaming buffer", the "victim cache", the ICACHE control logic, the L2 control logic, or the Performance Monitor Unit control logic.
Sakthivel_S_
Beginner

OK. Is there any way to verify the counter values? Before I try other measurements, I just want to get confirmation of my readings. Can you please help me with ideas on how to do that?

-Sakthivel S

McCalpinJohn
Black Belt

The only way to "verify" the counter values is to test them comprehensively.  This involves creating a lot of test cases for which you know the answers in advance and checking to see if the measurements are in agreement.   I have only vague ideas of how to do this for instruction cache accesses and misses.
