I am working on CMP scheduling and need to obtain some information about the behaviour of multi-threaded applications executed on Intel CPUs. One CPU considered in my study is the Intel Core 2 Duo. I have some questions and would be thankful if somebody could answer them.
1- Could you tell me how many performance counter registers exist in Core 2 Duo?
2- I want to know the L1 cache misses of each thread of a multi-threaded application for a given period of its runtime. Take a four-threaded application as an example: t0, t1, ... t3. My goal is to obtain the number of L1 cache misses for threads 0, 1, ... 3 which happen in the first second of the application's runtime. Could you tell me how this task should be done? By the way, I use the Linux Perf Tool to read performance counters.
3- As you know, the number of registers dedicated to reading performance counters is limited. Consider N programs (N larger than the number of performance counter registers) which should be monitored, and suppose we need the number of LLC misses and executed instructions of each program. In this situation, could you tell me how the performance counters are read for each program?
Sorry to bother you
1. Information on the available performance counters is included in Chapter 18 of Volume 3 of the Intel Software Developer's Manual (document 325384, revision 053, January 2015). In some cases the documentation will provide the information directly and in other cases it will point to specific queries and result fields using the CPUID instruction.
2. I don't know of an existing interface that will do this directly, so you might need to create a program using the Linux "perf events" interface to implement this specific functionality. If the four threads are not guaranteed to be running on different cores, then you need to combine whole-system monitoring with per-thread virtualization, and that may not be trivial.
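Before writing a custom tool, it may be worth trying the stock perf utility. A sketch, with several assumptions: `<PID>` is a placeholder for the target process, the generic `L1-dcache-load-misses` event alias is actually supported on your CPU (check `perf list`), and your perf build has the `--per-thread` option:

```shell
# Sketch (assumptions noted above): attach to a running multi-threaded
# process and count L1 data-cache load misses, broken out per thread,
# for the one second that "sleep 1" runs.
perf stat --per-thread -e L1-dcache-load-misses -p <PID> -- sleep 1
```

Note that this attaches to an already-running process; to capture exactly the first second of the application's runtime you would need to launch the program under perf and stop it after one second, or drive the "perf events" interface directly as suggested above.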
3. The standard mode of operation of the Linux perf infrastructure is to "virtualize" the counters by saving and restoring both the counter programming and the counter counts at any context switch. Thus each process appears to have access to all of the hardware counters, with the counters incrementing only when that process is running. This should be able to scale to thousands of processes with minimum overhead.
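For question 3 specifically, this virtualization means you can simply run each program under its own perf instance and let the kernel multiplex the hardware counters. A sketch (the program names are placeholders, and the event aliases are assumptions -- verify them with `perf list`):

```shell
# Sketch: one "perf stat" per program; the kernel saves/restores the
# counters at context switches, so each program gets its own totals
# even when more programs run than there are hardware counters.
perf stat -e LLC-load-misses,instructions ./program_a &
perf stat -e LLC-load-misses,instructions ./program_b &
wait
```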
There are two main ways to collect counters: 1) "counting mode" and 2) "sampling mode".
In counting mode you read the counters at some point and subtract the previous value from the current value. If 'some point' is a context switch then you have your 'counter values per process'. Maybe Dr. McCalpin is correct about the virtualizing of the counters on Linux... I just don't know. If the counters aren't virtualized then it will be very difficult to get the counters per process.
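You can do the periodic-read-and-subtract yourself with perf's interval printing. A sketch (assumptions: your perf version supports `-I`/`--interval-print`, and `<PID>` is a placeholder for the process you want to watch):

```shell
# Sketch: print running totals every 100 ms; "current minus previous"
# deltas per interval can then be computed from the output.
perf stat -I 100 -e L1-dcache-load-misses,instructions -p <PID>
```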
In sampling mode, the counters are set up such that they generate an interrupt every X occurrences. Say maybe every 100,000 L3 misses. Then the monitoring tool sees which process/thread was running when the counter generated an interrupt. If you generate enough samples, they average out to a very good model of who actually generated the L3 misses. Does this make sense? In this way you can model the whole system (and all the processes/threads) running on it.
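A sketch of sampling mode with the perf tool (assumption: the generic `LLC-load-misses` alias maps to an L3-miss event on your CPU; the 10-second window is arbitrary):

```shell
# Sample system-wide (-a), one sample per 100,000 LLC misses (-c 100000),
# for 10 seconds; the report then attributes samples to the
# processes/threads that were running when each interrupt fired.
perf record -a -e LLC-load-misses -c 100000 -- sleep 10
perf report --sort comm,pid
```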
Linux "perf" virtualizes the performance counters by default. This virtualization serves (at least?) three purposes:
- Periodically reading the counters and adding the deltas to a 64-bit value in memory allows the 48-bit counters to be expanded to (virtual) 64-bit counters so overflows won't be a problem. (Even if a counter increments by 32 per cycle, at 4 GHz it would take over 4 years to overflow in 64 bits.)
- The virtual counters are held in the process context, so they are saved and restored on context switches. This allows counts to follow a process if it is rescheduled to a different core, and allows some counter events to be useful on a time-shared system.
- Periodically saving and restoring counter values and counter programming is used to allow multiplexing of events -- allowing more events to be measured during a single run than would be possible with dedicated counters. The resulting accumulated counts are scaled up to provide estimates of what the counts would have been if each event had been counted for the duration of the process. The statistical error introduced is typically small, but it can be hard to identify cases that generate large errors (e.g., periodic behavior in the program that matches the average period of the multiplexing of the counters).
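Two quick checks of the points above, using the worst-case numbers from the text (48-bit counters, 4 GHz clock, 32 increments per cycle) plus the scale-up rule perf applies under multiplexing (raw count times time-enabled over time-running; the example numbers are purely illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
increments_per_second = 32 * 4_000_000_000  # worst case: 1.28e11 events/s

# A raw 48-bit counter wraps in well under an hour at this rate...
seconds_48 = 2**48 / increments_per_second
# ...while the virtualized 64-bit counter takes years.
years_64 = 2**64 / increments_per_second / SECONDS_PER_YEAR
print(f"48-bit wrap: ~{seconds_48:.0f} s; 64-bit wrap: ~{years_64:.1f} years")

# The multiplexing scale-up: estimate the full-run count for an event
# that was only scheduled on the PMU for part of the measurement.
def scale_multiplexed(raw_count, time_enabled_ns, time_running_ns):
    return raw_count * time_enabled_ns / time_running_ns

# An event counted for only half the run has its count doubled:
print(scale_multiplexed(1_000_000, 2_000_000_000, 1_000_000_000))  # 2000000.0
```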
Counting in "system-wide" mode requires either running as "root" or setting the kernel.perf_event_paranoid value to 0 (or a negative value). We use the latter on the "compute nodes" of our large systems because they are allocated to a single user at a time, so there is no concern with "leaking" information from other user's processes.
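For reference, a sketch of that sysctl setting (run as root; the value persists only until reboot unless it is also added to /etc/sysctl.conf or /etc/sysctl.d/):

```shell
# Relax the paranoia setting so non-root users can count system-wide.
sysctl -w kernel.perf_event_paranoid=0
```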
By default the Linux "perf" system blocks access to all "uncore" counters. Presumably this is because it is not possible (in general) to attribute events in the various uncore units to specific user processes. (There are a few exceptions to this in the Intel hardware, but not enough to change the basic validity of the concept.) This default setting can also be overridden by root (no surprise) or by setting the kernel.perf_event_paranoid value to zero (or a negative value).
The Linux "perf" system also supports sampling, but I have never learned how to use it in that mode.