My question is whether I can configure the hardware counters to increment only for a particular process ID, so that the kernel does not interfere with the results. This matters because when I run my application (which is what I would like to sample), the kernel also generates a significant number of samples for all the events I measure (memory misses and instructions).
From my understanding this is what VTune does (correct me if I'm wrong). Suppose I set the sample after value for L2 cache misses to 10000 for Application A, which is the application I want to optimize. When I run it on a Linux box, another unrelated Application X may also be running and generating significant misses. It is then possible that Application X generates 9000 L2 misses and Application A generates only 1000. But because the 10000th miss comes from Application A and the sample after value is 10000, the L2-miss sample count of Application A gets incremented. Which is wrong, right?
Generally speaking you are right: when the 10000th event happens in process A, all of the previous 9999 events are recorded in favour of process A, in spite of the fact that some of them definitely happened in other processes.
But sampling collection is statistical. In other words, if a considerably large number of samples fall on a particular region of code, it is statistically correct to say that most of the events also happened there.
A process called "calibration" is used to provide you with the correct "sample after" value.
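To make the statistical argument a bit more concrete, here is a toy simulation (not how VTune itself works). It assumes a single system-wide counter, two processes that alternate time slices with randomly sized bursts of events, and that each sample is attributed to whichever process is running when the counter overflows. The function names and the burst model are invented for illustration.

```python
import random


def simulate_sampling(events_per_slice, sample_after, num_slices, seed=0):
    """Toy model: processes take turns; a sample goes to whoever overflows the counter."""
    rng = random.Random(seed)
    names = list(events_per_slice)
    events = {name: 0 for name in names}
    samples = {name: 0 for name in names}
    counter = 0  # one shared counter, not filtered by process (assumption)
    for i in range(num_slices):
        name = names[i % len(names)]
        mean = events_per_slice[name]
        # Assumed burst model: each slice produces 50%..150% of the mean.
        burst = rng.randint(mean // 2, mean * 3 // 2)
        events[name] += burst
        counter += burst
        # Every time the shared counter passes the sample-after value,
        # the running process receives one sample.
        while counter >= sample_after:
            counter -= sample_after
            samples[name] += 1
    return events, samples


def share(counts, name):
    """Fraction of the total that belongs to the given process."""
    return counts[name] / sum(counts.values())


events, samples = simulate_sampling({"A": 1000, "X": 9000}, 10000, 20000)
print("A's share of events :", round(share(events, "A"), 3))
print("A's share of samples:", round(share(samples, "A"), 3))
```

In this model any single sample can be "unfair" in the way described above, but over thousands of samples each process's share of samples tracks its share of events.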
p379: On IA-32 processors that support HT, the performance counters and associated model-specific registers (MSRs) are extended to support HT. A subset of the performance monitoring events allows the event counts to be qualified by logical processor.
See also the "Using performance Metrics with Hyper-Threading Technologies" chapter in this manual.
Since we're on the topic, I wanted to confirm two other things with you guys...
(1) In sampling I presume this is what happens. Say we're sampling the L1 Load Miss Retired event. The counter gets incremented on every L1 load miss that retires. When the SAV (Sample After Value) is reached, the processor is interrupted and the VTune routine runs, which does two things: it increments its sample count for the L1 load miss event, and it looks at the currently active instruction pointer (to assign the event to that IP). Am I right? So basically there will always be a skew, right, since the IP is not going to be the one that caused the L1 load miss? Or does something else go on that lets VTune extract the exact instruction address of the L1 load that retires?
(2) With Hyper-Threading (as Elad says) you can count the events for each logical processor. I read through the manuals, and they say that the lower 4 ESCR bits control when the counters count (either when logical processor 0 is in USR/OS mode or when logical processor 1 is in USR/OS mode). What I don't understand is that this wording seems to mean that we never completely shut out either of the logical processors. So suppose we turn on counting only when logical processor 0 is in USR mode. Then I assume that whenever logical processor 0 is in USR mode all events are counted (even those caused by logical processor 1). Am I right?
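For reference, a small sketch of how those four qualifier bits combine into a mask. The bit positions (T1_USR = bit 0, T1_OS = bit 1, T0_USR = bit 2, T0_OS = bit 3) are my reading of the ESCR description in the manual and should be checked there. The helper names are made up, and the sketch says nothing about how the hardware attributes events between the two logical processors.

```python
# Assumed layout of the low four ESCR bits on HT-capable processors;
# verify against the manual before relying on it.
T1_USR = 1 << 0  # count while logical processor 1 runs at user level
T1_OS = 1 << 1   # count while logical processor 1 runs at OS level
T0_USR = 1 << 2  # count while logical processor 0 runs at user level
T0_OS = 1 << 3   # count while logical processor 0 runs at OS level

FLAG_NAMES = [("T0_OS", T0_OS), ("T0_USR", T0_USR),
              ("T1_OS", T1_OS), ("T1_USR", T1_USR)]


def escr_qualifier_bits(t0_usr=False, t0_os=False, t1_usr=False, t1_os=False):
    """Build the 4-bit qualifier mask (hypothetical helper)."""
    bits = 0
    if t0_usr:
        bits |= T0_USR
    if t0_os:
        bits |= T0_OS
    if t1_usr:
        bits |= T1_USR
    if t1_os:
        bits |= T1_OS
    return bits


def describe(bits):
    """List the names of the qualifier flags set in a mask."""
    return [name for name, flag in FLAG_NAMES if bits & flag]


only_lp0_user = escr_qualifier_bits(t0_usr=True)
print(bin(only_lp0_user), describe(only_lp0_user))
```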
Regarding your question (1): we are reading the same source, vol. 3, aren't we? There, in 15.9.5, "precise event-based sampling" is described as follows:
PEBS record is stored...whenever a counter overflow occurs. This record contains the arch. state of the processor (8 general purpose registers, EIP and EFLAGS) at the time of the event that caused the counter to overflow....
Is the L1-Load miss event you are referring to a precise event? If yes, then most likely you get an exact IP address.
Regarding your question (2): have you experimented with VTune and found any evidence that events happening on one logical processor somehow influence the number of events counted on the other? Personally I don't know exactly how HT is implemented. Do you? We can speculate... maybe some registers and units inside the CPU are just duplicated? Or is there some other smart buffer flushing? Anyway, if the OS considers the 2 HT logical CPUs to be 2 different CPUs, why shouldn't VTune users :-) ?
Yes, the VTune Analyzer's interrupt service routine (ISR) of the performance counter overflow interrupt normally just grabs the instruction return address off the stack and uses that as the causing instruction for the Event Based Sampling event. This means that in most cases there will be "skid" in identifying the instruction that caused the event. However, the Pentium 4 processor family and the Itanium processor family have certain events that are "precise" in which there is an extra register that the VTune Analyzer ISR can interrogate to exactly identify the causing instruction. In general, most Cache Miss and TLB miss events are precise. The way to tell if the event you want to use is precise is to go to the VTune Analyzer screen where events are selected, move the desired event into the "Selected Events" column, highlight the event and click on . If the event is precise there will be a enabling button in the edit window.
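The difference between a "skidded" sample and a precise one can be pictured with a tiny model. This is only an illustration: it assumes skid can be expressed as a fixed number of instructions that retire between the event and the interrupt being serviced, and the function name and addresses are invented.

```python
def sampled_address(addresses, causing_index, skid, precise):
    """Return the address a sample would be charged to in this toy model.

    addresses     -- instruction addresses in execution order
    causing_index -- index of the instruction that caused the overflow
    skid          -- instructions retired before the interrupt is serviced (assumed)
    precise       -- True if the hardware records the causing instruction itself
    """
    if precise:
        # Precise event: the causing instruction is identified directly.
        return addresses[causing_index]
    # Non-precise event: the ISR only sees where execution had got to.
    reported = min(causing_index + skid, len(addresses) - 1)
    return addresses[reported]


code = [0x1000, 0x1003, 0x1006, 0x100A, 0x100C]
print(hex(sampled_address(code, 1, 2, precise=False)))  # 0x100a
print(hex(sampled_address(code, 1, 2, precise=True)))   # 0x1003
```

With a non-precise event the sample lands a couple of instructions after the real culprit, which is why hot spots for such events should be read as "somewhere just before this address".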