My question is exclusively about how VTune works with precise events. The general difference between precise and ordinary events is clear to me.
VTune allows collecting both types of events (e.g. MEM_LOAD_UOPS_RETIRED.L1_HIT_PS and MEM_LOAD_UOPS_RETIRED.L1_HIT). What are the pros and cons of collecting one or the other (in general, not just for L1_HIT)? Is there a trade-off between skid and overhead?
Collecting both events simultaneously gives close but not equal results. Does that mean one of them is closer to the actual number of events and should be preferred when the event count is what matters?
The advantage of precise events is reduced skid. Thus, the results are more accurate as to *where* the event was triggered. Other than that, I am not aware of any difference in overhead or in the accuracy of the event counts.
Thank you for the answer.
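To make the point about skid concrete, here is a toy model in Python. It is only an illustration, not a description of how VTune or the hardware attributes samples: the function name `attribute_samples`, the instruction indices, and the fixed skid of 2 are all made up for the example, and real skid varies from sample to sample.

```python
from collections import Counter

def attribute_samples(event_sites, skid, program_length):
    """Toy model: report each sampled event `skid` instructions late.

    event_sites    -- instruction indices where the event really happened
    skid           -- hypothetical fixed delay, in instructions
    program_length -- number of instructions in the toy program
    """
    histogram = Counter()
    for site in event_sites:
        # Clamp so a late report never falls outside the toy program.
        reported = min(site + skid, program_length - 1)
        histogram[reported] += 1
    return histogram

# Three events at instruction 3 and one at instruction 7 (made-up data).
events = [3, 3, 3, 7]
precise = attribute_samples(events, skid=0, program_length=10)
ordinary = attribute_samples(events, skid=2, program_length=10)

print("precise :", dict(precise))
print("ordinary:", dict(ordinary))
```

In this model both runs report the same total number of samples; only the instruction they are charged to differs, which matches the statement that the benefit is in location rather than in counts.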
MrAnderson (Intel) wrote:
The advantage of precise events is reduced skid.
Is that a general statement, or a confirmation that VTune takes advantage of the precise version of an event whenever one is available?
If there are no disadvantages to using _PS events and the benefit is reduced skid, then what is the point of VTune offering the ordinary events alongside the precise ones? Wouldn't it be more user-friendly to expose only the one that solves the problem better and hide the other? That would avoid confusion.
When I was looking through the VTune database tables for the Sandy Bridge EP I noticed that there appears to be no difference between the programming of the "precise" and "ordinary" versions of most events (with one caveat discussed below). Is there a difference?
The one consistent difference I saw was that the precise versions of the events were always listed as being limited to counters 0,1,2,3 (or fewer), even when the "ordinary" version of the event listed counters 0,1,2,3,4,5,6,7 as being usable. This difference is also apparent in the event definition files for the Ivy Bridge and Haswell processors. Is this meaningful?
To answer my own question: yes, there is a difference between "precise" events and their "ordinary" versions. The bits that go into the PerfEvtSel registers are the same, but for a "precise" event the tool also programs an additional MSR (IA32_PEBS_ENABLE), which causes the hardware to capture a lot of extra information and place it in the "Debug Store" area. This extra information may be useless or may be incredibly useful, depending on what you are trying to analyze. (I don't typically use sampling-based performance monitoring methodologies, so I have not fiddled with this myself.)
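As a rough sketch of what that means in terms of register values, the Python below builds the PerfEvtSel encoding for MEM_LOAD_UOPS_RETIRED.L1_HIT (event 0xD1, umask 0x01 on Sandy Bridge) and lists the extra IA32_PEBS_ENABLE write for the precise variant. The field positions and MSR addresses are taken from the SDM; the helper names (`perfevtsel_value`, `pebs_enable_mask`, `plan_msr_writes`) are invented for illustration, only the event, umask, USR, OS and EN fields are modelled, and this is not how VTune's driver is actually written. It only computes values and never touches an MSR.

```python
# MSR addresses as documented in the Intel SDM.
IA32_PERFEVTSEL0 = 0x186   # PERFEVTSELn is at 0x186 + n
IA32_PEBS_ENABLE = 0x3F1

# Subset of the IA32_PERFEVTSELx fields used in this sketch.
USR_BIT = 1 << 16          # count in user mode
OS_BIT = 1 << 17           # count in kernel mode
EN_BIT = 1 << 22           # enable the counter

def perfevtsel_value(event, umask):
    """Encode event select (bits 0-7) and umask (bits 8-15), plus USR/OS/EN."""
    return (event & 0xFF) | ((umask & 0xFF) << 8) | USR_BIT | OS_BIT | EN_BIT

def pebs_enable_mask(counter):
    """IA32_PEBS_ENABLE has per-counter enable bits only for PMC0..PMC3."""
    if not 0 <= counter <= 3:
        raise ValueError("PEBS can only be enabled on counters 0-3")
    return 1 << counter

def plan_msr_writes(event, umask, counter, precise):
    """Return the (msr_address, value) pairs this sketch would program."""
    if not 0 <= counter <= 7:
        raise ValueError("counter must be in 0-7")
    writes = [(IA32_PERFEVTSEL0 + counter, perfevtsel_value(event, umask))]
    if precise:
        # The only addition for the precise variant in this model.
        writes.append((IA32_PEBS_ENABLE, pebs_enable_mask(counter)))
    return writes

for precise in (False, True):
    label = "L1_HIT_PS" if precise else "L1_HIT"
    for msr, value in plan_msr_writes(0xD1, 0x01, counter=2, precise=precise):
        print(f"{label:10s} MSR {msr:#05x} <- {value:#x}")
```

Because the per-counter PEBS enable bits exist only for PMC0 to PMC3, this also suggests why the event files restrict the precise versions to counters 0,1,2,3.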
The documentation for PEBS is spread across many subsections of Chapter 18 of Volume 3 of the Intel SW Developer's Manual. I found it necessary to read the descriptions in chronological order (definitely not the same as the order of the sections in the chapter) to understand the terminology and descriptions. The "Debug Store" area is described in section 17.4, and some familiarity with it is also helpful in understanding the mechanisms by which the PEBS information is moved around.