The following command can be used to read raw hardware performance counter registers for an application:
perf stat -e r01A2,r08A2 <Application>
Event code 0xA2 with event mask 0x01 records resource-related stall cycles,
whereas event mask 0x08 records cycles stalled because no store buffer is available.
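As a reading aid for the raw event strings above: on Intel core PMUs the hex value after the `r` is the event-select register config, with the event code in bits 0-7 and the Umask in bits 8-15. A minimal sketch that splits such a string (the helper name `decode_raw_event` is hypothetical, and the sketch ignores the higher config bits such as cmask, inv, and edge):

```python
def decode_raw_event(spec):
    """Split a perf raw event such as 'r01A2' into (event_select, umask)."""
    if not spec.lower().startswith("r"):
        raise ValueError(f"not a raw event descriptor: {spec!r}")
    config = int(spec[1:], 16)          # hex digits after the leading 'r'
    event_select = config & 0xFF        # bits 0-7: event code
    umask = (config >> 8) & 0xFF        # bits 8-15: unit mask
    return event_select, umask


for spec in ("r01A2", "r08A2"):
    event, umask = decode_raw_event(spec)
    print(f"{spec}: event=0x{event:02X} umask=0x{umask:02X}")
# prints:
# r01A2: event=0xA2 umask=0x01
# r08A2: event=0xA2 umask=0x08
```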
My question: earlier Intel processors (the Broadwell family) supported event masks 0x02, 0x04, and 0x10 to measure a few more kinds of stalls, but these masks are not listed as supported on the Skylake family. Still, measuring on a Skylake-family processor with the command
perf stat -e r04A2,r02A2,r10A2 <Application>
returns some numbers.
I would like to know whether those numbers correspond to the kinds of stalls that were supported earlier.
Unfortunately, there are a lot of possibilities here....
- The other Umasks may have the same meaning, but
  - they are untested (or not comprehensively tested) on the new system, or
  - they are inaccurate (maybe a little, maybe a lot, maybe in some special cases, maybe most of the time) in the new system, or
  - it was realized that they were inaccurate in the previous system (and only the documentation for the new system was changed), or
  - because of microarchitecture/implementation changes, counting the same event may imply something different (and potentially not useful or easily describable) in the new system, or
  - because of planned future microarchitecture/implementation changes, the condition that the other Umasks measured has been moved to a different event, or
  - other cases that I can't think of right now...
- The other Umasks may have different meanings, but
  - the implementation of the new meaning is not intended for the public -- at least in this generation, or
  - the implementation of the new meaning is untested (or incompletely tested), or
  - the implementation of the new meaning is broken, or
  - other cases that I can't think of right now...
I could probably come up with specific examples of each of these cases from either my analysis of Intel processors or my prior experience in the design teams at SGI, IBM, and AMD, but then I would have to spend too much time thinking about confidentiality issues....
When I see cases like these, I try to develop directed tests that generate a known (or otherwise measurable) amount of activity for the event. I then test related events to see whether the counts with the various Umasks add up to the total count for this event, or to the total count for the same activity as measured by a different event.
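The bookkeeping half of that approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`parse_perf_csv`, `coverage`, `run_perf`) are hypothetical, the sketch assumes `perf stat -x,` writes its CSV to stderr with the count in the first field and the event name (echoed as typed) in the third, and it treats r01A2 as the "whole" that the other Umasks might add up to, which is itself something the test is meant to check rather than a given. Stall conditions can overlap, so a ratio far from 1.0 does not by itself prove a Umask is broken.

```python
import subprocess
import sys


def parse_perf_csv(text):
    """Parse `perf stat -x,` output into {event_name: count or None}."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or line.startswith("#"):
            continue
        value, event = fields[0], fields[2]
        # perf prints "<not counted>" or "<not supported>" instead of a number;
        # non-integer values are also stored as None.
        counts[event] = int(value) if value.isdigit() else None
    return counts


def coverage(counts, parts, whole):
    """Return sum(parts) / whole, or None if a count is missing or whole is 0."""
    values = [counts.get(name) for name in parts + [whole]]
    if any(v is None for v in values) or counts[whole] == 0:
        return None
    return sum(counts[name] for name in parts) / counts[whole]


def run_perf(events, command):
    """Run the directed test under perf stat and return the parsed counts."""
    result = subprocess.run(
        ["perf", "stat", "-x,", "-e", ",".join(events), "--"] + command,
        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
    return parse_perf_csv(result.stderr)


if __name__ == "__main__" and len(sys.argv) > 1:
    parts = ["r02A2", "r04A2", "r08A2", "r10A2"]   # Umasks under test
    whole = "r01A2"                                # assumed "any stall" count
    counts = run_perf(parts + [whole], sys.argv[1:])
    for name in parts + [whole]:
        print(name, counts.get(name))
    print("sum(parts) / whole =", coverage(counts, parts, whole))
```

The script would be invoked with the directed test as its arguments (the test binary itself is not shown here). Requesting more events than there are hardware counters makes perf multiplex them, so for a clean comparison it may be better to measure the events in smaller groups across repeated runs.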
Sometimes interpretation of the Umasks can be done quickly and unambiguously, for example:
- Umasks for different classes of AVX instructions on Haswell, reported at https://github.com/RRZE-HPC/likwid/wiki/FlopsHaswell, or
- Umasks for RS_FULL_STALL on KNL, reported at http://sites.utexas.edu/jdm4372/2018/01/22/a-peculiar-throughput-limitation-on-intels-xeon-phi-x200-knights-landing/.
Sometimes it can be quickly proven that the undocumented Umasks are broken (at least with respect to the activity for which the Umask used to be associated).
Sometimes the results don't make any sense, and you have to decide whether to keep on looking for a plausible & provable interpretation, or whether you need to move on to more productive work. I often choose poorly in these cases.