In the May 2018 Combined SDM, Chapter 19, Section 2 and Section 6 list the performance counters for Skylake and Haswell, respectively.
Under section 2 you will find the following 8 events:
Event Number   Umask Value   Event Mask Mnemonic
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8
...
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512
Their description reads: "Counts loads when the latency from first dispatch to completion is greater than <X> cycles." for the corresponding value of X: 2, 4, 8, etc. In particular, nothing in the description indicates that these counters measure randomly sampled memory loads. In fact, as stated, I would expect a precise count of these events, up to skid, in perf record.
Under section 6, among others, you will find:
Event Number   Umask Value   Event Mask Mnemonic   Description
CDH 01H MEM_TRANS_RETIRED.LOAD_LATENCY Randomly sampled loads whose latency is above a user defined threshold. [Specify threshold in MSR 3F6H]
My question is: can "MEM_TRANS_RETIRED.LOAD_LATENCY" be used to emulate the 8 Skylake performance counters listed above, or are the semantics stated in the descriptions correct, which would rule out this emulation by proxy?
I am aware that the event number and umask are the same, but I am unsure whether the hardware implementation is consistent across Haswell and Skylake. I would like to get an official answer from Intel.
Thank you.
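For readers following along, here is a minimal Python sketch of why the encodings alone cannot settle the question. It packs the event number and umask quoted above into the raw code form that perf accepts, and spells the generic latency event with an explicit threshold through perf's `ldlat` format term. The helper names are hypothetical, and the assumption that a Skylake "GT" event is simply the generic event plus a threshold is exactly the point in question, not something this sketch confirms.

```python
# Event number and umask shared by every row quoted from the SDM tables.
EVENT_SELECT = 0xCD
UMASK = 0x01


def raw_event_code(event_select: int, umask: int) -> int:
    """Pack event select (bits 0-7) and unit mask (bits 8-15)."""
    return ((umask & 0xFF) << 8) | (event_select & 0xFF)


def perf_raw_name(event_select: int, umask: int) -> str:
    """Format the packed code in perf's raw-event notation (rUUEE)."""
    return f"r{raw_event_code(event_select, umask):04x}"


def ldlat_event_spec(threshold_cycles: int) -> str:
    """Hypothetical helper: generic latency event plus an explicit threshold.

    Assumes the threshold is passed through perf's `ldlat` term; whether
    this reproduces the Skylake GT semantics is the open question.
    """
    if threshold_cycles < 1:
        raise ValueError("threshold must be a positive cycle count")
    return f"cpu/event={EVENT_SELECT:#x},umask={UMASK:#x},ldlat={threshold_cycles}/"


if __name__ == "__main__":
    # All quoted rows collapse to the same raw code; only the threshold differs.
    print(perf_raw_name(EVENT_SELECT, UMASK))
    for threshold in (4, 512):
        print(ldlat_event_spec(threshold))
```

Since every row maps to the same raw code, any difference between "exact count" and "random sample" behaviour would have to come from the hardware, not from the encoding.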
But seriously, an answer would be nice.
To my knowledge the implementation for these latency events is similar on all microarchitectures - they randomly select loads to track.
One simple way to check is to collect the MEM_INST_RETIRED.ALL_LOADS_PS and MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 events at the same time. If the latency event only tracks randomly selected loads, MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 should show a much lower count than MEM_INST_RETIRED.ALL_LOADS_PS.
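The comparison suggested above can be reduced to a ratio of the two counts. The following is only a rough Python sketch: the function name is hypothetical, and the numbers in the usage lines are placeholders rather than measurements. A ratio far below 1 would be consistent with the latency event tracking only a subset of loads, but on its own it cannot distinguish random sampling from a workload in which few loads exceed the threshold.

```python
def latency_to_all_loads_ratio(latency_event_count: int, all_loads_count: int) -> float:
    """Return the latency-event count as a fraction of all retired loads.

    Both arguments are counts collected over the same run, e.g. for
    MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 and MEM_INST_RETIRED.ALL_LOADS_PS.
    """
    if latency_event_count < 0:
        raise ValueError("latency_event_count must be non-negative")
    if all_loads_count <= 0:
        raise ValueError("all_loads_count must be positive")
    return latency_event_count / all_loads_count


if __name__ == "__main__":
    # Placeholder counts for illustration only, not real measurements.
    ratio = latency_to_all_loads_ratio(latency_event_count=5, all_loads_count=1000)
    print(f"latency event / all loads = {ratio:.4f}")
```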
The reason I am asking about what may sound like trivial, pedantic trivia is that the Linux kernel assumes the answer to that question is "Yes, the semantics are the same and emulation as stated is correct," and it has code that relies on this.
The statement "implementation for these latency events is similar on all microarchitectures" is exactly the problem, which is why I am asking an extremely pedantic question. Your suggested solution is frustrating because it tells me that you did not really look into my question; that approach cannot possibly answer it.
The "GT" counters are not available on Haswell, so any suggestion involving their use is ruled out. I do not own a Skylake-based machine. Even if I did, the result would only tell me whether, on Skylake specifically, these two counters are counted from randomly sampled loads. That would be useful for telling Intel to make the descriptions more precise one way or the other: either the "GT" description would have to change from implying "exact counts" to explicitly stating "random samples", or the retired load latency event would have to change from "random loads" to "exact counts". Based on what I've been told so far, there are no other logical options left.
This is a question for your engineers. After you have found out the answer, could you please, for the sake of all that is holy, correct, exact, precise and trustworthy, update the bloody manual?
Thank you.