Load_Latency performance counter ambiguity

Chronus_Taizen · 07-18-2018

In the May 2018 combined SDM, Chapter 19, Sections 2 and 6 list the performance counters for Skylake and Haswell, respectively.

Under section 2 you will find the following 8 events:

Event Number  Umask Value  Event Mask Mnemonic
CDH           01H          MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2
CDH           01H          MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4
CDH           01H          MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8
...
CDH           01H          MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256
CDH           01H          MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512

Their description reads: "Counts loads when the latency from first dispatch to completion is greater than <X> cycles." for the corresponding value of X: 2, 4, 8, etc. In particular, nothing in the description indicates that these counters measure randomly sampled memory loads. In fact, as stated, I would expect a precise count of these events, up to skid, in perf record.

Under section 6, among others, you will find:

Event Number  Umask Value  Event Mask Mnemonic             Description
CDH           01H          MEM_TRANS_RETIRED.LOAD_LATENCY  Randomly sampled loads whose latency is above a user defined threshold. [Specify threshold in MSR 3FAH]

My question is: can MEM_TRANS_RETIRED.LOAD_LATENCY be used to emulate the 8 performance counters listed above for Skylake, or are the semantics as stated in the descriptions correct, thus prohibiting this emulation by proxy?

I am aware that the Event Number and Umask are the same, but I am unsure whether the hardware implementation of these events is consistent across Haswell and Skylake. I would like to get an official answer from Intel.
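To make the proposed emulation concrete, here is a minimal Python sketch. It assumes the Linux perf `cpu` PMU accepts the `event`, `umask`, and `ldlat` format terms, with `ldlat` carrying the latency threshold. The helper names `load_latency_event` and `emulated_gt_name` are hypothetical and are not taken from the kernel or the SDM.

```python
def load_latency_event(threshold_cycles: int) -> str:
    """Build a perf event string for event 0xCD, umask 0x01 with a threshold.

    Assumption: the perf `cpu` PMU exposes `event`, `umask`, and `ldlat`
    format terms. This is an illustrative helper, not kernel code.
    """
    if threshold_cycles < 1:
        raise ValueError("threshold must be a positive cycle count")
    return f"cpu/event=0xcd,umask=0x1,ldlat={threshold_cycles}/"


def emulated_gt_name(threshold_cycles: int) -> str:
    """Name of the Skylake-style event this configuration would stand in for."""
    return f"MEM_TRANS_RETIRED.LOAD_LATENCY_GT_{threshold_cycles}"


if __name__ == "__main__":
    # Thresholds taken from the event names quoted above.
    for cycles in (4, 8, 256, 512):
        print(emulated_gt_name(cycles), "->", load_latency_event(cycles))
```

The sketch only shows that the fixed-threshold names can be expressed as one event plus a threshold. It says nothing about whether the resulting counts are exact or randomly sampled, which is the open question.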

Thank you.

Chronus_Taizen · 08-02-2018

But seriously, an answer would be nice.

Dmitry_R_Intel1 · 08-03-2018

To my knowledge the implementation for these latency events is similar on all microarchitectures - they randomly select loads to track.

One simple way to check is to collect the MEM_INST_RETIRED.ALL_LOADS_PS and MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 events at the same time. You should see that MEM_INST_RETIRED.ALL_LOADS_PS will have a much lower count.

Chronus_Taizen · 08-05-2018

The reason I am asking this seemingly trivial, pedantic, useless-trivia-sounding question is that the Linux kernel assumes the answer is "Yes, the semantics are the same and emulation as stated is correct," and it has code that relies on this.

The statement "implementation for these latency events is similar on all microarchitectures" is exactly the problem, which is why I am asking an extremely pedantic question. Your suggested solution is frustrating because it tells me that you did not really look into my question; that approach cannot possibly answer it.

The "GT" counters are not available on Haswell, so any suggestion involving them is off the table. I do not own a Skylake-based machine. Even if I did, the results would only tell me whether, on Skylake specifically, these two counters are counted from randomly sampled loads. That would be useful for telling Intel to make its descriptions more precise, one way or the other: either the "GT" description would have to change from implying "exact counts" to explicitly stating "random samples", or the retired load latency event would have to change from "random loads" to "exact counts". Based on what I've been told so far, there are no other logical options left.

This is a question for your engineers. After you have found out the answer, could you please, for the sake of all that is holy, correct, exact, precise and trustworthy, update the bloody manual?

Thank you.