Solved: What does MEM_UOPS_RETIRED:ALL_LOADS represent on Broadwell?

Jin__Chao · ‎11-07-2017

I am trying to figure out how much data is read from and written to memory using MEM_UOPS_RETIRED:ALL_LOADS, but I am not sure what MEM_UOPS_RETIRED:ALL_LOADS represents exactly on Broadwell.

Assuming MEM_UOPS_RETIRED:ALL_LOADS = 5,543,619,579, how much data is actually transferred between cache and memory?

Is it 5,543,619,579 bytes, 5,543,619,579 x 64 bytes, or something else?

Thank you very much in advance! Jin

McCalpinJohn · ‎11-08-2017

There is no way to get the amount of data traffic from the MEM_UOPS_RETIRED events because they only count the number of accesses and not the size of each access.

Even if you knew that all loads were the same size, the specific event MEM_UOPS_RETIRED.ALL_LOADS would only tell you the amount of data loaded from the L1 Data Cache to the core. If you want the amount of data transferred from the DRAM memory to the caches, the most reliable measurements will come from the memory controller counters in the uncore. These can be significantly less convenient to use, depending on your hardware and software environment.

You can get an approximation to the amount of data loaded from memory to the caches using the OFFCORE_RESPONSE event. This is a core hardware performance counter event, but it requires programming an additional register to specify exactly what you want to count. Programming this extra register requires software support from your OS, and understanding which bit fields need to be set is quite a challenge. The best way to figure out how to use these events is to start with the examples provided for your processor at https://download.01.org/perfmon/ or in the tables for your processor in Chapter 19 of Volume 3 of the Intel Architectures Software Developer's Manual (Intel document 325384). The bits in the auxiliary off-core response register are described in the sections of Chapter 18 (in the same document) that have "off-core response" in the title. Understanding how to use these events typically requires both the explanation in Chapter 18 and the examples in Chapter 19 (or at https://download.01.org/perfmon/).

Jin__Chao · ‎11-08-2017

Thank you for your quick advice!

I have a further question on OFFCORE_RESPONSE events.

I have seen some people calculate memory bandwidth utilization for Ivy Bridge using

64 * (OFFCORE_RESPONSE_0:L3_MISS_LOCAL + OFFCORE_RESPONSE_0:L3_MISS_REMOTE) / time

Does this formula still work on Broadwell?

McCalpinJohn · ‎11-09-2017

I don't see any place where events with those exact names are defined....

The OFFCORE_RESPONSE event should provide a good estimate of the DRAM read traffic by:

setting the "request type" bits for demand data reads, demand data RFOs, demand Ifetch, prefetch data read, prefetch RFO, prefetch L3 data read, prefetch L3 RFO,
setting the "supplier information" bits for local DRAM, L3 miss to remote DRAM, and "No Supplier Info available",
setting the "snoop response" bits for "snoop none", "snoop not needed", "snoop miss", and "snoop no forward".

These are all described in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual, in the section on Haswell processors. Be sure to note the difference in the supplier information bits for the Haswell client and Haswell Xeon E5 processors. I did not see anything that suggested that Broadwell offcore response events are different than on Haswell.
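Once an OFFCORE_RESPONSE event has been programmed along these lines, converting its count to a read-bandwidth estimate is simple arithmetic. The sketch below is only an illustration: it assumes each counted response corresponds to one 64-byte cache line read from DRAM, the function name and the sample numbers are hypothetical, and the count itself still has to come from your own tool. It covers reads only, not writebacks.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted response = one 64-byte cache line


def dram_read_bandwidth(offcore_count: int, elapsed_seconds: float) -> float:
    """Estimate DRAM read bandwidth in bytes per second.

    offcore_count: value of the OFFCORE_RESPONSE event programmed for
                   L3 misses served by DRAM (hypothetical input).
    elapsed_seconds: wall-clock time of the measured interval.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return CACHE_LINE_BYTES * offcore_count / elapsed_seconds


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    count = 1_000_000_000
    seconds = 2.0
    print(f"{dram_read_bandwidth(count, seconds) / 1e9:.1f} GB/s")
```

Note that this is not valid for MEM_UOPS_RETIRED counts, which record accesses of varying size served by the L1 Data Cache rather than cache-line transfers from DRAM.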

Jin__Chao · ‎11-09-2017

Many thanks, John!

I am going through the developer manual now.

The final question in this thread: how do I calculate how many FLOPs are executed?

I assume the answer is to add the following counters, each weighted by its width?

FP_ARITH_INST_RETIRED.SCALAR_DOUBLE
FP_ARITH_INST_RETIRED.SCALAR_SINGLE
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE
FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE
FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE
FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE
FP_ARITH_INST_RETIRED.SCALAR
FP_ARITH_INST_RETIRED.PACKED
FP_ARITH_INST_RETIRED.SINGLE
FP_ARITH_INST_RETIRED.DOUBLE

I am actually trying to calculate Operational Intensity (FLOPs/Byte) for applications.

McCalpinJohn · ‎11-10-2017

The "normal" FLOP counts for these events are given in Chapter 19 of Volume 3 of the Intel Architectures SW Developer's Manual or in the performance counter event listings at https://download.01.org/perfmon/SKX/skylakex_core_v1.06.json

The counters are set up so that an FMA instruction will increment the counter twice, so multiplying by the width will give the expected FLOPS value. The downside of this convention is that it makes it a bit harder to determine arithmetic intensity in terms of instruction counts -- you will need to re-compile with the "-no-fma" flag and look at the difference in counts between the original and no-fma cases to determine how many FMA instructions were used.
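As a rough sketch of the "multiply by the width" calculation, the snippet below weights each counter by the number of elements per vector (1 for scalar, 2 or 4 for 128-bit, 4 or 8 for 256-bit) and then divides by bytes moved to get operational intensity. It assumes the four aggregate events (SCALAR, PACKED, SINGLE, DOUBLE) overlap the six specific ones, so only the six are summed. The helper names and the sample counts are hypothetical.

```python
# Elements per counter increment for each width-specific event.
# Assumption: FMA already increments the counter twice, so no extra factor.
FLOPS_PER_INCREMENT = {
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1,
    "FP_ARITH_INST_RETIRED.SCALAR_SINGLE": 1,
    "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE": 2,
    "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE": 4,
    "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 4,
    "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE": 8,
}


def total_flops(counts: dict) -> int:
    """Sum the width-weighted counts; events missing from counts are treated as 0."""
    return sum(width * counts.get(name, 0)
               for name, width in FLOPS_PER_INCREMENT.items())


def operational_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


if __name__ == "__main__":
    # Made-up counter values, purely for illustration.
    counts = {
        "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 100,
        "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 1000,
    }
    flops = total_flops(counts)
    bytes_moved = 64 * 50  # e.g. 50 cache lines of DRAM traffic
    print(flops, operational_intensity(flops, bytes_moved))
```

The byte count in the denominator should come from a DRAM traffic measurement (uncore memory controller counters or an OFFCORE_RESPONSE estimate), not from MEM_UOPS_RETIRED.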

ZWang45 · ‎11-13-2017

Another option is Intel PCM (https://github.com/opcm/pcm), which directly reads the performance counters at the memory controller. I have tested PCM on Broadwell, and the numbers are accurate.