I wish to show that a certain optimization makes a program stall less because of improved DRAM row-buffer locality. I looked through all the VTune metrics, and there are several that describe memory-bandwidth stalls, memory-bound, store-bound, and so on.
I'm not sure which metric to use for my purpose. Any comments will be helpful.
On the server processors, the performance counters in the IMC (integrated memory controller) support events that can be used to measure open-page and closed-page accesses. The nomenclature differs a bit from what I have seen elsewhere in the industry, but it is not hard to translate.
In the Xeon systems at the Texas Advanced Computing Center, we program the four IMC performance counters on each DDR4 channel to measure:
- CAS_COUNT.RD -- all DRAM read accesses
- CAS_COUNT.WR -- all DRAM write accesses
- ACT_COUNT.ALL -- all DRAM ACTIVATE commands
- PRE_COUNT.PAGE_MISS -- all DRAM pages closed due to row conflict
The formulas for converting these four values to the page hit and miss rates are included in the uncore performance monitoring reference manual for each server processor.
- Page Conflict ratio = PRE_COUNT.PAGE_MISS / (CAS_COUNT.RD + CAS_COUNT.WR)
- Page Empty ratio = (ACT_COUNT.ALL - PRE_COUNT.PAGE_MISS) / (CAS_COUNT.RD + CAS_COUNT.WR)
- Page Hit ratio = 1 - Page Conflict ratio - Page Empty ratio
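As a sanity check, the three formulas can be wrapped in a small helper. The counter values below are invented purely for illustration; real values come from reading the four IMC counters over a measurement interval on each channel.

```python
# Sketch: converting the four raw per-channel IMC counts into page-status
# ratios, following the formulas above. The sample counts are made up for
# illustration, not measured.

def page_ratios(cas_rd, cas_wr, act_all, pre_page_miss):
    """Return (hit, empty, conflict) ratios for one DRAM channel."""
    cas_total = cas_rd + cas_wr                    # all read + write CAS commands
    conflict = pre_page_miss / cas_total           # access forced an open row to close
    empty = (act_all - pre_page_miss) / cas_total  # access found the bank precharged
    hit = 1.0 - conflict - empty                   # access found its row already open
    return hit, empty, conflict

# Hypothetical counts for one channel over a measurement interval:
hit, empty, conflict = page_ratios(
    cas_rd=8_000_000, cas_wr=2_000_000,
    act_all=4_000_000, pre_page_miss=1_000_000,
)
print(f"hit={hit:.2f} empty={empty:.2f} conflict={conflict:.2f}")
```

Note that every ACTIVATE that is not preceded by a row-conflict precharge counts as a page-empty access, which is why the Page Empty numerator subtracts PRE_COUNT.PAGE_MISS from ACT_COUNT.ALL.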
These counters appear to be reliable on all the systems I have tested, but finding useful information in the values is challenging.