Solved: Measuring the fraction of time that memory has stalled the processor pipeline

Zhu_G_ · ‎07-09-2015

Hi Community!

I am trying to find the fraction of time the processor spent waiting for memory (including cache accesses and memory accesses) using the following events on my Westmere machine:

(RESOURCE_STALLS.STORE + RESOURCE_STALLS.LOAD) / CPU_CLK_UNHALTED.REF

However, the result of this equation seems rather unstable. It ranges from Xe-01 to Xe-04 (where X is a single digit from 1 to 9) for a simple benchmarking run.

Why is this? Is there a more accurate way to measure this?
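
For reference, here is a minimal sketch of how the ratio above could be computed from raw counter deltas. The function name and the sample counts are hypothetical and chosen only for illustration; they are not measurements from a real Westmere run, and the sketch does not show how the counters are read.

```python
def stall_fraction(load_stalls, store_stalls, ref_cycles):
    """Compute (RESOURCE_STALLS.LOAD + RESOURCE_STALLS.STORE) / CPU_CLK_UNHALTED.REF.

    All three arguments are counter deltas over the same measurement interval.
    """
    if ref_cycles <= 0:
        raise ValueError("ref_cycles must be positive")
    return (load_stalls + store_stalls) / ref_cycles


# Hypothetical (load, store, ref) counter deltas for two runs of a benchmark.
samples = [
    (1_200_000, 300_000, 10_000_000),
    (900, 100, 10_000_000),
]

fractions = [stall_fraction(*s) for s in samples]
for f in fractions:
    print(f"{f:.1e}")
```

With these made-up inputs the two runs differ by three orders of magnitude, which is the kind of spread described above.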

McCalpinJohn · ‎07-22-2015

I don't think that there is any "accurate" way to measure processor stalls due to waiting for memory. A big part of the problem is coming up with sufficiently precise definitions of "stall" and "waiting for memory".

"Stalls" can happen at many different places in the processor pipeline, including instruction fetch, instruction decode, instruction issue, instruction dispatch, instruction execution (including address translation), and instruction retirement. For a single piece of code, it is typically the case that stalls in most of these locations are results, not causes, so you have to understand the interaction of the code with the processor pipeline well enough to identify which (if any) of these places correspond to the actual cause of the stall.

Once you have found a specific place in the pipeline to look, you need to decide what to do with stalls that have multiple overlapping causes. This is quite common in practice -- you might have a 10-cycle stall while loading data from L2, but 6 of those cycles might be overlapped with some other stall condition (such as a dependent operation latency). So should you blame the memory for all 10 cycles, or just the 4 cycles that were not overlapped with another stall condition?
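
The attribution question can be made concrete with a small sketch. It assumes, purely for illustration, that each stall condition can be described as a set of cycle numbers; real hardware counters do not expose this, and the function name `attribute_stall` is hypothetical.

```python
def attribute_stall(memory_cycles, other_cycles):
    """Compare two ways of charging stall cycles to memory.

    memory_cycles: cycles during which the load from L2 is outstanding.
    other_cycles:  cycles during which some other stall condition is active.
    """
    memory = set(memory_cycles)
    other = set(other_cycles)
    return {
        "inclusive": len(memory),           # blame memory for every cycle
        "exclusive": len(memory - other),   # blame memory only when nothing else stalls
        "overlapped": len(memory & other),  # cycles with both conditions active
    }


# The example from the text: a 10-cycle L2 load stall, 6 cycles of which
# overlap with another stall condition.
result = attribute_stall(range(0, 10), range(4, 10))
print(result)
```

Both the inclusive and the exclusive number are defensible answers, which is why the definition has to be fixed before any counter ratio can be interpreted.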

The documentation in Volume 3 of the Software Developer's Manual says that the performance counters used above (RESOURCE_STALLS.LOAD and RESOURCE_STALLS.STORE) are related to the load and store buffers in the core. These buffers handle the interaction of memory references between the core and the L1 Data Cache. A different set of "Line Fill Buffers" is used to handle L1 Data Cache misses. A code can run out of buffers on either side of the L1 Data Cache, so you need to figure out which set of buffers is actually running out of entries first and ignore the other set.

Some of this is discussed in Chapter 2 and Appendix B of the Intel Optimization Reference Manual (document 248966).
