I want to understand what the metrics below, obtained from VTune's Microarchitecture Exploration analysis, really mean. What does "how often the CPU stalls" mean? Does it refer to CPU time, or to cycles? The metric descriptions are taken from the VTune user guide. Clarification would be extremely helpful; my rough reading is sketched after the list below.
- Front-End Bound: the fraction of pipeline slots where the processor's front-end undersupplies its back-end.
- Bad Speculation: the fraction of pipeline slots wasted due to incorrect speculation.
- Core Bound: how much of the bottleneck was due to non-memory core issues (a shortage of hardware compute resources and dependencies between the software's instructions are both categorized under Core Bound).
- L1 Bound: how often the machine was stalled without missing the L1 data cache.
- L2 Bound: how often the machine was stalled without missing the L2 cache.
- L3 Bound: how often the machine was stalled without missing the L3 cache, or was contending with a sibling core.
- Local DRAM: how often the CPU was stalled on loads from local memory.
- Remote Cache: how often the CPU was stalled on loads from a remote cache in another socket.
- Remote DRAM: how often the CPU was stalled on loads from remote memory.
- Store Bound: how often the CPU was stalled on store operations.
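My current rough reading, which I would like confirmed, is that these "how often stalled" metrics are fractions of core clock cycles rather than of CPU time, along the lines of the sketch below (the counter values are made up, and I assume the exact hardware events behind a metric like L1 Bound vary by microarchitecture):

```python
# Hypothetical counter values, for illustration only (not from a real run).
total_cycles = 1_000_000        # core clockticks in the sampled region
mem_stall_cycles = 400_000      # cycles where execution stalled on memory loads
l1_miss_stall_cycles = 250_000  # of those, cycles with an L1D miss outstanding

# An "L1 Bound"-style metric: stalled on memory, but the data was already in L1.
# Expressed as a fraction of cycles (clockticks), not of wall-clock CPU time.
l1_bound = (mem_stall_cycles - l1_miss_stall_cycles) / total_cycles
print(f"L1 Bound ~ {l1_bound:.1%}")  # -> 15.0%
```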
It would also be helpful to know what the Memory Bandwidth metric means. Also, how bad is the latency incurred once the memory bandwidth peak is reached?
This is a common question we get about Microarchitecture Exploration results. The Summary view shows the top four categories of the instruction pipeline: Front-End Bound, Back-End Bound, Bad Speculation, and Retiring. These are shown as percentages of pipeline slots, which should add up to 100%. Depending on the MUX (multiplexing) reliability, there may be some over/under.
There are other nested metrics also shown as a percentage of pipeline slots, and their values should add up to the parent value. For example, under Front-End Bound there are two metrics reported in pipeline slots, Front-End Latency and Front-End Bandwidth, and these should add up to the main Front-End Bound value.
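For intuition, here is a minimal sketch of that level-1 slot accounting, following the published top-down method for a 4-wide core. The event counts below are made up, the exact events and pipeline width vary by microarchitecture, and VTune computes all of this for you:

```python
# Illustrative raw event counts (made up, not from a real run); the event names
# in the comments follow the published top-down method for 4-wide Intel cores.
clk = 1_000_000                   # CPU_CLK_UNHALTED.THREAD
slots = 4 * clk                   # 4 issue slots per cycle on a 4-wide core

idq_uops_not_delivered = 600_000  # IDQ_UOPS_NOT_DELIVERED.CORE
uops_issued = 3_100_000           # UOPS_ISSUED.ANY
uops_retired = 2_800_000          # UOPS_RETIRED.RETIRE_SLOTS
recovery_cycles = 20_000          # INT_MISC.RECOVERY_CYCLES

frontend_bound = idq_uops_not_delivered / slots
retiring = uops_retired / slots
bad_speculation = (uops_issued - uops_retired + 4 * recovery_cycles) / slots
backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)

for name, value in [("Front-End Bound", frontend_bound),
                    ("Bad Speculation", bad_speculation),
                    ("Retiring", retiring),
                    ("Back-End Bound", backend_bound)]:
    print(f"{name:16} {value:.1%}")
# The four values sum to 100% by construction; in VTune, small deviations
# come from event multiplexing.
```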
The metrics measured in clockticks are less precise and are generally expressed as a percentage of the parent value. So if your workload is 50% Memory Bound, that may be caused 100% by DRAM accesses.
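As a hedged worked example of that percent-of-parent convention (the numbers are invented):

```python
# Made-up numbers illustrating the percent-of-parent convention.
memory_bound = 0.50   # 50% of pipeline slots are Memory Bound
dram_bound = 1.00     # 100% of those memory stalls are attributed to DRAM

# Roughly, the share of the whole workload limited by DRAM is the product:
overall_dram_share = memory_bound * dram_bound
print(f"~{overall_dram_share:.0%} of slots ultimately limited by DRAM")  # -> ~50%
```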
More information on the pipeline is here: https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2024-0/top-down-microarchitecture-analysis-method.html
[Screenshot: example Microarchitecture Exploration summary]
As for memory bandwidth: if it is high, then instructions may have trouble fetching data due to bandwidth limitations. You can see how this impacts latency by running a Memory Access analysis.
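To get a back-of-the-envelope sense of what counts as "high", you can compare the bandwidth VTune reports against the theoretical peak of your memory configuration. A minimal sketch, assuming a hypothetical two-channel DDR4-3200 system (replace with your own channel count and DIMM speed):

```python
# Back-of-the-envelope theoretical peak DRAM bandwidth.
# Assumed example configuration: 2 channels of DDR4-3200 (replace with yours).
channels = 2
transfers_per_sec = 3200e6   # 3200 MT/s
bytes_per_transfer = 8       # 64-bit memory channel

peak_gb_per_s = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"Theoretical peak ~ {peak_gb_per_s:.1f} GB/s")  # -> 51.2 GB/s

# Sustained bandwidth approaching this number usually comes with noticeably
# higher access latency; the Memory Access analysis mentioned above shows
# how latency behaves as bandwidth grows.
```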
My question is about the Bottom-up view of Microarchitecture Exploration.
I noticed that the subcategory percentages do not add up to the parent category. For instance, a function is 0% DRAM Bound, but its Memory Bandwidth is 13.7% and its Memory Latency is 67.7%, which add up to 81.4%.
I am not sure how to make sense of these numbers.
I was also hoping to know whether there is a way to find the CPU time spent stalling due to, for instance, memory bandwidth, i.e., is there a way to go from the 13.7% Memory Bandwidth value to the CPU time lost to memory bandwidth stalls in a function?
