Hi,
I am learning how to use TMA to profile my program.
After reading a lot of material, I understand that Intel TMA divides the sampled cycles into three categories (leaving branch prediction aside for simplicity): retirement, frontend stall, and backend stall. On Ice Lake CPUs, the corresponding events are UOPS_RETIRED.SLOTS, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE, and TOPDOWN.BACKEND_BOUND_SLOTS.
However, I can't find any materials that explain how these three parts are actually counted in the processor pipeline.
Take the following 4-stage pipeline diagram as an example, assuming the pipeline width is 1:
Cycle   1  2  3  4  5  6  7  8  9  10
Inst1:  F  D  E  R
Inst2:     F  D  E  E  E  R
Inst3:        F  D  -  -  E  R
Inst4:           F  F  -  D  E  R
Inst5:              -  F  -  D  E  R
What should the values for retirement, frontend_stall, and backend_stall be?
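One way to reason about the diagram is with a toy model. This is only a sketch under stated assumptions, not a description of how the hardware counts. It assumes each instruction starts one cycle after the previous one (the only staggering that fits the 10 cycles shown), that the single slot per cycle is accounted at the D-to-E handoff, and that a slot is "backend" when an instruction has finished D but cannot enter E, and "frontend" when nothing is ready to issue. The names `DIAGRAM` and `classify` are hypothetical.

```python
# Toy model only: one issue slot per cycle, accounted at the D -> E boundary.
# Each entry is (first cycle, stage string); '-' means the instruction waits.
DIAGRAM = {
    "Inst1": (1, "FDER"),
    "Inst2": (2, "FDEEER"),
    "Inst3": (3, "FD--ER"),
    "Inst4": (4, "FF-DER"),
    "Inst5": (5, "-F-DER"),
}
N_CYCLES = 10


def classify(diagram, n_cycles):
    """Assign each cycle's single slot to one category (assumed rules)."""
    counts = {"retiring": 0, "frontend_bound": 0, "backend_bound": 0}
    for cycle in range(1, n_cycles + 1):
        issued = False   # some instruction enters E for the first time
        waiting = False  # some instruction finished D but cannot enter E
        for start, stages in diagram.values():
            first_e = start + stages.index("E")
            last_d = start + stages.rindex("D")
            if cycle == first_e:
                issued = True
            elif last_d < cycle < first_e:
                waiting = True
        if issued:
            counts["retiring"] += 1        # every instruction here retires
        elif waiting:
            counts["backend_bound"] += 1   # ready uop blocked by a busy E stage
        else:
            counts["frontend_bound"] += 1  # nothing was ready to issue
    return counts


print(classify(DIAGRAM, N_CYCLES))
# {'retiring': 5, 'frontend_bound': 3, 'backend_bound': 2}
```

Under these assumptions the fetch delays of Inst4 and Inst5 do not show up as frontend slots, because they overlap with the cycles in which Inst3 is already blocked behind Inst2. The three frontend slots are the pipeline fill (cycles 1 and 2) and the drain (cycle 10). A different choice of accounting point would give different numbers.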
All TMA metrics are calculated from PMU events. You can refer to the perfmon repository below:
https://github.com/intel/perfmon
For details on how the events you mentioned are counted, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
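As a rough illustration of how level-1 metrics can be derived from raw counts, here is a minimal sketch. It assumes the total slot count comes from an event such as TOPDOWN.SLOTS and that the frontend input is a slot-based count (for example IDQ_UOPS_NOT_DELIVERED.CORE) rather than a cycle-based one. The metric definitions published in the perfmon repository include correction terms that are omitted here. The function name `tma_level1` and the sample numbers are hypothetical.

```python
def tma_level1(slots, retired_slots, fe_undelivered_slots, be_bound_slots):
    """Simplified level-1 top-down breakdown as fractions of all slots.

    Assumed inputs (raw counter values over the same interval):
      slots                 total issue slots
      retired_slots         slots used by uops that retired
      fe_undelivered_slots  slots the frontend left empty
      be_bound_slots        slots lost to backend stalls
    """
    if slots <= 0:
        raise ValueError("slots must be positive")
    retiring = retired_slots / slots
    frontend = fe_undelivered_slots / slots
    backend = be_bound_slots / slots
    # Whatever is left over is attributed to bad speculation.
    bad_spec = max(0.0, 1.0 - (retiring + frontend + backend))
    return {
        "retiring": retiring,
        "frontend_bound": frontend,
        "backend_bound": backend,
        "bad_speculation": bad_spec,
    }


# Made-up counter values, for illustration only.
for name, value in tma_level1(1000, 400, 200, 300).items():
    print(f"{name}: {value:.1%}")
```

The four fractions sum to 1 by construction, which is why the simplified three-way view in the question only holds when bad speculation is negligible.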
