The VTune user guide (https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound.html) defines the L1 Bound metric as "how often machine was stalled without missing the L1 data cache."
I assumed this refers to the overhead of address translation (TLB). However, the next paragraph of the guide says, "this metric value may be highlighted due to DTLB Overhead or Cycles of 1 Port Utilized issues."
I understand why DTLB Overhead may contribute to this metric, but I fail to understand why Cycles of 1 Port Utilized (dependency issues) would affect L1 Bound.
If there are data dependencies, the hardware prefetcher would most likely have brought that data into the caches already (so, again, no L1 misses).
If it is only due to computation dependencies (the data is not ready yet), then it makes sense. I just want to confirm that this is indeed the case.
Thanks.
- Tags:
- VTune
Let me post the description of the Cycles of 1 Port Utilized metric, which gives some hints that bear on this question:
"This metric represents cycles fraction where the CPU executed total of 1 uop per cycle on all execution ports. This can be due to heavy data-dependency among software instructions, or oversubscribing a particular hardware resource. In some other cases with high Cycles of 1 Port Utilized and L1 Bound, this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful."
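The linked-list case mentioned in that description can be sketched roughly as below. This is only an illustration under my own assumptions, not code from the VTune documentation: the `Node` type and the functions `link_in_order`, `sum_list`, and `sum_array` are hypothetical names, and the list is assumed to be small enough to stay resident in the L1 data cache. In `sum_list`, the address of each load comes from the previous load, so the loads form a serial chain and each iteration has to wait for the L1 hit latency of the one before it, even though nothing misses the cache. In `sum_array`, the load addresses do not depend on earlier loads, so several loads can be in flight at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node type, used only for this illustration.
struct Node {
    std::uint64_t value;
    Node* next;
};

// Links the elements of `pool` into a singly linked list in index order,
// stores 1, 2, 3, ... in the nodes, and returns the head of the list.
Node* link_in_order(std::vector<Node>& pool) {
    assert(!pool.empty());
    for (std::size_t i = 0; i < pool.size(); ++i) {
        pool[i].value = i + 1;
        pool[i].next = (i + 1 < pool.size()) ? &pool[i + 1] : nullptr;
    }
    return &pool[0];
}

// Dependent loads: the load of p->next must complete before the next
// iteration can even compute its load address. With a small list this
// hits in L1 every time, yet the loop is paced by L1 load latency.
std::uint64_t sum_list(const Node* head) {
    std::uint64_t sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next) {
        sum += p->value;
    }
    return sum;
}

// Independent loads: each address is known from the loop index alone,
// so the core is free to overlap several loads.
std::uint64_t sum_array(const std::vector<std::uint64_t>& v) {
    std::uint64_t sum = 0;
    for (std::uint64_t x : v) {
        sum += x;
    }
    return sum;
}
```

If a loop shaped like `sum_list` dominates a profile, it would be a plausible candidate for showing both L1 Bound and Cycles of 1 Port Utilized at the same time, which is where the suggestion to look at the assembly comes in. The exact metric values will depend on the processor and on how the compiler optimizes each loop.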
Hi,
Thank you for accepting the solution provided by Dmitry.
If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Regards,
Raeesa