Solved: VTune: Relation between "Cycles of 1 port utilized" and "L1 Bound"

HarshVardhanKumar · ‎08-15-2021

The VTune user guide (https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound.html) defines L1 Bound as "how often machine was stalled without missing the L1 data cache."

I assumed this meant the overhead due to the TLB (address translation). However, the next paragraph of the guide says, "this metric value may be highlighted due to DTLB Overhead or Cycles of 1 Port Utilized issues."

Now, it is understandable why DTLB Overhead may contribute to this. But I fail to understand why Cycles of 1 Port Utilized (dependency issues) would affect L1 Bound.

If there are data dependencies, the hardware prefetcher would most likely have brought that data into the caches already (again, no misses at L1).

If it is only due to computation dependencies (the data is still not ready), then it makes sense. I just wanted to make sure that this is indeed the case.

Thanks.

Dmitry_R_Intel1 · ‎08-16-2021

Let me post the description of the Cycles of 1 Port Utilized metric, which offers some hints on this question:

"This metric represents cycles fraction where the CPU executed total of 1 uop per cycle on all execution ports. This can be due to heavy data-dependency among software instructions, or oversubscribing a particular hardware resource. In some other cases with high Cycles of 1 Port Utilized and L1 Bound, this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful."


HarshVardhanKumar · ‎08-17-2021

Hey, Thanks! I missed that line...

RaeesaM_Intel · ‎08-17-2021

Hi,

Thank you for accepting the solution provided by Dmitry.

If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Regards,

Raeesa