
Perf counters for measuring TLB miss rate


I want to measure the following things for an application:

  1. TLB miss rate
  2. Number of cycles spent in Page Walks
  3. Runtime in number of cycles

I have an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz system. 

To calculate these, I am using the following perf counters:

  1. Total number of memory references (X) = mem_inst_retired.all_loads:u + mem_inst_retired.all_stores:u
  2. Total number of memory references that missed in the TLB (Y) = mem_inst_retired.stlb_miss_loads:u + mem_inst_retired.stlb_miss_stores:u

From these, I derive:

  1. TLB miss rate = Y/X
  2. Number of cycles spent in page walks = dtlb_store_misses.walk_pending:u + dtlb_load_misses.walk_pending:u
  3. Runtime in number of cycles = cycles
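As a rough illustration of the arithmetic above, here is a minimal Python sketch that derives the three measurements from raw event counts. The function name `tlb_metrics` and the sample counts are hypothetical and made up for illustration; they are not measurements from my machine.

```python
def tlb_metrics(counts):
    """Derive the three measurements from raw perf event counts.

    `counts` maps event names (without the :u suffix) to integer counts.
    """
    # X: all retired memory references (loads + stores)
    x = counts["mem_inst_retired.all_loads"] + counts["mem_inst_retired.all_stores"]
    # Y: retired memory references that missed the STLB
    y = (counts["mem_inst_retired.stlb_miss_loads"]
         + counts["mem_inst_retired.stlb_miss_stores"])
    # Page-walk cycles as defined in the post: sum of both walk_pending events
    walk = (counts["dtlb_load_misses.walk_pending"]
            + counts["dtlb_store_misses.walk_pending"])
    return {
        "tlb_miss_rate": y / x if x else 0.0,
        "page_walk_cycles": walk,
        "runtime_cycles": counts["cycles"],
    }


# Made-up counts, for illustration only.
example = {
    "mem_inst_retired.all_loads": 1_000_000,
    "mem_inst_retired.all_stores": 500_000,
    "mem_inst_retired.stlb_miss_loads": 1_000,
    "mem_inst_retired.stlb_miss_stores": 500,
    "dtlb_load_misses.walk_pending": 40_000,
    "dtlb_store_misses.walk_pending": 20_000,
    "cycles": 10_000_000,
}

metrics = tlb_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value}")
```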

I am unsure which of these three candidates to use for counting the total number of references that missed the TLB:

  1. dtlb_load_misses.miss_causes_a_walk + dtlb_store_misses.miss_causes_a_walk
  2. dtlb_load_misses.walk_completed + dtlb_store_misses.walk_completed
  3. mem_inst_retired.stlb_miss_loads + mem_inst_retired.stlb_miss_stores

When I ran a sequential write over a 64 MB array ({ arr[i] = i; }) with THP disabled, I got the following values for the store variants of these counters:

dtlb_store_misses.miss_causes_a_walk = 154771

dtlb_store_misses.walk_completed = 116499

mem_inst_retired.stlb_miss_stores = 15566

When I double the array size to 128 MB and then to 256 MB, these counters also approximately double. Since a 64 MB array spans 16K pages (with 4 KB pages), mem_inst_retired.stlb_miss_stores gives the value closest to the expected one miss per page.
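The page-count comparison can be checked with a few lines of Python. This is only a sketch: it assumes 4 KiB base pages (THP is disabled) and one expected STLB miss per page touched, and the variable names are my own.

```python
ARRAY_BYTES = 64 * 1024 * 1024   # 64 MB array from the experiment
PAGE_BYTES = 4 * 1024            # assumption: 4 KiB base pages (THP disabled)

# One TLB miss per page is the naive expectation for a sequential sweep.
expected_pages = ARRAY_BYTES // PAGE_BYTES

# Counter values reported above for the 64 MB run.
observed = {
    "dtlb_store_misses.miss_causes_a_walk": 154771,
    "dtlb_store_misses.walk_completed": 116499,
    "mem_inst_retired.stlb_miss_stores": 15566,
}

# Ratio of each counter to the expected number of pages.
ratios = {name: value / expected_pages for name, value in observed.items()}

# Event whose count is closest to the expected page count.
closest = min(observed, key=lambda name: abs(observed[name] - expected_pages))

print(f"expected pages: {expected_pages}")
for name, ratio in ratios.items():
    print(f"{name}: {observed[name]} ({ratio:.2f}x expected)")
print(f"closest: {closest}")
```

With these numbers, mem_inst_retired.stlb_miss_stores is about 0.95x the page count, while the two dtlb_store_misses events are roughly 7x and 9x.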

Also, I didn’t see any effect of the next-page prefetcher (NPP) in this experiment, as mentioned in this post ( ). So I suppose that my machine, which has the Skylake architecture, doesn’t have an NPP.

Could you please let me know if I have chosen the right counters for my measurements?

Thanks in advance!

Best Regards,


1 Reply

In your Y/X ratio, the denominator counts only load and store requests from retired instructions (these events are described as being counted at retirement). So it makes more sense to me to use a numerator that is also counted at retirement, namely mem_inst_retired.stlb_miss_loads + mem_inst_retired.stlb_miss_stores, for what you've described as "Total number of memory references that missed in TLB."

These events are counted together. For example, if a load retires and it missed in the STLB, the event counts of mem_inst_retired.all_loads and mem_inst_retired.stlb_miss_loads are incremented and, on SKL/SKX in particular, they are incremented by the same amount, which is 1.

The STLB is the last-level TLB on SKL/SKX. A miss in the STLB doesn't trigger a page walk if there is already an outstanding speculative walk initiated by the NPP. It is also possible that a miss in the STLB doesn't trigger a walk because the walk that was about to start was cancelled by the time the miss determination completed. Otherwise, a miss in the STLB triggers a walk.
