I am measuring the number of walk cycles of an application on an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz machine. However, the number of walk cycles I obtain is greater than the total number of cycles.
Am I using the wrong counters to measure walk cycles?
Or do these walk cycles also include walks caused by the prefetcher? In that case, how do I measure only the demand walk cycles?
Any hint would be highly appreciated.
Thanks in advance!
Starting in the SKL processor, there are two Page Table Walkers per core (Intel Optimization Reference Manual section 2.3.3, document 248966-043), and it looks like you are seeing both of them in use in most cycles -- averaging 1.5 load miss walks pending plus 0.2 store miss walks pending over the full execution time.
I don't think I have tested this on SKX, but in the past these performance counter events only counted activity due to demand references -- not activity due to the next-page-prefetcher.
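Because there are two walkers, a counter that adds up the walks in flight each cycle can legitimately exceed the cycle count. The sketch below only illustrates that arithmetic. It assumes at most two walks in flight per cycle, a pending-style count that adds the number of walks in flight every cycle, and an active-style count that adds one for every cycle with at least one walk in flight. The function name and all counter values are hypothetical; only the 1.5 average is taken from the discussion above.

```python
def split_walk_cycles(cycles, walk_pending, walk_active):
    """Split cycles into idle / one walk / two walks in flight.

    Assumes at most two walks can be in flight in any cycle, so that
        walk_active  = one + two
        walk_pending = one + 2 * two
    """
    if not 0 <= walk_active <= cycles:
        raise ValueError("walk_active must lie between 0 and cycles")
    if not walk_active <= walk_pending <= 2 * walk_active:
        raise ValueError("walk_pending is inconsistent with two walkers")

    two = walk_pending - walk_active   # cycles with two walks in flight
    one = walk_active - two            # cycles with exactly one walk in flight
    idle = cycles - walk_active        # cycles with no walk in flight
    return {
        "idle": idle,
        "one": one,
        "two": two,
        "avg_in_flight": walk_pending / cycles,
    }


# Hypothetical counts chosen so the average matches the 1.5 quoted above.
result = split_walk_cycles(
    cycles=1_000_000,
    walk_pending=1_500_000,
    walk_active=900_000,
)
print(result)
```

With these made-up inputs the pending count is 1.5 times the cycle count even though a walk is in flight in only 90% of cycles, which is the situation described in the question.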
Based on the definitions of these events in Table 19-6 of the Intel SWDM Volume 3 (document 325384-073), the DTLB_LOAD_MISSES.WALK_ACTIVE event counts cycles in which at least one Page Miss Handler (PMH) is active, while DTLB_LOAD_MISSES.WALK_PENDING increments by the number of PMHs that are active in each cycle. Your results show:
With a little more magic middle-school algebra, I think I derived bounds on the breakdown of activity by cycle. There are six possible categories of activity and only five data items, so bounds are the most one can hope for....
|PMH0 activity||PMH1 activity||% of time with minimum overlap of LD and ST TLB misses||% of time with maximum overlap of LD and ST TLB misses|