I am profiling DTLB misses on an Intel Skylake CPU.
However, judging from my benchmarks, the DTLB miss performance counters do not seem to be precise. I suspect that speculative prefetches are counted by the DTLB miss performance counters, such as the following one:
DTLB_LOAD_MISSES.STLB_HIT: Loads that miss the DTLB and hit the STLB.
The counts appear to be several orders of magnitude larger than the expected number. They are only precise when I run a pointer-chasing benchmark, where each load depends on the previous one. Can anyone explain to me the meaning of the performance counter above?
Also, is there any way to count only retired DTLB misses, so that misses caused by speculation are excluded?
I do see a retired STLB miss performance counter:
Are there similar performance counters for the DTLB?
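For reference, here is a minimal sketch of the kind of pointer-chasing benchmark mentioned above, not the exact code that was measured. It links one pointer per page into a single random cycle, so every load address depends on the previous load and the core cannot issue loads to later pages speculatively. The names `build_chain` and `chase`, the xorshift generator, and the page size passed in are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* xorshift64 PRNG: keeps the shuffle reproducible; the seed must be nonzero. */
static uint64_t next_rand(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/*
 * Allocate `pages` pages and store one pointer at the start of each page,
 * linking all pages into a single cycle in random order.
 * Returns the start of page 0, or NULL on failure. Release with free().
 */
void **build_chain(size_t pages, size_t page_size, uint64_t seed) {
    /* page_size must be a power of two that can hold a pointer. */
    assert(page_size >= sizeof(void *) && (page_size & (page_size - 1)) == 0);
    if (pages == 0 || seed == 0) return NULL;

    char *buf = aligned_alloc(page_size, pages * page_size);
    size_t *order = malloc(pages * sizeof *order);
    if (buf == NULL || order == NULL) { free(buf); free(order); return NULL; }

    /* Fisher-Yates shuffle of the page visiting order. */
    for (size_t i = 0; i < pages; i++) order[i] = i;
    for (size_t i = pages - 1; i > 0; i--) {
        size_t j = (size_t)(next_rand(&seed) % (i + 1));
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }

    /* Page order[k] points to page order[k + 1]; the last wraps to the first. */
    for (size_t k = 0; k < pages; k++) {
        void **slot = (void **)(buf + order[k] * page_size);
        *slot = buf + order[(k + 1) % pages] * page_size;
    }
    free(order);
    return (void **)buf;
}

/* Follow the chain for `steps` dependent loads and return the final node. */
void *chase(void **head, size_t steps) {
    void **p = head;
    for (size_t i = 0; i < steps; i++) {
        p = (void **)*p;
    }
    return (void *)p;
}
```

With, say, `build_chain(1 << 14, 4096, 42)` followed by a long `chase` call, each step touches a different 4 KiB page, so the expected number of DTLB misses is roughly the number of steps once the working set exceeds the DTLB reach.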
I have not had a chance to look at the DTLB miss events on Skylake, but the discussion in the forum thread at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring... for Haswell (Xeon E5 v3) systems may be relevant -- especially when you get down as far as comment # 9 (https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/593830#comment-1840629), where I finally realized that the "next page prefetcher" eliminates the overwhelming majority of TLB misses for contiguous access patterns.
Understanding the Haswell results required comparison between the DTLB_LOAD_MISSES (Event 0x08) and PAGE_WALKER_LOADS (Event 0xBC) results. The PAGE_WALKER_LOADS event is not listed in the events for Skylake, but there is no new event listed for 0xBC, so it is possible that the event still exists.
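One way to check whether an unlisted event such as 0xBC still counts anything is to program it by raw event code. The sketch below assumes Linux perf (the thread does not name a tool), where a raw core PMU event is encoded with the event select in bits 0-7 and the unit mask in bits 8-15. The helper names `raw_config` and `raw_event_name` are hypothetical, and the unit masks have to come from Intel's event tables; none are given here for 0xBC.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Raw config for an Intel core PMU event:
 * event select in bits 0-7, unit mask in bits 8-15. */
uint64_t raw_config(uint8_t event, uint8_t umask) {
    return ((uint64_t)umask << 8) | (uint64_t)event;
}

/* Write the "rUUEE" string accepted by `perf stat -e`.
 * Returns the number of characters written, or -1 if it does not fit. */
int raw_event_name(char *out, size_t len, uint8_t event, uint8_t umask) {
    assert(out != NULL);
    int n = snprintf(out, len, "r%04llx",
                     (unsigned long long)raw_config(event, umask));
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For example, `raw_event_name(buf, sizeof buf, 0x08, 0x20)` produces `r2008`, which could be passed to `perf stat -e r2008`. The 0x20 unit mask for DTLB_LOAD_MISSES.STLB_HIT is an assumption that should be checked against the event list for the exact CPU model.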