I am running Linux on a 32-nm Westmere core. I am concerned about seemingly conflicting DTLB miss numbers from the performance counters. I ran two sets of experiments with a single-threaded random memory access test program, as follows:
Experiment (1): I counted the DTLB misses using the following performance counter:
DTLB_MISSES.WALK_COMPLETED (Event 49H, Umask 02H)
Experiment (2): I counted the DTLB misses by summing the values of the following two counters:
MEM_LOAD_RETIRED.DTLB_MISS (Event CBH, Umask 80H)
MEM_STORE_RETIRED.DTLB_MISS (Event 0CH, Umask 01H)
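For anyone who wants to program these three events by hand, here is a minimal sketch of how the Event/Umask pairs above pack into a raw event code. It assumes the Linux perf tool's raw-event syntax (`r` followed by the hex value of umask << 8 | event); the original post does not say which tool was used, and the helper names `raw_event` and `perf_name` are made up for illustration.

```python
# Event/Umask pairs exactly as listed in the post above.
EVENTS = {
    "DTLB_MISSES.WALK_COMPLETED": (0x49, 0x02),
    "MEM_LOAD_RETIRED.DTLB_MISS": (0xCB, 0x80),
    "MEM_STORE_RETIRED.DTLB_MISS": (0x0C, 0x01),
}


def raw_event(event, umask):
    """Pack an event select and unit mask into one raw config value."""
    return (umask << 8) | event


def perf_name(event, umask):
    """Format the raw value the way perf accepts it (assumed: rUUEE in hex)."""
    return f"r{raw_event(event, umask):04x}"


if __name__ == "__main__":
    for name, (event, umask) in EVENTS.items():
        print(f"{name}: {perf_name(event, umask)}")
```

Under that assumption, experiment (1) would be counted with something like `perf stat -e r0249 ./test`, and experiment (2) with `-e r80cb,r010c`.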
I expected the results of these experiments to be similar. However, I found that the count reported in experiment (1) is almost twice that of experiment (2). I am at a loss as to why this is the case.
Can somebody help shed some light on this apparent discrepancy?
This is a pretty hard question to answer. Probably I need to ask a few questions first.
How is your test configured? Usually when I think of random memory access tests that try to exercise the TLB, there are a few characteristics: 1) a very large memory array (20x LLC?), 2) dependent loads, i.e. a linked-list pointer chain, 3) each memory address is at least one page size away from the previous address, 4) the addresses are picked at random from an available list of one entry per page.
Is your test constructed like this? I'm confused by the reference to MEM_STORE_RETIRED.
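To make the four characteristics above concrete, here is a rough sketch of how the visiting order for such a pointer chain could be built. It is only an illustration of the construction, not a usable benchmark: a real TLB test would be written in C with the pointers stored in the array itself. The 4 KiB page size, the function names, and the use of Sattolo's shuffle (to get one cycle through every page) are assumptions of this sketch.

```python
import random

PAGE_SIZE = 4096  # assumed 4 KiB pages


def build_page_chain(num_pages, seed=0):
    """Return next_page, where next_page[i] is the page visited after page i.

    Sattolo's algorithm produces a permutation with a single cycle, so a
    walk starting anywhere touches every page once before repeating.
    """
    rng = random.Random(seed)
    next_page = list(range(num_pages))
    for i in range(num_pages - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        next_page[i], next_page[j] = next_page[j], next_page[i]
    return next_page


def walk(next_page, steps, start=0):
    """Follow the chain and return the byte offset touched at each step."""
    page = start
    offsets = []
    for _ in range(steps):
        offsets.append(page * PAGE_SIZE)  # one entry per page
        page = next_page[page]            # dependent "load"
    return offsets


if __name__ == "__main__":
    chain = build_page_chain(8)
    print(walk(chain, 8))
```

Each step depends on the previous one, every address is a whole number of pages away from the last, and the order is random with one entry per page. The array size (point 1) is set through `num_pages`.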
I have only a little time to spend on this forum, so any extra info you can provide improves the chances of getting a useful answer.
Thanks, Pat, for your time. I am using the sequential version of the GUPS benchmark (http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/). It updates random locations in a very large array.
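For readers who have not seen GUPS, the update pattern looks roughly like the sketch below. It is modeled on my recollection of the HPCC RandomAccess kernel (a shift-and-XOR random stream, then an XOR update of one table entry); the polynomial constant, the seed, and the function names are assumptions of this sketch, not copied from the benchmark source.

```python
POLY = 0x7              # assumed feedback polynomial of the random stream
MASK64 = (1 << 64) - 1  # keep values in 64 bits, like an unsigned long


def next_random(ran):
    """Advance the 64-bit shift-and-XOR pseudo-random stream by one step."""
    top_bit_set = ran >> 63
    return ((ran << 1) & MASK64) ^ (POLY if top_bit_set else 0)


def apply_updates(table, num_updates, seed=1):
    """XOR-update num_updates pseudo-random entries of table in place."""
    size = len(table)
    assert size & (size - 1) == 0, "table size must be a power of two"
    ran = seed
    for _ in range(num_updates):
        ran = next_random(ran)
        # Each update reads and then writes the same random location.
        table[ran & (size - 1)] ^= ran
    return table


if __name__ == "__main__":
    table = list(range(16))
    apply_updates(table, 64)
    print(table)
```

The point relevant to this thread is that every update is a load followed by a store to the same random address, so both load and store events are in play.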
Let me simplify the question and forget about the stores. I think my confusion stems from misunderstanding what the following two events actually count: (1) DTLB_LOAD_MISSES.WALK_COMPLETED (Event 08H, Umask 02H) and (2) MEM_LOAD_RETIRED.DTLB_MISS (Event CBH, Umask 80H).
According to Table 19-11 of the manual http://download.intel.com/products/processor/manual/325384.pdf , event (1) counts DTLB misses (misses in both levels of the DTLB) that cause a page walk that completes (I believe a page walk may not complete if it is on a mis-speculated path due to out-of-order execution). Intel's performance tuning manuals suggest using this event to understand the effect of DTLB misses.
What confuses me is event (2). According to the manual, this also counts DTLB misses; it excludes software prefetches and faults, but includes both L1 and L2 DTLB misses (unlike event (1)).
I was surprised to find that the measurement for event (1) is almost twice that of event (2). I made sure that the pages are pre-faulted in and that the application does not use software prefetches, so those cannot explain this huge difference. So my question is: why are these two numbers so different? Am I misunderstanding them?
Any help is greatly appreciated.
I've run some tests I have for measuring PMCs 0x08, 0x49 and 0xD0.. and on SB and HW these stats are working. Unfortunately on IB, PMCs 0x08 and 0x49 don't appear to work.. they're broken. On IB, though, PMC 0xD0 does measure L2 TLB misses and I've verified it's functioning.. so you can use it.
Thanks, Perfwise. However, let me confirm what I understand from your comments. If I read you correctly, you are saying that the PMCs DTLB_LOAD_MISSES (08), DTLB_STORE_MISSES (49) and MEM_UOPS_RETIRED (D0) all work fine on Sandy Bridge and Haswell, but DTLB_LOAD_MISSES and DTLB_STORE_MISSES do not work on Ivy Bridge, while MEM_UOPS_RETIRED works OK on Ivy Bridge. Is this correct? Can you share some insight into why you think those two counters do not work on Ivy Bridge?
Further, have you tested on Westmere? Do these counters work on Westmere?
I say they work.. because I've got tests I've built to probe all these things. I measure the perf in the test.. and know whether what I'm seeing is real, and I quantify the impact of the event on my codes. I also know the ISA and the # of LDs and STs in this case per some fixed instruction count, and I verify that the corollary events, like the STLB requests, match the # of LDs in this case (all LDs in my test miss the TLB). That's the only way to know which events listed in the system prog guide are good or bad.. and there are bad events.. as I pointed out. On IB.. from the publicly avail events.. you can't get TLB info with what is provided.
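The kind of sanity check described above (compare a counter against a count you already know from the test's construction) could be sketched like this. The 5% tolerance and the function names are made up for illustration; the numbers in the example are placeholders, not measurements.

```python
def relative_error(measured, expected):
    """Fractional difference between a counter reading and the known count."""
    return abs(measured - expected) / expected


def counter_looks_sane(measured, expected, tolerance=0.05):
    """True if the counter is within the (assumed) tolerance of the known count."""
    return relative_error(measured, expected) <= tolerance


if __name__ == "__main__":
    known_loads = 1_000_000  # loads the test is built to issue, all missing the TLB
    for reading in (1_020_000, 2_000_000):  # placeholder counter readings
        print(reading, counter_looks_sane(reading, known_loads))
```

A reading close to the known load count suggests the event is trustworthy for that test; one that is off by a large factor suggests either the event or the assumption about the test is wrong.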
I don't have a Westmere to test on.. so can't tell you.. but because it's based on SB.. I suspect you're good to go.. they probably work.