Hello,
I'm trying to measure TLB misses with the following counters:
DTLB_LOAD_MISSES.ANY
MEM_LOAD_RETIRED.DTLB_MISS
The second one gives more misses than the first one. Also, the first one reports roughly twice as many misses as I expect. What could be the reasons? Does the first one count each miss twice, once for the first-level TLB miss and once for the second-level miss? The machine I'm using is a Xeon L5520. Any help is appreciated.
Cheers,
3 Replies
Hello cagribal,
I ran a test to check the counters.
The test is a 'read memory bandwidth' test.
I start 1 thread/cpu and each thread reads a 40MB array using a 64 byte stride for 10 seconds.
I would expect 1 DTLB miss per page. Each page is 4096 bytes.
It takes 64 loads to cover a page (4096-byte page / 64-byte stride = 64 loads).
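The expected numbers can be written down as a small calculation. This is only a sketch of the arithmetic: the helper names are hypothetical, and it assumes "40MB" means 40 * 1024 * 1024 bytes and that every page is a 4096-byte page.

```c
#include <assert.h>
#include <stddef.h>

/* Loads needed to touch a whole page when stepping by stride_bytes.
 * Assumes the stride divides the page size evenly. */
static size_t loads_per_page(size_t page_bytes, size_t stride_bytes)
{
    assert(stride_bytes > 0 && page_bytes % stride_bytes == 0);
    return page_bytes / stride_bytes;
}

/* Pages touched in one pass over the array (rounded up). With one DTLB
 * miss per page, this is also the expected miss count per pass. */
static size_t pages_per_pass(size_t array_bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    return (array_bytes + page_bytes - 1) / page_bytes;
}
```

Under those assumptions, `loads_per_page(4096, 64)` is 64 and `pages_per_pass(40u * 1024u * 1024u, 4096)` is 10240, so one pass issues 655,360 loads and should take about 10,240 DTLB misses.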
Here is what I counted in 1 of the 10 seconds.
[cpp]
1. DTLB_LOAD_MISSES.ANY                      556,938     556,991     560,425     556,418
2. MEM_LOAD_RETIRED.DTLB_MISS                532,471     513,618     524,524     526,461
3. MEM_INST_RETIRED.LOADS                 37,694,658  38,354,887  36,165,843  34,506,563
4. UNC_LLC_LINES_IN.ANY                  133,367,850
5. DTLB_MISSES.WALK_COMPLETED                558,674     558,890     565,344     559,734
a. loads/DTLB_miss, row3/row1                  67.68       68.86       64.53       62.02
b. loads/DTLB_miss, row3/row2                  70.79       74.68       68.95       65.54
c. loads/DTLB_miss, row3/row5                  67.47       68.63       63.97       61.65
d. LLC_misses/DTLB_miss, row4/sum(row2)        63.60
e. loads/LLC_miss, sum(row3)/row4               1.10
[/cpp]
The raw data is in rows 1-5.
I compute the loads/DTLB_miss in rows a-c.
The loads/DTLB_miss is close to the expected 64. I ran the test on my work laptop, which has tons of stuff running on it, so some noise is to be expected.
Row d. shows the LLC (Last level cache) misses / dtlb_miss. This is very close to 64 and is probably the best measure (since most of the LLC misses are due to my read memory bw test case).
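For anyone who wants to re-derive rows a-e, the sketch below recomputes them from the raw counts in the table. The array and function names are made up for illustration; only the numbers come from the post.

```c
#include <assert.h>

#define NCORES 4

/* Raw per-core counts for one second, copied from rows 1, 2, 3 and 5. */
static const double dtlb_load_misses_any[NCORES]       = {556938, 556991, 560425, 556418};
static const double mem_load_retired_dtlb_miss[NCORES] = {532471, 513618, 524524, 526461};
static const double mem_inst_retired_loads[NCORES]     = {37694658, 38354887, 36165843, 34506563};
static const double dtlb_misses_walk_completed[NCORES] = {558674, 558890, 565344, 559734};

/* Row 4 is an uncore event, so there is one value for the whole package. */
static const double unc_llc_lines_in_any = 133367850;

/* Sum of a per-core counter over all cores. */
static double sum_cores(const double *v)
{
    double total = 0.0;
    for (int i = 0; i < NCORES; i++)
        total += v[i];
    return total;
}

/* Plain ratio with a guard against a zero denominator. */
static double ratio(double num, double den)
{
    assert(den != 0.0);
    return num / den;
}
```

Rows a-c for a given core are `ratio(mem_inst_retired_loads[core], X[core])` with `X` being the row 1, row 2, or row 5 array. Row d is `ratio(unc_llc_lines_in_any, sum_cores(mem_load_retired_dtlb_miss))` and row e is `ratio(sum_cores(mem_inst_retired_loads), unc_llc_lines_in_any)`.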
So... in conclusion... I don't see overcounting. Certainly not 2x too many DTLB misses.
Can you tell us more about your expected count and methodology?
Pat
Hello Pat,
The test confused me. There is much I can't understand.
[cpp]
a. loads/DTLB_miss, row3/row1    67.68  68.86  64.53  62.02
b. loads/DTLB_miss, row3/row2    70.79  74.68  68.95  65.54
c. loads/DTLB_miss, row3/row5    67.47  68.63  63.97  61.65
[/cpp]
The first column is the same, but the second column is different. I don't know why that is.
Also, what is the test program? Could I get the code? I would like to run it myself.
GHui
Hello GHui,
Yeah, the table is not so clear.
Here is what it should look like:
[cpp]
                                 core0  core1  core2  core3
a. loads/DTLB_miss(row3/row1)    67.68  68.86  64.53  62.02
b. loads/DTLB_miss(row3/row2)    70.79  74.68  68.95  65.54
c. loads/DTLB_miss(row3/row5)    67.47  68.63  63.97  61.65
[/cpp]
So all 3 rows are "loads/DTLB_misses" but computed from different quantities.
The test program is my 'id_cpu' utility. I don't have approval to release it.
But it should be relatively easy to reproduce the results with any read-memory test that uses a 64-byte stride (just touch each cache line) over a 40 MB array.
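A minimal sketch of such a read kernel is below. This is not the 'id_cpu' utility; the function names are hypothetical, and it shows only a single pass on one thread, where the test described above runs one thread per CPU for 10 seconds. The counters themselves would still have to be collected with a separate profiling tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate the array and write to every byte, so that each page is
 * backed by physical memory before the measured reads start. */
static uint8_t *make_array(size_t bytes)
{
    uint8_t *buf = malloc(bytes);
    if (buf != NULL)
        memset(buf, 1, bytes);
    return buf;
}

/* Read one byte at every stride-th offset. The volatile qualifier keeps
 * the compiler from dropping the loads; the running sum is stored in
 * *sink. Returns the number of loads issued. */
static size_t strided_read(const volatile uint8_t *buf, size_t bytes,
                           size_t stride, uint64_t *sink)
{
    assert(stride > 0);
    size_t loads = 0;
    uint64_t acc = 0;
    for (size_t off = 0; off < bytes; off += stride) {
        acc += buf[off];
        loads++;
    }
    *sink = acc;
    return loads;
}
```

Calling `strided_read(buf, 40u * 1024u * 1024u, 64, &sink)` in a loop for a fixed time would approximate the described test. If the operating system backs the array with large pages instead of 4096-byte pages, the loads per DTLB miss will come out far higher than 64, so it is worth checking the page size before comparing numbers.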
Pat