Hi!
I am using VTune to measure cache hits and misses (loads) at the different cache levels. I assumed L2_MISS = L3_HIT + L3_MISS (and similarly for L1 and L2), but this relationship does not seem to hold in the output below.
Config: Intel Core i3-5005U + Windows 10
CPU
Name: Intel(R) Core(TM) Processor code named Broadwell
Frequency: 2.0 GHz
Logical CPU Count: 4
Elapsed Time: 60.004s
CPU Time: 25.576s
CPI Rate: 1.641
Total Thread Count: 4
Paused Time: 0s
Hardware Events
Hardware Event Type Hardware Event Count Hardware Event Sample Count Events Per Sample
BACLEARS.ANY 223,106,693 97 100003
BR_MISP_RETIRED.ALL_BRANCHES_PS 64,401,449 7 400009
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE 1,497,344,919 651 100003
CPU_CLK_UNHALTED.REF_TSC 51,034,000,000 25,517 2000000
CPU_CLK_UNHALTED.REF_XCLK 2,645,079,350 1,150 100003
CPU_CLK_UNHALTED.THREAD 51,314,000,000 25,657 2000000
CPU_CLK_UNHALTED.THREAD_P 47,242,070,863 1,027 2000003
CYCLE_ACTIVITY.STALLS_L1D_MISS 13,616,020,424 296 2000003
CYCLE_ACTIVITY.STALLS_L2_MISS 10,350,015,525 225 2000003
CYCLE_ACTIVITY.STALLS_MEM_ANY 20,332,030,498 442 2000003
CYCLE_ACTIVITY.STALLS_TOTAL 29,992,044,988 652 2000003
INST_RETIRED.ANY 31,262,000,000 15,631 2000000
INST_RETIRED.PREC_DIST 30,130,045,195 655 2000003
INST_RETIRED.X87 0 0 2000003
INT_MISC.RECOVERY_CYCLES 276,000,414 6 2000003
ITLB_MISSES.STLB_HIT 50,601,518 22 100003
ITLB_MISSES.WALK_COMPLETED 85,102,553 37 100003
ITLB_MISSES.WALK_DURATION 2,884,286,526 1,254 100003
L1D.REPLACEMENT 1,518,002,277 33 2000003
L1D_PEND_MISS.FB_FULL 46,000,069 1 2000003
L1D_PEND_MISS.PENDING 33,810,050,715 735 2000003
L2_RQSTS.RFO_HIT 55,200,828 12 200003
LD_BLOCKS.NO_SR 0 0 100003
LD_BLOCKS.STORE_FORWARD 39,101,173 17 100003
LD_BLOCKS_PARTIAL.ADDRESS_ALIAS 71,302,139 31 100003
LSD.CYCLES_4_UOPS 138,000,207 3 2000003
LSD.CYCLES_ACTIVE 92,000,138 2 2000003
LSD.UOPS 506,000,759 11 2000003
MACHINE_CLEARS.COUNT 2,300,069 1 100003
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS 27,154,927 59 20011
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS 10,585,819 23 20011
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS 5,523,036 12 20011
MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS 565,816,974 246 100003
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS 6,716,010,074 146 2000003
MEM_LOAD_UOPS_RETIRED.L1_MISS_PS 761,322,839 331 100003
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS 434,713,041 189 100003
MEM_LOAD_UOPS_RETIRED.L2_MISS_PS 332,489,587 289 50021
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS 287,620,750 250 50021
MEM_LOAD_UOPS_RETIRED.L3_MISS 9,200,644 4 100007
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS 6,900,483 3 100007
MEM_UOPS_RETIRED.ALL_STORES_PS 5,888,008,832 128 2000003
MEM_UOPS_RETIRED.LOCK_LOADS_PS 262,218,354 114 100007
MEM_UOPS_RETIRED.SPLIT_LOADS_PS 4,600,138 2 100003
MEM_UOPS_RETIRED.SPLIT_STORES_PS 0 0 100003
MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS 108,103,243 47 100003
MEM_UOPS_RETIRED.STLB_MISS_STORES_PS 2,300,069 1 100003
Any help regarding this would be appreciated.
Thanks!
Hi Sakura,
Could you please share the VTune results with us so we can look into the cache hit and miss issue?
Arun Jose
I have a few suggestions:
1. Please use the 'Limit PMU collection to counting' option to improve the accuracy.
2. Please try to disable the hardware prefetchers (through the BIOS or the MSR as described here: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors) if possible. The MEM_LOAD_UOPS_RETIRED events account only for demand loads; if the data was brought in by a prefetcher, they won't increment.
It is important to understand that VTune uses a sampling methodology and that VTune is multiplexing the counters across the various events.
If you look at the third column ("Hardware Event Sample Count") for MEM_LOAD_UOPS_RETIRED.L3_MISS and MEM_LOAD_UOPS_RETIRED.L3_MISS_PS, you will see that those events were only counted 4 times and 3 times, respectively. The "Hardware Event Count" in column 2 is not directly measured -- it is a scaled estimate based on the "Hardware Event Sample Count", the "Events Per Sample" value, and the fraction of the execution time during which each performance counter event was active.
Using the same L3_MISS events as an example:
MEM_LOAD_UOPS_RETIRED.L3_MISS 9,200,644 4 100007
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS 6,900,483 3 100007
dividing the "Hardware Event Count" by the "Hardware Event Sample Count" and then dividing by the "Events Per Sample" value gives exactly 23. This suggests that VTune was multiplexing 23 different performance counter event sets, and that each set was only being measured (approximately) 1/23rd of the time. Each of the "Hardware Event Counts" should be interpreted as having a relative uncertainty of (at least) 1/(Hardware Event Sample Count) -- i.e., 25% for MEM_LOAD_UOPS_RETIRED.L3_MISS and 33% for MEM_LOAD_UOPS_RETIRED.L3_MISS_PS.
If you want more precise estimates, you should limit the sampling to a much smaller number of counters.
The most precise numbers come from measuring a single set of events for the full duration of the program, rather than using a sampling methodology.
It is also true that the MEM_LOAD_UOPS_RETIRED events only count accesses due to demand loads, and not those due to activity of the L2 HW prefetchers. When the prefetchers are working well the L2 and L3 cache miss counts can be reduced substantially. This makes these events good for finding loads that don't get their data prefetched (and therefore have a much higher chance of causing stalls), but not good for estimating the total amount of traffic through the cache hierarchy. The L2_RQSTS events and the OFFCORE_RESPONSE events are more useful for getting an idea of the total traffic for various transaction types at each level of the cache hierarchy.
I tried disabling the H/W Prefetchers using PCM
#include <iostream>
#include <bitset>
#include <memory>   // std::shared_ptr / std::make_shared
#include <vector>
#include "cpucounters.h"
#include "msr.h"

// MSR 0x1A4 (prefetcher control); setting bits 0-3 disables the four H/W prefetchers
// (see the Intel article linked above).
constexpr uint64 MSR_NUM = 0x1A4U;

int main(int argc, const char *argv[])
{
    PCM *m = PCM::getInstance();
    if (m->program() != PCM::Success)
    {
        std::cout << "Failed to init PCM" << std::endl;
        return 1;
    }

    // One MSR handle per core (a dummy handle for offline cores).
    std::vector<std::shared_ptr<SafeMsrHandle>> MSR;
    for (int i = 0; i < m->getNumCores(); ++i)
    {
        if (m->isCoreOnline(int32(i)))
            MSR.push_back(std::make_shared<SafeMsrHandle>(i));
        else
            MSR.push_back(std::make_shared<SafeMsrHandle>());
    }

    uint64 val = 0x0FU;

    // Read the current MSR value on each core.
    for (auto &msr : MSR)
    {
        if (!msr->read(MSR_NUM, &val))
            std::cout << msr->getCoreId() << " error while reading" << std::endl;
        std::cout << "Core : " << msr->getCoreId() << " \t "
                  << "MSR Value : " << std::bitset<64>(val) << std::endl;
    }

    // Write 0x0F (bits 0-3 set) to disable the H/W prefetchers on each core.
    val = 0x0FU;
    for (auto &msr : MSR)
    {
        if (!msr->write(MSR_NUM, val))
            std::cout << "error writing to MSR of core : " << msr->getCoreId() << std::endl;
    }

    std::cout << std::endl;

    // Read the MSR value back to confirm the write.
    for (auto &msr : MSR)
    {
        if (!msr->read(MSR_NUM, &val))
            std::cout << msr->getCoreId() << " error while reading" << std::endl;
        std::cout << "Core : " << msr->getCoreId() << " \t "
                  << "MSR Value : " << std::bitset<64>(val) << std::endl;
    }

    m->resetPMU();
    return 0;
}
Config #1 : Disabled H/W Prefetcher and Enabled 'Limit PMU collection to counting'
Hardware Events
Hardware Event Type Hardware Event Count Hardware Event Sample Count Events Per Sample Precise
BACLEARS.ANY 557,855,130 4 [Unknown]
BR_MISP_RETIRED.ALL_BRANCHES_PS 216,106,360 4 [Unknown]
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE 7,621,940,670 4 [Unknown]
CPU_CLK_UNHALTED.REF_TSC 181,953,469,200 4 [Unknown]
CPU_CLK_UNHALTED.REF_XCLK 9,097,798,600 4 [Unknown]
CYCLE_ACTIVITY.STALLS_L1D_MISS 26,758,381,920 4 [Unknown]
CYCLE_ACTIVITY.STALLS_L2_MISS 24,291,134,900 4 [Unknown]
CYCLE_ACTIVITY.STALLS_MEM_ANY 57,794,419,100 4 [Unknown]
CYCLE_ACTIVITY.STALLS_TOTAL 87,167,607,370 4 [Unknown]
ICACHE.IFDATA_STALL 15,173,609,210 4 [Unknown]
INST_RETIRED.ANY 183,617,012,740 4 [Unknown]
L1D_PEND_MISS.FB_FULL 463,431,990 4 [Unknown]
L1D_PEND_MISS.PENDING 70,342,063,340 4 [Unknown]
L2_RQSTS.ALL_CODE_RD 5,923,272,380 4 [Unknown]
L2_RQSTS.ALL_DEMAND_DATA_RD 3,577,095,470 4 [Unknown]
L2_RQSTS.ALL_DEMAND_MISS 3,304,622,310 4 [Unknown]
L2_RQSTS.ALL_DEMAND_REFERENCES 10,677,066,390 4 [Unknown]
L2_RQSTS.ALL_PF 15,687,420 4 [Unknown]
L2_RQSTS.ALL_RFO 1,151,078,890 4 [Unknown]
L2_RQSTS.DEMAND_DATA_RD_HIT 2,447,918,870 4 [Unknown]
L2_RQSTS.L2_PF_HIT 0 4 [Unknown]
L2_RQSTS.L2_PF_MISS 0 4 [Unknown]
L2_RQSTS.MISS 3,365,589,580 4 [Unknown]
L2_RQSTS.RFO_HIT 533,051,550 4 [Unknown]
L2_RQSTS.RFO_MISS 629,434,250 4 [Unknown]
MACHINE_CLEARS.COUNT 16,960,030 4 [Unknown]
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS 126,786,610 4 [Unknown]
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS 10,112,060 4 [Unknown]
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS 4,269,390 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS 2,061,084,670 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L1_HIT 38,064,387,780 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L1_MISS 2,392,083,810 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L2_HIT 1,752,622,830 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L2_MISS 639,460,980 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_HIT 431,623,490 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_MISS 51,383,760 4 [Unknown]
MEM_UOPS_RETIRED.ALL_STORES_PS 28,383,863,750 4 [Unknown]
MEM_UOPS_RETIRED.LOCK_LOADS_PS 825,732,500 4 [Unknown]
MEM_UOPS_RETIRED.SPLIT_LOADS_PS 52,038,990 4 [Unknown]
MEM_UOPS_RETIRED.SPLIT_STORES_PS 21,278,130 4 [Unknown]
MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS 373,317,710 4 [Unknown]
MEM_UOPS_RETIRED.STLB_MISS_STORES_PS 61,491,230 4 [Unknown]
Config #2 : Disabled H/W Prefetcher and sampling mode
Hardware Events
Hardware Event Type Hardware Event Count Hardware Event Sample Count Events Per Sample Precise
BACLEARS.ANY 228,006,840 190 100003 False
BR_MISP_RETIRED.ALL_BRANCHES_PS 62,401,404 13 400009 True
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE 1,671,650,148 1,393 100003 False
CPU_CLK_UNHALTED.REF_TSC 50,250,000,000 25,125 2000000 False
CPU_CLK_UNHALTED.REF_XCLK 2,530,875,924 2,109 100003 False
CYCLE_ACTIVITY.STALLS_L1D_MISS 9,120,013,680 380 2000003 False
CYCLE_ACTIVITY.STALLS_L2_MISS 8,304,012,456 346 2000003 False
CYCLE_ACTIVITY.STALLS_MEM_ANY 18,360,027,540 765 2000003 False
CYCLE_ACTIVITY.STALLS_TOTAL 25,632,038,448 1,068 2000003 False
ICACHE.IFDATA_STALL 6,288,009,432 262 2000003 False
INST_RETIRED.ANY 47,268,000,000 23,634 2000000 False
L1D_PEND_MISS.FB_FULL 72,000,108 3 2000003 False
L1D_PEND_MISS.PENDING 22,992,034,488 958 2000003 False
L2_RQSTS.ALL_CODE_RD 2,172,032,580 905 200003 False
L2_RQSTS.ALL_DEMAND_DATA_RD 1,288,819,332 537 200003 False
L2_RQSTS.ALL_DEMAND_MISS 1,353,620,304 564 200003 False
L2_RQSTS.ALL_DEMAND_REFERENCES 3,832,857,492 1,597 200003 False
L2_RQSTS.ALL_PF 2,400,036 1 200003 False
L2_RQSTS.ALL_RFO 388,805,832 162 200003 False
L2_RQSTS.DEMAND_DATA_RD_HIT 852,012,780 355 200003 False
L2_RQSTS.L2_PF_HIT 0 0 200003 False
L2_RQSTS.L2_PF_MISS 0 0 200003 False
L2_RQSTS.MISS 1,272,019,080 530 200003 False
L2_RQSTS.RFO_HIT 172,802,592 72 200003 False
L2_RQSTS.RFO_MISS 189,602,844 79 200003 False
MACHINE_CLEARS.COUNT 2,400,072 2 100003 False
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS 39,621,780 165 20011 True
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS 3,361,848 14 20011 True
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS 960,528 4 20011 True
MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS 840,025,200 700 100003 True
MEM_LOAD_UOPS_RETIRED.L1_HIT 10,704,016,056 446 2000003 False
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS 10,656,015,984 444 2000003 True
MEM_LOAD_UOPS_RETIRED.L1_MISS 852,025,560 710 100003 False
MEM_LOAD_UOPS_RETIRED.L1_MISS_PS 852,025,560 710 100003 True
MEM_LOAD_UOPS_RETIRED.L2_HIT 615,618,468 513 100003 False
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS 618,018,540 515 100003 True
MEM_LOAD_UOPS_RETIRED.L2_MISS 217,291,224 362 50021 False
MEM_LOAD_UOPS_RETIRED.L2_MISS_PS 217,291,224 362 50021 True
MEM_LOAD_UOPS_RETIRED.L3_HIT 180,075,600 300 50021 False
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS 180,075,600 300 50021 True
MEM_LOAD_UOPS_RETIRED.L3_MISS 6,000,420 5 100007 False
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS 6,000,420 5 100007 True
MEM_UOPS_RETIRED.ALL_STORES_PS 7,296,010,944 304 2000003 True
MEM_UOPS_RETIRED.LOCK_LOADS_PS 261,618,312 218 100007 True
MEM_UOPS_RETIRED.SPLIT_LOADS_PS 10,800,324 9 100003 True
MEM_UOPS_RETIRED.SPLIT_STORES_PS 0 0 100003 True
MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS 127,203,816 106 100003 True
MEM_UOPS_RETIRED.STLB_MISS_STORES_PS 18,000,540 15 100003 True
Config #3: Enabled H/W Prefetcher and 'Limit PMU collection to counting'
Hardware Events
Hardware Event Type Hardware Event Count Hardware Event Sample Count Events Per Sample Precise
BACLEARS.ANY 695,117,540 4 [Unknown]
BR_MISP_RETIRED.ALL_BRANCHES_PS 292,146,210 4 [Unknown]
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE 7,303,358,670 4 [Unknown]
CPU_CLK_UNHALTED.REF_TSC 206,639,106,600 4 [Unknown]
CPU_CLK_UNHALTED.REF_XCLK 10,331,982,120 4 [Unknown]
CYCLE_ACTIVITY.STALLS_L1D_MISS 41,605,016,090 4 [Unknown]
CYCLE_ACTIVITY.STALLS_L2_MISS 37,148,777,030 4 [Unknown]
CYCLE_ACTIVITY.STALLS_MEM_ANY 68,923,672,000 4 [Unknown]
CYCLE_ACTIVITY.STALLS_TOTAL 105,928,596,800 4 [Unknown]
ICACHE.IFDATA_STALL 30,367,994,640 4 [Unknown]
INST_RETIRED.ANY 189,024,991,010 4 [Unknown]
L1D_PEND_MISS.FB_FULL 484,582,810 4 [Unknown]
L1D_PEND_MISS.PENDING 106,322,399,380 4 [Unknown]
L2_RQSTS.ALL_CODE_RD 6,014,740,260 4 [Unknown]
L2_RQSTS.ALL_DEMAND_DATA_RD 3,186,052,550 4 [Unknown]
L2_RQSTS.ALL_DEMAND_MISS 4,528,008,390 4 [Unknown]
L2_RQSTS.ALL_DEMAND_REFERENCES 10,255,841,370 4 [Unknown]
L2_RQSTS.ALL_PF 11,375,132,880 4 [Unknown]
L2_RQSTS.ALL_RFO 1,048,756,160 4 [Unknown]
L2_RQSTS.DEMAND_DATA_RD_HIT 1,575,681,790 4 [Unknown]
L2_RQSTS.L2_PF_HIT 3,468,292,610 4 [Unknown]
L2_RQSTS.L2_PF_MISS 7,503,845,180 4 [Unknown]
L2_RQSTS.MISS 12,229,547,290 4 [Unknown]
L2_RQSTS.RFO_HIT 565,553,480 4 [Unknown]
L2_RQSTS.RFO_MISS 481,125,090 4 [Unknown]
MACHINE_CLEARS.COUNT 18,138,280 4 [Unknown]
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM_PS 90,361,080 4 [Unknown]
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT_PS 35,474,800 4 [Unknown]
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS_PS 16,199,850 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS 1,885,967,180 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L1_HIT 40,356,067,870 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L1_MISS 2,157,009,930 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L2_HIT 1,142,602,610 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L2_MISS 1,014,407,320 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_HIT 790,380,810 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_MISS 79,202,220 4 [Unknown]
MEM_UOPS_RETIRED.ALL_STORES_PS 29,512,105,850 4 [Unknown]
MEM_UOPS_RETIRED.LOCK_LOADS_PS 800,299,390 4 [Unknown]
MEM_UOPS_RETIRED.SPLIT_LOADS_PS 63,558,820 4 [Unknown]
MEM_UOPS_RETIRED.SPLIT_STORES_PS 30,784,420 4 [Unknown]
MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS 259,247,280 4 [Unknown]
MEM_UOPS_RETIRED.STLB_MISS_STORES_PS 55,236,070 4 [Unknown]
Even after disabling the H/W prefetcher,
MEM_LOAD_UOPS_RETIRED.L2_MISS != MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_RETIRED.L3_MISS.
In counting mode, L1_MISS equals L2_HIT + L2_MISS exactly, and in sampling mode they are roughly the same, but the L2 and L3 hit/miss counts never satisfy the equation above (MEM_LOAD_UOPS_RETIRED.L3_MISS is way off).
The same thing happens with 'Analysis in system wide mode'.
MEM_LOAD_UOPS_RETIRED.L3_MISS has a very low 'Hardware Event Sample Count', but even accounting for that uncertainty, the count is off.
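To make the mismatch concrete, here is a quick arithmetic check of both identities using the Config #1 counting-mode numbers above (just a standalone sketch redoing the sums, nothing VTune-specific):

#include <cstdint>
#include <iostream>

int main() {
    // Config #1 (prefetchers disabled, counting mode) values from the table above.
    const std::uint64_t l1_miss = 2'392'083'810;  // MEM_LOAD_UOPS_RETIRED.L1_MISS
    const std::uint64_t l2_hit  = 1'752'622'830;  // MEM_LOAD_UOPS_RETIRED.L2_HIT
    const std::uint64_t l2_miss =   639'460'980;  // MEM_LOAD_UOPS_RETIRED.L2_MISS
    const std::uint64_t l3_hit  =   431'623'490;  // MEM_LOAD_UOPS_RETIRED.L3_HIT
    const std::uint64_t l3_miss =    51'383'760;  // MEM_LOAD_UOPS_RETIRED.L3_MISS

    // L1 level: (L2_HIT + L2_MISS) / L1_MISS = 1.0 exactly.
    std::cout << "(L2_HIT + L2_MISS) / L1_MISS = "
              << static_cast<double>(l2_hit + l2_miss) / l1_miss << std::endl;

    // L2 level: (L3_HIT + L3_MISS) / L2_MISS = ~0.76, i.e. ~24% of L2 misses unaccounted for.
    std::cout << "(L3_HIT + L3_MISS) / L2_MISS = "
              << static_cast<double>(l3_hit + l3_miss) / l2_miss << std::endl;
    return 0;
}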
The estimates are still only good to about 25% when the "Hardware Event Sample Count" is only 4.
Multiplexing the counters over this many different counter sets adds a level of uncertainty that cannot easily be quantified.
If you restrict the counters to a single set that captures the values that you are trying to compare, the results should be reliable enough for you to decide whether the counts are consistent. You only need three counters for this test:
- MEM_LOAD_UOPS_RETIRED.L2_MISS
- MEM_LOAD_UOPS_RETIRED.L3_HIT
- MEM_LOAD_UOPS_RETIRED.L3_MISS
Limiting the collection to only 3 counters does indeed improve the consistency. It is still not exact, but it is much better than before. I guess I will resort to multiple runs for the various events.
Config #1: Disable HW Prefetcher and Enable counting mode
Hardware Events
Hardware Event Type Hardware Event Count Hardware Event Sample Count Events Per Sample Precise
MEM_LOAD_UOPS_RETIRED.L2_MISS 967,014,729 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_HIT 650,476,577 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_MISS 191,093,546 4 [Unknown]
Config #2 : Disable HW Prefetcher and Enable counting mode [PS Events]
Hardware Events
Hardware Event Type Hardware Event Count Hardware Event Sample Count Events Per Sample Precise
MEM_LOAD_UOPS_RETIRED.L2_MISS_PS 773,652,897 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS 521,977,568 4 [Unknown]
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS 111,549,760 4 [Unknown]
I don't know exactly what tools are available in Windows, but if you want to do arithmetic on counts, it is best to use a tool that is designed to count and not to sample.
On Linux it is easy to use "perf stat" for whole-program counting.
I just ran a few checks to see whether these counters agree for a simple benchmark. I set up the STREAM benchmark with 200M double-precision elements per array and 10 iterations. There are 6 variables loaded in each iteration, plus 1 read in the setup code and 3 reads in the validation code, so for 10 iterations I expect 64 full-array loads of variables of type "double". At 8 Bytes/element, each 64-Byte cache line holds 8 elements, so 64 array reads * 200M elements/array / 8 elements/cacheline = 1,600,000,000 cache line reads expected. Since the arrays are big (about 1.5 GiB each), I expect essentially all of these loads to miss in the L2 and in the L3.
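For reference, a tiny standalone sketch of that expected-count arithmetic (not part of STREAM itself; the names are just for illustration):

#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t elements_per_array = 200'000'000;          // 200M doubles per STREAM array
    const std::uint64_t elements_per_line  = 64 / sizeof(double);  // 8 doubles per 64-Byte cache line
    const std::uint64_t array_reads        = 6 * 10 + 1 + 3;       // kernels + setup + validation = 64

    const std::uint64_t expected_line_reads =
        array_reads * elements_per_array / elements_per_line;

    std::cout << "Expected cache line reads: " << expected_line_reads << std::endl;  // 1600000000
    return 0;
}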
With HW prefetch disabled (and running on one core), I get
$ perf stat -e mem_load_retired.l2_miss -e mem_load_retired.l3_hit -e mem_load_retired.l3_miss ./stream.runtime.COMMON-AVX512.alloc.10x
1,601,323,903 mem_load_retired.l2_miss
8,342,415 mem_load_retired.l3_hit
1,592,966,125 mem_load_retired.l3_miss
The sum of L3 hit and L3 miss divided by L2 misses is .9999904 -- good to 5 digits.
With HW prefetch re-enabled (still on one core), the number of hits and misses decreases by about a factor of 2:
$ perf stat -e mem_load_retired.l2_miss -e mem_load_retired.l3_hit -e mem_load_retired.l3_miss ./stream.runtime.COMMON-AVX512.alloc.10x
779,846,893 mem_load_retired.l2_miss
4,895,328 mem_load_retired.l3_hit
774,940,945 mem_load_retired.l3_miss
Again, the sum of L3 hit and L3 miss is a very close match to L2 miss (.9999863819). Almost exactly 1/2 of the L2 misses and almost exactly 1/2 of the L3 misses "disappear" because the hardware prefetcher is able to fetch the cache lines into the corresponding level of the cache before the load gets there. In this case the HW prefetcher can't do much better because it restarts at the beginning of every 4KiB page, so it can't stay far enough "ahead" of the load stream(s).
If I run on all cores, the memory system gets busier (which increases latency, so the prefetchers are less effective at getting the data into the cache before the load arrives), and the L2 and L3 cache miss counts each increase slightly (by about 3.4%):
806,611,516 mem_load_retired.l2_miss
4,928,217 mem_load_retired.l3_hit
801,367,558 mem_load_retired.l3_miss
With HW prefetch disabled and using all cores, the miss counts are a little bit (~3%) smaller than the expected values, and the sum of L3 hit and miss still matches the L2 misses to better than 4 digits. (The 3% discrepancy may be in part due to the "next page prefetcher" which can't be disabled. It would take some careful testing to try to understand the details.)
1,552,949,519 mem_load_retired.l2_miss
7,549,747 mem_load_retired.l3_hit
1,545,250,092 mem_load_retired.l3_miss
Using smaller array sizes (e.g., STREAM_ARRAY_SIZE = 3,145,728 gives exactly 24.0 MiB/array), we get a much higher rate of L3 hits. In the cases I tested, the L3 hit+miss count was about 3% lower than the L2 miss count, but it would take more detailed work to understand if that is significant.
Hey Sakura,
Hope your issue is resolved. Could you please confirm whether the solutions provided here help, or if there is anything else you need help with?
Thanks
Arun Jose
Hey Sakura,
We are closing this case assuming the solution provided helps. Please feel free to raise a new thread in case of further issues.
Thanks
Arun Jose