Hi everyone,
I want to study how to use the UNC_M_CAS_COUNT.WR and UNC_M_CAS_COUNT.RD events through VTune,
but the results for UNC_M_CAS_COUNT don't match the results of the OFFCORE_RESPONSE events.
I think write combining and non-temporal stores may have some effect on the results.
I have written a very simple program:
(The program listing did not survive the forum formatting; a simplified one-loop version of it is posted later in the thread.)
I compile the program with gcc using the "-O0" option.
Platform:
Intel Xeon E7-4809 v3 (Haswell-EX)
Node 0: three 8 GB DIMMs, populating three channels.
Node 1: one 8 GB DIMM.
VTune 2016
First, I run the program only on node 0 using numactl.
Command line:
amplxe-cl -collect-with runsa -knob event-mode=all -knob event-config=UNC_M_CAS_COUNT.RD,UNC_M_CAS_COUNT.WR,OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_HIT.ANY_RESPONSE,OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_MISS.LOCAL_DRAM,OFFCORE_REQUESTS.DEMAND_RFO -- numactl --physcpubind=0-7 --membind=0 ./test.o
1) There are 33.5M DEMAND_RFO L2 requests: about 16M caused by the kernel and 16M caused by user code.
There are about 16.7M LLC hits, caused by the init loop (in user mode).
2) What happened in this program?
a. The writes in the init loop lead to page faults and a switch to kernel mode.
b. The kernel initializes the memory using memset (at page granularity?).
c. Because of write combining, the memset writes the data into the LLC.
d. Then, back in user mode, the writes in the init loop hit in the LLC.
Is this right?
I want to know what happens in the memset and the init loop.
Write combining or non-temporal stores?
3) If I add one more init-loop in the program:
Thank you.
P.S.
I have disabled the hardware prefetchers in the BIOS.
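For anyone who wants to double-check the BIOS setting from Linux, here is a minimal sketch (my own assumptions: the msr kernel module is loaded, the program runs as root, and MSR 0x1A4 is the per-core prefetcher control register on Haswell; a value of 0xf means all four prefetchers are disabled):
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* Read MSR 0x1A4 (prefetcher control) on cpu0 via the msr driver. */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), 0x1a4) != (ssize_t)sizeof(val)) {
        perror("pread");
        close(fd);
        return 1;
    }
    /* Bits 0-3 disable the L2 HW, L2 adjacent-line, DCU, and DCU IP prefetchers. */
    printf("MSR 0x1a4 on cpu0 = 0x%llx (0xf => all four prefetchers disabled)\n",
           (unsigned long long)val);
    close(fd);
    return 0;
}
The same check should be repeated for each core (/dev/cpu/N/msr), since the prefetcher control is per core.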
There is nothing in your code that would let gcc use non-temporal stores. glibc's memset does include non-temporal branches, in case you were calling it. The possible degree of write combining between your two loops is zero, unless you reverse one loop so that the second loop begins by overwriting the last few cache lines from the previous loop, which may not yet have been flushed or evicted from cache.
You may have a question that needs an expert response, but you have mixed in too many inconsistencies.
Tim P. wrote:
There is nothing in your code that would let gcc use non-temporal stores. glibc's memset does include non-temporal branches, in case you were calling it. The possible degree of write combining between your two loops is zero, unless you reverse one loop so that the second loop begins by overwriting the last few cache lines from the previous loop, which may not yet have been flushed or evicted from cache.
You may have a question that needs an expert response, but you have mixed in too many inconsistencies.
Thank you for your response.
I'm sorry for not making my question simple and clear.
Let's consider the situation where there is only one loop.
Program:
#include <stdlib.h>

#define SIZE (128*1024*1024)   // 1 GB per array

int main(int argc, char* argv[]){
    double* a = (double*)malloc(sizeof(double)*SIZE);
    int i;
    for(i = 0; i < SIZE; i++){
        a[i] = i;
    }
    return 0;
}
I can't understand why there isn't any data read from memory.
I think there should be at least ~17M UNC_M_CAS_COUNT.RD events.
Every 8 iterations touch a new 64-byte cache line, so they should lead to 1 DEMAND_RFO LLC miss.
And because of the write-allocate mechanism, each miss should load one cache line from memory.
I thought maybe the non-temporal store mechanism leads to this phenomenon,
but I have confirmed that non-temporal stores do not cause any L2 requests.
So I think there aren't any non-temporal stores in the loop or in memset (because the loop and memset cause about 34M L2 DEMAND_RFO misses).
So I'm confused: why don't the memset and the loop cause any UNC_M_CAS_COUNT.RD (memory reads)?
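For reference, here is the arithmetic behind the ~17M figure (a quick sketch, assuming 64-byte cache lines and 8-byte doubles):
#include <stdio.h>

#define SIZE (128*1024*1024)   /* doubles, 1 GB total */

int main(void) {
    long bytes = (long)SIZE * sizeof(double);  /* 1 GiB written by the loop      */
    long lines = bytes / 64;                   /* distinct 64-byte lines touched */
    /* One RFO per new line; write-allocate should read each line from memory,
       and the dirty lines should eventually be written back. */
    printf("cache lines touched:      %ld (~%.1fM)\n", lines, lines / 1e6);
    printf("expected DEMAND_RFO:      ~%.1fM\n", lines / 1e6);
    printf("expected CAS_COUNT.RD/WR: ~%.1fM each\n", lines / 1e6);
    return 0;
}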
The first time a page of virtual memory is accessed there is a major page fault that traps into the kernel to create the virtual to physical mapping. The complexity of the operation is extremely high. The O/S has to look up the NUMA policy appropriate for the process, select a page from one of the available free lists, atomically remove that page from the free list, ensure that the page has been cleared of any prior data, set up the page table mappings for the page in the user process's page table hierarchy, etc, before finally returning to the user context.
There are many possible implementations of the page instantiation process, and it would require a detailed analysis to come up with a good model for what cache and memory accesses might be required. For example, one implementation might zero each 4KiB page immediately before returning to the user context, while another implementation might extract pages from free lists of pages that have already been zeroed. The implementation of the page zeroing code might use normal stores, it might use non-temporal stores, or it might use a memory-to-memory (DMA) copy engine. Some implementations will automatically zero a page on its first read access, while other implementations set up a "copy on write" mapping to defer the page instantiation until the first store occurs.
The results above look largely reasonable. The total number of writebacks at the DRAMs matches expectations (104.3% of the expected value).
The low number of memory reads is not surprising. These are new pages being instantiated, so there is no reason to read the prior contents from memory. (I don't know what mechanism the O/S is using to avoid the expected hardware reads, but it probably involves direct manipulation of the address translation mechanisms. The Linux O/S code for this is open source, but understanding what it is doing would be a major exercise.) After the O/S initializes the page, it returns control to the user code, which finds the page in the cache.
Unless you are working on the O/S page instantiation mechanisms, this entire exercise is almost certainly irrelevant. For "real" applications, the behavior of the cache hierarchy and memory subsystem for all accesses *after* the page instantiation are the behaviors of interest.
There are three fairly obvious ways to avoid getting confused by this activity in the analysis of your results:
- Include enough accesses after the data initialization step to ensure that the counts for the data instantiation are a negligible contributor to the overall application performance counts.
  - For the STREAM benchmark, I usually set the code up to run at least 100 iterations if I am going to use whole-program performance monitoring.
- If the counts are extremely repeatable, you may be able to run the code with different numbers of user-mode accesses and take the difference of the counts to remove any startup/shutdown overheads.
  - E.g., for STREAM, I sometimes run a 20-iteration run and a 10-iteration run, then take the difference of the whole-program counts as being representative of the 10 extra iterations.
  - This is not a particularly reliable technique because there is no strong reason to believe that the overhead will have similar counts for each execution of the program.
- The best approach is to measure the counts inline before and after the region of interest (a sketch follows this list).
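For example, a minimal sketch of the inline approach on Linux using the perf_event_open interface (assuming a core event such as LLC misses is enough for the comparison; the uncore IMC and QPI boxes are separate PMUs and are not covered by this sketch):
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Thin wrapper: glibc provides no wrapper for perf_event_open. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size     = sizeof(attr);
    attr.type     = PERF_TYPE_HARDWARE;
    attr.config   = PERF_COUNT_HW_CACHE_MISSES;  /* stand-in for the event of interest */
    attr.disabled = 1;                           /* start stopped; enable around the region */

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* ... allocate and instantiate the arrays here (not counted) ... */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: the post-instantiation access loops ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
    printf("event count for region of interest: %lld\n", count);

    close(fd);
    return 0;
}
The same structure (reset/enable before the region, disable/read after it) applies whatever mechanism is used to program the counters.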
John McCalpin wrote:
The first time a page of virtual memory is accessed there is a major page fault that traps into the kernel to create the virtual to physical mapping. The complexity of the operation is extremely high. The O/S has to look up the NUMA policy appropriate for the process, select a page from one of the available free lists, atomically remove that page from the free list, ensure that the page has been cleared of any prior data, set up the page table mappings for the page in the user process's page table hierarchy, etc, before finally returning to the user context.
There are many possible implementations of the page instantiation process, and it would require a detailed analysis to come up with a good model for what cache and memory accesses might be required. For example, one implementation might zero each 4KiB page immediately before returning to the user context, while another implementation might extract pages from free lists of pages that have already been zeroed. The implementation of the page zeroing code might use normal stores, it might use non-temporal stores, or it might use a memory-to-memory (DMA) copy engine. Some implementations will automatically zero a page on its first read access, while other implementations set up a "copy on write" mapping to defer the page instantiation until the first store occurs.
The results above look largely reasonable. The total number of writebacks at the DRAMs matches expectations (104.3% of the expected value).
The low number of memory reads is not surprising. These are new pages being instantiated, so there is no reason to read the prior contents from memory. (I don't know what mechanism the O/S is using to avoid the expected hardware reads, but it probably involves direct manipulation of the address translation mechanisms. The Linux O/S code for this is open source, but understanding what it is doing would be a major exercise.) After the O/S initializes the page, it returns control to the user code, which finds the page in the cache.
Unless you are working on the O/S page instantiation mechanisms, this entire exercise is almost certainly irrelevant. For "real" applications, the behavior of the cache hierarchy and memory subsystem for all accesses *after* the page instantiation are the behaviors of interest.
There are three fairly obvious ways to avoid getting confused by this activity in the analysis of your results:
- Include enough accesses after the data initialization step to ensure that the counts for the data instantiation are a negligible contributor to the overall application performance counts.
  - For the STREAM benchmark, I usually set the code up to run at least 100 iterations if I am going to use whole-program performance monitoring.
- If the counts are extremely repeatable, you may be able to run the code with different numbers of user-mode accesses and take the difference of the counts to remove any startup/shutdown overheads.
  - E.g., for STREAM, I sometimes run a 20-iteration run and a 10-iteration run, then take the difference of the whole-program counts as being representative of the 10 extra iterations.
  - This is not a particularly reliable technique because there is no strong reason to believe that the overhead will have similar counts for each execution of the program.
- The best approach is to measure the counts inline before and after the region of interest.
Thank you very much for your detailed explanation.
The reason I am studying the page instantiation is:
1. I want to check the memory bandwidth of a benchmark.
I usually use the offcore_response events to calculate the memory bandwidth,
but I find that VTune uses the UNC_M_CAS_COUNT events to calculate the memory bandwidth, so I want to confirm that the offcore_response and uncore IMC results match.
Actually, I'm wondering which metric is the reasonable one to use for calculating the memory bandwidth.
2. I find that the NUMA snoop mechanism has an effect on the uncore IMC events but does not affect the offcore_response events.
And I think maybe the page instantiation has some effect on the results.
For example, let's look at a simple test.
When I run the benchmark on CPU 0 and let it access local memory (node 0):
numactl --physcpubind=0-7 --membind=0 ./perfTest2.o
(VTune hardware event table not reproduced here.)
If we ignore the page initialization process,
the results of offcore_response and UNC_M_CAS_COUNT match:
17M DEMAND_DATA_RD LLC misses + 17M DEMAND_RFO LLC misses lead to 34M UNC_M_CAS_COUNT.RD and 17M UNC_M_CAS_COUNT.WR.
But when I run the benchmark on CPU 0 and let it access remote memory (node 1's memory):
numactl --physcpubind=0-7 --membind=1 ./perfTest2.o
As you can see, the offcore_response results are still the same,
but the UNC_M_CAS_COUNT results are much larger than in the previous case (cpu0 mem0).
The UNC_M_CAS_COUNT.WR result in particular is hard to believe.
We can check the QPI results.
One QPI data flit carries 8 bytes,
so from UNC_Q_RxL_FLITS_G1.DRS_DATA we can see that CPU 0 received only about 17M cache lines (137M / 8) from node 1's memory.
But the UNC_M_CAS_COUNT.RD result shows that the benchmark read about 51M cache lines.
So, which metric should I use to calculate the memory bandwidth:
UNC_M_CAS_COUNT or OFFCORE_RESPONSE?
The results of these two metrics are so different.
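For reference, this is how I am converting each set of counts into bandwidth (a sketch of my own assumptions: 64 bytes per CAS transfer and per cache line, with the elapsed time measured separately):
#include <stdio.h>

int main(void)
{
    /* Example inputs -- replace with measured values. */
    double cas_rd      = 34e6;   /* UNC_M_CAS_COUNT.RD, summed over channels */
    double cas_wr      = 17e6;   /* UNC_M_CAS_COUNT.WR, summed over channels */
    double offcore_rd  = 17e6;   /* DEMAND_DATA_RD LLC misses                */
    double offcore_rfo = 17e6;   /* DEMAND_RFO LLC misses                    */
    double elapsed_seconds = 1.0;

    /* IMC view: every CAS moves one 64-byte burst to/from DRAM. */
    double bw_imc = (cas_rd + cas_wr) * 64.0 / elapsed_seconds;

    /* Core view: demand misses only -- prefetches, writebacks, and any
       directory or page-instantiation traffic are not included. */
    double bw_offcore = (offcore_rd + offcore_rfo) * 64.0 / elapsed_seconds;

    printf("IMC-based bandwidth:    %.2f MB/s\n", bw_imc / 1e6);
    printf("offcore-based estimate: %.2f MB/s\n", bw_offcore / 1e6);
    return 0;
}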
Thank you again.
I think that you are going to need to figure out how to eliminate the effects of the page instantiation from the results before you will be able to make any sense out of the numbers. I recommended two ways to do this in my previous note.
Your local case looks consistent with the previous results:
- DRAM reads correspond to 2 full arrays (plus ~8.7% of another array). I interpret this as:
  - Approximately zero reads for the instantiation loop (as in the previous case).
  - The ~8.7% extra reads may be associated with the page instantiation process.
  - One array for the allocate in the second loop.
  - One array read for the data summation in the third loop.
- DRAM writes correspond to 2 full arrays (plus ~6.3% of another array). I interpret this as:
  - One write for the instantiation in the first loop (as seen in the previous tests).
  - One write for the writebacks in the second loop.
  - Zero writes for the data summation in the third loop.
The remote case is different, but it may be different because the page instantiation is different for remote data, and you have not provided data for that case. That is why I recommend either running the user-space kernels at least 100 times for each data instantiation, or reading the performance counters inline (before and after each region of interest).
The QPI counters in the remote case may or may not be informative because we don't know what is happening in the page instantiation process. I have seen what I believe are lots of bugs in these QPI events on Xeon E5 v3, but I have no way of knowing if these apply to the Xeon E7 v3, which has some significant differences in both the uncore and in the coherence protocol. Again, running the kernels at least 100 times for each data instantiation or reading the performance counters inline would help clarify the accuracy of the QPI counts.
For simple tests like these, adding another loop to repeat loop 2 and loop 3 each 100 or 1000 times should still result in a suitably short-running program. You don't need the full GiB of data to get good results -- anything significantly larger than the 20MiB L3 cache should work. (I typically recommend making each array 4x the size of the total cache used, which would reduce the run-time without significantly changing the results.) Increasing the number of iterations of the post-instantiation data access loops will also increase the number of samples seen by VTune, which should make it possible to see which events are associated with each of the three loops.
Hi,
Thank you very much for your help.
Yes, the local case is reasonable.
You're right, I should eliminate the effects of the instantiation loop.
Now I run the user-space kernels 500 times and change the array size to 80 MB (4x the LLC size).
The program:
#include <stdio.h>
#include <stdlib.h>

#define SIZE (10*1024*1024)   // 80 MB per array, 4x the LLC size

int main(int argc, char* argv[]){
    double* a = (double*)malloc(sizeof(double)*SIZE);
    int i, j;
    double sum = 0;
    for(i = 0; i < SIZE; i++){          // instantiation loop
        a[i] = i;
    }
    for(i = 0; i < 500; i++){
        for(j = 0; j < SIZE; j++){      // 2nd loop: write the whole array
            a[j] = 2*j;
        }
        for(j = 0; j < SIZE; j += 8){   // 3rd loop: read one double per cache line
            sum += a[j];
        }
    }
    printf("sum is %f \n", sum);
    return 0;
}
Here is a summary table (if it doesn't display correctly, see the raw data below):
Event | cpu0 mem0 | cpu0 mem1
OFFCORE_RESPONSE:request=DEMAND_DATA_RD:response=LLC_MISS.ANY_DRAM | 655,319,659 (625M) | 655,319,659 (+0%)
OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_MISS.ANY_DRAM | 655,319,659 (625M) | 655,119,653 (+0%)
UNC_M_CAS_COUNT.RD | 1,353,329,965 (1290M) | 1,334,112,565 (-1.4%)
UNC_M_CAS_COUNT.WR | 688,390,110 (656M) | 1,325,120,659 (+92.5%)
UNC_Q_TxL_FLITS_G1.DRS_DATA[UNIT1] | 25,300,152 | 5,329,582,928 (635M cache lines)
UNC_Q_RxL_FLITS_G1.DRS_DATA[UNIT1] | 15,646,144 | 5,195,936,072 (619M cache lines)
Memory reads:
a. The 2nd loop (writes) leads to 1.25M * 500 = 625M cache-line reads (write-allocate).
b. The 3rd loop (reads) leads to 1.25M * 500 = 625M cache-line reads.
Memory writes:
a. The 2nd loop (writes) leads to 1.25M * 500 = 625M cache-line writes.
b. The memory writes caused by the instantiation loop are negligible in this case, I think.
Total: about 625M memory writes.
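A quick check of this arithmetic against the measured local-case (cpu0 mem0) counts from the table above (a sketch; the small surplus presumably comes from the instantiation loop and other overheads):
#include <stdio.h>

#define SIZE (10*1024*1024)   /* doubles per array */

int main(void) {
    long lines_per_pass = (long)SIZE * sizeof(double) / 64;   /* 1,310,720 (~1.25M) */
    long iters = 500;

    long expected_rd = 2 * lines_per_pass * iters;  /* write-allocates + summation reads */
    long expected_wr = 1 * lines_per_pass * iters;  /* writebacks from the 2nd loop      */

    printf("expected CAS_COUNT.RD: %ld (~1250M)\n", expected_rd);   /* measured: 1,353,329,965 */
    printf("expected CAS_COUNT.WR: %ld (~625M)\n",  expected_wr);   /* measured:   688,390,110 */
    return 0;
}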
1. The remote memory access case causes about 92.5% more memory writes.
The offcore_response counts are still the same.
2. The QPI data doesn't look reasonable to me.
I paste the raw data below.
(Raw data not reproduced here.)
The UNC_M_CAS_COUNT.WR counts are certainly higher than expected. I suspect that this is due to writes to the "directory" (used to filter cache coherence messages). There is not a lot of documentation on Intel's implementation of the directory filters, but in the Xeon E7/E5 v3 Uncore Performance Monitoring Reference Manual, the description of the DIRECTORY_UPDATE event in the Home Agent says that directory updates result in writes to the memory controller. I don't know what sort of directory protocol Intel is using, but one directory update for each RFO seems reasonable, and that would match the observed counts.
Some of the QPI results make sense. Remember that in the remote case socket 0 is transmitting its writebacks to socket 1, so the two transmit values (UNC_Q_TxL_FLITS_G0.DATA[UNIT1] and UNC_Q_TxL_FLITS_G1.DRS_DATA[UNIT1]) both correspond to the expected number of cache lines written back. The value for received data is 1/2 of what I would have expected -- I expected to see both the allocates and the reads here (a total of about 1.31 billion lines), but the observed value corresponds to only half of that. I have seen receive-side QPI counts of 1/2 the expected values on the Xeon E5 v3 parts I have tested, but I don't have enough information to be confident that I understand what is going on here, or that the factor-of-2 discrepancy I have seen on the Xeon E5 v3 parts has any relation to this factor-of-2 discrepancy on the Xeon E7 v3 parts.
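For reference, the flit-to-cache-line conversion used above (a quick sketch; one DRS data flit carries 8 bytes, so 8 flits per 64-byte line, and the 1.31 billion figure is the expected allocates plus reads):
#include <stdio.h>

int main(void) {
    double tx_flits = 5329582928.0;  /* UNC_Q_TxL_FLITS_G1.DRS_DATA[UNIT1], remote case */
    double rx_flits = 5195936072.0;  /* UNC_Q_RxL_FLITS_G1.DRS_DATA[UNIT1], remote case */

    double tx_lines    = tx_flits / 8.0;   /* ~666M lines sent (writebacks to node 1) */
    double rx_lines    = rx_flits / 8.0;   /* ~650M lines received                    */
    double expected_rx = 1310.72e6;        /* allocates + reads: ~1.31 billion lines  */

    printf("TX lines: %.0fM\n", tx_lines / 1e6);
    printf("RX lines: %.0fM, expected ~%.0fM, ratio %.2f\n",
           rx_lines / 1e6, expected_rx / 1e6, expected_rx / rx_lines);
    return 0;
}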
Thank you very much!
I think what you said about DIRECTORY_UPDATE is right.
I used the same program to test the PMU event UNC_H_DIRECTORY_UPDATE.ANY.
You can see from the raw data below that
UNC_H_DIRECTORY_UPDATE.ANY counts about 631M directory updates (memory writes).
This value is almost the same as the number of extra memory writes.
So I think what you said is reasonable.
amplxe-cl -collect-with runsa -knob event-mode=all -knob event-config=UNC_M_CAS_COUNT.RD,UNC_M_CAS_COUNT.WR,UNC_Q_TxL_FLITS_G0.DATA,UNC_Q_TxL_FLITS_G1.DRS_DATA,UNC_Q_RxL_FLITS_G1.DRS_DATA,OFFCORE_RESPONSE:request=DEMAND_DATA_RD:response=LLC_MISS.ANY_DRAM,OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_MISS.ANY_DRAM,UNC_H_DIRECTORY_UPDATE.ANY -- numactl --physcpubind=0-7 --membind=1 ./test.o
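A quick check against the numbers in the earlier table (a sketch; the ~631M figure is the approximate UNC_H_DIRECTORY_UPDATE.ANY count from the raw data):
#include <stdio.h>

int main(void) {
    double wr_remote = 1325120659.0;  /* UNC_M_CAS_COUNT.WR, cpu0 mem1 */
    double wr_local  =  688390110.0;  /* UNC_M_CAS_COUNT.WR, cpu0 mem0 */
    double dir_upd   =  631e6;        /* UNC_H_DIRECTORY_UPDATE.ANY (approx.) */

    /* The extra writes in the remote case line up with the directory updates. */
    printf("extra memory writes: %.0fM\n", (wr_remote - wr_local) / 1e6);
    printf("directory updates:   %.0fM\n", dir_upd / 1e6);
    return 0;
}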
