Hi everyone,
I want to study how to use the UNC_M_CAS_COUNT.WR and UNC_M_CAS_COUNT.RD events through VTune,
but the results for UNC_M_CAS_COUNT don't match the results of the OFFCORE_RESPONSE events.
I think write combining and non-temporal stores may have some effect on the results.
I have written a very simple program:
(The program listing did not survive the forum formatting; a simplified one-loop version of it is posted later in the thread.)
I compile the program with gcc using the "-O0" option.
Platform:
Intel Xeon E7-4809 v3 (Haswell-EX)
Node 0: three 8 GB DIMMs, populating three channels.
Node 1: one 8 GB DIMM.
VTune 2016
First, I run the program only on node 0 using numactl.
Command line:
amplxe-cl -collect-with runsa -knob event-mode=all -knob event-config=UNC_M_CAS_COUNT.RD,UNC_M_CAS_COUNT.WR,OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_HIT.ANY_RESPONSE,OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_MISS.LOCAL_DRAM,OFFCORE_REQUESTS.DEMAND_RFO -- numactl --physcpubind=0-7 --membind=0 ./test.o
1) There are 33.5M DEMAND_RFO L2 requests: about 16M caused by the kernel and 16M caused by user code.
There are about 16.7M LLC hits, caused by the init loop (in user mode).
2) What happened in this program?
a. The writes in the init loop lead to page faults and a switch to kernel mode.
b. The kernel initializes the memory using memset (at page granularity?).
c. Because of write combining, the memset writes the data into the LLC.
d. Then, back in user mode, the writes in the init loop hit in the LLC.
Is this right?
I want to know what happens in the memset and the init loop.
Write combining or non-temporal stores?
3) If I add one more init-loop in the program:
Thank you.
P.S.
I have disabled the hardware prefetchers in the BIOS.
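For anyone who wants to double-check the BIOS setting from Linux, here is a minimal sketch (my own assumptions: the msr kernel module is loaded, the program runs as root, and MSR 0x1A4 is the per-core prefetcher control register on Haswell; a value of 0xf means all four prefetchers are disabled):
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* Read MSR 0x1A4 (prefetcher control) on cpu0 via the msr driver. */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), 0x1a4) != (ssize_t)sizeof(val)) {
        perror("pread");
        close(fd);
        return 1;
    }
    /* Bits 0-3 disable the L2 HW, L2 adjacent-line, DCU, and DCU IP prefetchers. */
    printf("MSR 0x1a4 on cpu0 = 0x%llx (0xf => all four prefetchers disabled)\n",
           (unsigned long long)val);
    close(fd);
    return 0;
}
The same check should be repeated for each core (/dev/cpu/N/msr), since the prefetcher control is per core.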
There is nothing in your code that would let gcc use non-temporal stores. glibc's memset does include non-temporal branches, in case you were calling it. The possible degree of write combining between your two loops is zero, unless you reverse one loop so that the second loop begins by overwriting the last few cache lines from the previous loop, which may not yet have been flushed or evicted from cache.
You may have a question that needs an expert response, but you have mixed in too many inconsistencies.
Tim P. wrote:
There is nothing in your code that would let gcc use non-temporal stores. glibc's memset does include non-temporal branches, in case you were calling it. The possible degree of write combining between your two loops is zero, unless you reverse one loop so that the second loop begins by overwriting the last few cache lines from the previous loop, which may not yet have been flushed or evicted from cache.
You may have a question that needs an expert response, but you have mixed in too many inconsistencies.
Thank you for your response.
I'm sorry for not making my question simple and clear.
Let's consider the situation where there is only one loop.
Program:
#include <stdlib.h>

#define SIZE (128*1024*1024)   // 1 GB per array

int main(int argc, char* argv[]){
    double* a = (double*)malloc(sizeof(double)*SIZE);
    int i;
    for(i = 0; i < SIZE; i++){
        a[i] = i;
    }
    return 0;
}
I can't understand why there isn't any data read from memory.
I think there should be at least ~17M UNC_M_CAS_COUNT.RD events.
Every 8 iterations touch a new 64-byte cache line, so they should lead to 1 DEMAND_RFO LLC miss.
And because of the write-allocate mechanism, each miss should load one cache line from memory.
I thought maybe the non-temporal store mechanism leads to this phenomenon,
but I have confirmed that non-temporal stores do not cause any L2 requests.
So I think there aren't any non-temporal stores in the loop or in memset (because the loop and memset cause about 34M L2 DEMAND_RFO misses).
So I'm confused: why don't the memset and the loop cause any UNC_M_CAS_COUNT.RD (memory reads)?
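For reference, here is the arithmetic behind the ~17M figure (a quick sketch, assuming 64-byte cache lines and 8-byte doubles):
#include <stdio.h>

#define SIZE (128*1024*1024)   /* doubles, 1 GB total */

int main(void) {
    long bytes = (long)SIZE * sizeof(double);  /* 1 GiB written by the loop      */
    long lines = bytes / 64;                   /* distinct 64-byte lines touched */
    /* One RFO per new line; write-allocate should read each line from memory,
       and the dirty lines should eventually be written back. */
    printf("cache lines touched:      %ld (~%.1fM)\n", lines, lines / 1e6);
    printf("expected DEMAND_RFO:      ~%.1fM\n", lines / 1e6);
    printf("expected CAS_COUNT.RD/WR: ~%.1fM each\n", lines / 1e6);
    return 0;
}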
The first time a page of virtual memory is accessed there is a major page fault that traps into the kernel to create the virtual to physical mapping. The complexity of the operation is extremely high. The O/S has to look up the NUMA policy appropriate for the process, select a page from one of the available free lists, atomically remove that page from the free list, ensure that the page has been cleared of any prior data, set up the page table mappings for the page in the user process's page table hierarchy, etc, before finally returning to the user context.
There are many possible implementations of the page instantiation process, and it would require a detailed analysis to come up with a good model for what cache and memory accesses might be required. For example, one implementation might zero each 4KiB page immediately before returning to the user context, while another implementation might extract pages from free lists of pages that have already been zeroed. The implementation of the page zeroing code might use normal stores, it might use non-temporal stores, or it might use a memory-to-memory (DMA) copy engine. Some implementations will automatically zero a page on its first read access, while other implementations set up a "copy on write" mapping to defer the page instantiation until the first store occurs.
The results above look largely reasonable. The total number of writebacks at the DRAMs matches expectations (104.3% of the expected value).
The low number of memory reads is not surprising. These are new pages being instantiated, so there is no reason to read the prior contents from memory. (I don't know what mechanism the O/S is using to avoid the expected hardware reads, but it probably involves direct manipulation of the address translation mechanisms. The Linux O/S code for this is open source, but understanding what it is doing would be a major exercise.) After the O/S initializes the page, it returns control to the user code, which finds the page in the cache.
Unless you are working on the O/S page instantiation mechanisms, this entire exercise is almost certainly irrelevant. For "real" applications, the behavior of the cache hierarchy and memory subsystem for all accesses *after* the page instantiation are the behaviors of interest.
There are three fairly obvious ways to avoid getting confused by this activity in the analysis of your results:
- Include enough accesses after the data initialization step to ensure that the counts for the data instantiation are a negligible contributor to the overall application performance counts.
  - For the STREAM benchmark, I usually set the code up to run at least 100 iterations if I am going to use whole-program performance monitoring.
- If the counts are extremely repeatable, you may be able to run the code with different numbers of user-mode accesses and take the difference of the counts to remove any startup/shutdown overheads.
  - E.g., for STREAM, I sometimes run a 20-iteration run and a 10-iteration run, then take the difference of the whole-program counts as being representative of the 10 extra iterations.
  - This is not a particularly reliable technique because there is no strong reason to believe that the overhead will have similar counts for each execution of the program.
- The best approach is to measure the counts inline before and after the region of interest (a sketch follows this list).
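For example, a minimal sketch of the inline approach on Linux using the perf_event_open interface (assuming a core event such as LLC misses is enough for the comparison; the uncore IMC and QPI boxes are separate PMUs and are not covered by this sketch):
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Thin wrapper: glibc provides no wrapper for perf_event_open. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size     = sizeof(attr);
    attr.type     = PERF_TYPE_HARDWARE;
    attr.config   = PERF_COUNT_HW_CACHE_MISSES;  /* stand-in for the event of interest */
    attr.disabled = 1;                           /* start stopped; enable around the region */

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* ... allocate and instantiate the arrays here (not counted) ... */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: the post-instantiation access loops ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
    printf("event count for region of interest: %lld\n", count);

    close(fd);
    return 0;
}
The same structure (reset/enable before the region, disable/read after it) applies whatever mechanism is used to program the counters.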
John McCalpin wrote:
The first time a page of virtual memory is accessed there is a major page fault that traps into the kernel to create the virtual to physical mapping. The complexity of the operation is extremely high. The O/S has to look up the NUMA policy appropriate for the process, select a page from one of the available free lists, atomically remove that page from the free list, ensure that the page has been cleared of any prior data, set up the page table mappings for the page in the user process's page table hierarchy, etc, before finally returning to the user context.
There are many possible implementations of the page instantiation process, and it would require a detailed analysis to come up with a good model for what cache and memory accesses might be required. For example, one implementation might zero each 4KiB page immediately before returning to the user context, while another implementation might extract pages from free lists of pages that have already been zeroed. The implementation of the page zeroing code might use normal stores, it might use non-temporal stores, or it might use a memory-to-memory (DMA) copy engine. Some implementations will automatically zero a page on its first read access, while other implementations set up a "copy on write" mapping to defer the page instantiation until the first store occurs.
The results above look largely reasonable. The total number of writebacks at the DRAMs matches expectations (104.3% of the expected value).
The low number of memory reads is not surprising. These are new pages being instantiated, so there is no reason to read the prior contents from memory. (I don't know what mechanism the O/S is using to avoid the expected hardware reads, but it probably involves direct manipulation of the address translation mechanisms. The Linux O/S code for this is open source, but understanding what it is doing would be a major exercise.) After the O/S initializes the page, it returns control to the user code, which finds the page in the cache.
Unless you are working on the O/S page instantiation mechanisms, this entire exercise is almost certainly irrelevant. For "real" applications, the behavior of the cache hierarchy and memory subsystem for all accesses *after* the page instantiation are the behaviors of interest.
There are three fairly obvious ways to avoid getting confused by this activity in the analysis of your results:
- Include enough accesses after the data initialization step to ensure that the counts for the data instantiation are a negligible contributor to the overall application performance counts.
  - For the STREAM benchmark, I usually set the code up to run at least 100 iterations if I am going to use whole-program performance monitoring.
- If the counts are extremely repeatable, you may be able to run the code with different numbers of user-mode accesses and take the difference of the counts to remove any startup/shutdown overheads.
  - E.g., for STREAM, I sometimes run a 20-iteration run and a 10-iteration run, then take the difference of the whole-program counts as being representative of the 10 extra iterations.
  - This is not a particularly reliable technique because there is no strong reason to believe that the overhead will have similar counts for each execution of the program.
- The best approach is to measure the counts inline before and after the region of interest.
Thank you very much for your detailed explanation.
The reason I am studying the page instantiation is:
1. I want to check the memory bandwidth of a benchmark.
I usually use the offcore_response events to calculate the memory bandwidth,
but I find that VTune uses the UNC_M_CAS_COUNT events to calculate the memory bandwidth, so I want to confirm that the offcore_response and uncore IMC results match.
Actually, I'm wondering which metric is the reasonable one to use for calculating the memory bandwidth.
2. I find that the NUMA snoop mechanism has an effect on the uncore IMC events but does not affect the offcore_response events.
And I think maybe the page instantiation has some effect on the results.
For example, let's look at a simple test.
When I run the benchmark on CPU 0 and let it access local memory (node 0):
numactl --physcpubind=0-7 --membind=0 ./perfTest2.o
(VTune hardware event table not reproduced here.)
If we ignore the page initialization process,
the results of offcore_response and UNC_M_CAS_COUNT match:
17M DEMAND_DATA_RD LLC misses + 17M DEMAND_RFO LLC misses lead to 34M UNC_M_CAS_COUNT.RD and 17M UNC_M_CAS_COUNT.WR.
But when I run the benchmark on CPU 0 and let it access remote memory (node 1's memory):
numactl --physcpubind=0-7 --membind=1 ./perfTest2.o
As you can see, the offcore_response results are still the same,
but the UNC_M_CAS_COUNT results are much larger than in the previous case (cpu0 mem0).
The UNC_M_CAS_COUNT.WR result in particular is hard to believe.
We can check the QPI results.
One QPI data flit carries 8 bytes,
so from UNC_Q_RxL_FLITS_G1.DRS_DATA we can see that CPU 0 received only about 17M cache lines (137M / 8) from node 1's memory.
But the UNC_M_CAS_COUNT.RD result shows that the benchmark read about 51M cache lines.
So, which metric should I use to calculate the memory bandwidth:
UNC_M_CAS_COUNT or OFFCORE_RESPONSE?
The results of these two metrics are so different.
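For reference, this is how I am converting each set of counts into bandwidth (a sketch of my own assumptions: 64 bytes per CAS transfer and per cache line, with the elapsed time measured separately):
#include <stdio.h>

int main(void)
{
    /* Example inputs -- replace with measured values. */
    double cas_rd      = 34e6;   /* UNC_M_CAS_COUNT.RD, summed over channels */
    double cas_wr      = 17e6;   /* UNC_M_CAS_COUNT.WR, summed over channels */
    double offcore_rd  = 17e6;   /* DEMAND_DATA_RD LLC misses                */
    double offcore_rfo = 17e6;   /* DEMAND_RFO LLC misses                    */
    double elapsed_seconds = 1.0;

    /* IMC view: every CAS moves one 64-byte burst to/from DRAM. */
    double bw_imc = (cas_rd + cas_wr) * 64.0 / elapsed_seconds;

    /* Core view: demand misses only -- prefetches, writebacks, and any
       directory or page-instantiation traffic are not included. */
    double bw_offcore = (offcore_rd + offcore_rfo) * 64.0 / elapsed_seconds;

    printf("IMC-based bandwidth:    %.2f MB/s\n", bw_imc / 1e6);
    printf("offcore-based estimate: %.2f MB/s\n", bw_offcore / 1e6);
    return 0;
}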
Thank you again.
I think that you are going to need to figure out how to eliminate the effects of the page instantiation from the results before you will be able to make any sense out of the numbers. I recommended two ways to do this in my previous note.
Your local case looks consistent with the previous results:
- DRAM reads correspond to 2 full arrays (plus ~8.7% of another array). I interpret this as:
  - Approximately zero reads for the instantiation loop (as in the previous case).
  - The ~8.7% extra reads may be associated with the page instantiation process.
  - One array for the allocate in the second loop.
  - One array read for the data summation in the third loop.
- DRAM writes correspond to 2 full arrays (plus ~6.3% of another array). I interpret this as:
  - One write for the instantiation in the first loop (as seen in the previous tests).
  - One write for the writebacks in the second loop.
  - Zero writes for the data summation in the third loop.
The remote case is different, but it may be different because the page instantiation is different for remote data, and you have not provided data for that case. That is why I recommend either running the user-space kernels at least 100 times for each data instantiation, or reading the performance counters inline (before and after each region of interest).
The QPI counters in the remote case may or may not be informative because we don't know what is happening in the page instantiation process. I have seen what I believe are lots of bugs in these QPI events on Xeon E5 v3, but I have no way of knowing if these apply to the Xeon E7 v3, which has some significant differences in both the uncore and in the coherence protocol. Again, running the kernels at least 100 times for each data instantiation or reading the performance counters inline would help clarify the accuracy of the QPI counts.
For simple tests like these, adding another loop to repeat loop 2 and loop 3 each 100 or 1000 times should still result in a suitably short-running program. You don't need the full GiB of data to get good results -- anything significantly larger than the 20MiB L3 cache should work. (I typically recommend making each array 4x the size of the total cache used, which would reduce the run-time without significantly changing the results.) Increasing the number of iterations of the post-instantiation data access loops will also increase the number of samples seen by VTune, which should make it possible to see which events are associated with each of the three loops.
Hi,
Thank you very much for your help.
Yes, the local case is reasonable.
You're right, I should eliminate the effects of the instantiation loop.
Now I run the user-space kernels 500 times and change the array size to 80 MB (4x the LLC size).
The program:
#include <stdio.h>
#include <stdlib.h>

#define SIZE (10*1024*1024)   // 80 MB per array, 4x the LLC size

int main(int argc, char* argv[]){
    double* a = (double*)malloc(sizeof(double)*SIZE);
    int i, j;
    double sum = 0;
    for(i = 0; i < SIZE; i++){          // instantiation loop
        a[i] = i;
    }
    for(i = 0; i < 500; i++){
        for(j = 0; j < SIZE; j++){      // 2nd loop: write the whole array
            a[j] = 2*j;
        }
        for(j = 0; j < SIZE; j += 8){   // 3rd loop: read one double per cache line
            sum += a[j];
        }
    }
    printf("sum is %f \n", sum);
    return 0;
}
Here is a summary table (if it doesn't display correctly, see the raw data below):
Event | cpu0 mem0 | cpu0 mem1
OFFCORE_RESPONSE:request=DEMAND_DATA_RD:response=LLC_MISS.ANY_DRAM | 655,319,659 (625M) | 655,319,659 (+0%)
OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_MISS.ANY_DRAM | 655,319,659 (625M) | 655,119,653 (+0%)
UNC_M_CAS_COUNT.RD | 1,353,329,965 (1290M) | 1,334,112,565 (-1.4%)
UNC_M_CAS_COUNT.WR | 688,390,110 (656M) | 1,325,120,659 (+92.5%)
UNC_Q_TxL_FLITS_G1.DRS_DATA[UNIT1] | 25,300,152 | 5,329,582,928 (635M cache lines)
UNC_Q_RxL_FLITS_G1.DRS_DATA[UNIT1] | 15,646,144 | 5,195,936,072 (619M cache lines)
Memory reads:
a. The 2nd loop (writes) leads to 1.25M * 500 = 625M cache-line reads (write-allocate).
b. The 3rd loop (reads) leads to 1.25M * 500 = 625M cache-line reads.
Memory writes:
a. The 2nd loop (writes) leads to 1.25M * 500 = 625M cache-line writes.
b. The memory writes caused by the instantiation loop are negligible in this case, I think.
Total: about 625M memory writes.
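A quick check of this arithmetic against the measured local-case (cpu0 mem0) counts from the table above (a sketch; the small surplus presumably comes from the instantiation loop and other overheads):
#include <stdio.h>

#define SIZE (10*1024*1024)   /* doubles per array */

int main(void) {
    long lines_per_pass = (long)SIZE * sizeof(double) / 64;   /* 1,310,720 (~1.25M) */
    long iters = 500;

    long expected_rd = 2 * lines_per_pass * iters;  /* write-allocates + summation reads */
    long expected_wr = 1 * lines_per_pass * iters;  /* writebacks from the 2nd loop      */

    printf("expected CAS_COUNT.RD: %ld (~1250M)\n", expected_rd);   /* measured: 1,353,329,965 */
    printf("expected CAS_COUNT.WR: %ld (~625M)\n",  expected_wr);   /* measured:   688,390,110 */
    return 0;
}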
1. The remote memory access case causes about 92.5% more memory writes.
The offcore_response counts are still the same.
2. The QPI data doesn't look reasonable to me.
I paste the raw data below.
(Raw data not reproduced here.)
The UNC_M_CAS_COUNT.WR counts are certainly higher than expected. I suspect that this is due to writes to the "directory" (used to filter cache coherence messages). There is not a lot of documentation on Intel's implementation of the directory filters, but in the Xeon E7/E5 v3 Uncore Performance Monitoring Reference Manual, the description of the DIRECTORY_UPDATE event in the Home Agent says that directory updates result in writes to the memory controller. I don't know what sort of directory protocol Intel is using, but one directory update for each RFO seems reasonable, and that would match the observed counts.
Some of the QPI results make sense. Remember that in the remote case socket 0 is transmitting its writebacks to socket 1, so the two transmit values (UNC_Q_TxL_FLITS_G0.DATA[UNIT1] and UNC_Q_TxL_FLITS_G1.DRS_DATA[UNIT1]) both correspond to the expected number of cache lines written back. The value for received data is 1/2 of what I would have expected -- I expected to see both the allocates and the reads here (a total of about 1.31 billion lines), but the observed value corresponds to only half of that. I have seen receive-side QPI counts of 1/2 the expected values on the Xeon E5 v3 parts I have tested, but I don't have enough information to be confident that I understand what is going on here, or that the factor-of-2 discrepancy I have seen on the Xeon E5 v3 parts has any relation to this factor-of-2 discrepancy on the Xeon E7 v3 parts.
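For reference, the flit-to-cache-line conversion used above (a quick sketch; one DRS data flit carries 8 bytes, so 8 flits per 64-byte line, and the 1.31 billion figure is the expected allocates plus reads):
#include <stdio.h>

int main(void) {
    double tx_flits = 5329582928.0;  /* UNC_Q_TxL_FLITS_G1.DRS_DATA[UNIT1], remote case */
    double rx_flits = 5195936072.0;  /* UNC_Q_RxL_FLITS_G1.DRS_DATA[UNIT1], remote case */

    double tx_lines    = tx_flits / 8.0;   /* ~666M lines sent (writebacks to node 1) */
    double rx_lines    = rx_flits / 8.0;   /* ~650M lines received                    */
    double expected_rx = 1310.72e6;        /* allocates + reads: ~1.31 billion lines  */

    printf("TX lines: %.0fM\n", tx_lines / 1e6);
    printf("RX lines: %.0fM, expected ~%.0fM, ratio %.2f\n",
           rx_lines / 1e6, expected_rx / 1e6, expected_rx / rx_lines);
    return 0;
}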
Thank you very much!
I think what you said about DIRECTORY_UPDATE is right.
I used the same program to test the PMU event UNC_H_DIRECTORY_UPDATE.ANY.
You can see from the raw data below that
UNC_H_DIRECTORY_UPDATE.ANY counts about 631M directory updates (memory writes).
This value is almost the same as the number of extra memory writes.
So I think what you said is reasonable.
amplxe-cl -collect-with runsa -knob event-mode=all -knob event-config=UNC_M_CAS_COUNT.RD,UNC_M_CAS_COUNT.WR,UNC_Q_TxL_FLITS_G0.DATA,UNC_Q_TxL_FLITS_G1.DRS_DATA,UNC_Q_RxL_FLITS_G1.DRS_DATA,OFFCORE_RESPONSE:request=DEMAND_DATA_RD:response=LLC_MISS.ANY_DRAM,OFFCORE_RESPONSE:request=DEMAND_RFO:response=LLC_MISS.ANY_DRAM,UNC_H_DIRECTORY_UPDATE.ANY -- numactl --physcpubind=0-7 --membind=1 ./test.o
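A quick check against the numbers in the earlier table (a sketch; the ~631M figure is the approximate UNC_H_DIRECTORY_UPDATE.ANY count from the raw data):
#include <stdio.h>

int main(void) {
    double wr_remote = 1325120659.0;  /* UNC_M_CAS_COUNT.WR, cpu0 mem1 */
    double wr_local  =  688390110.0;  /* UNC_M_CAS_COUNT.WR, cpu0 mem0 */
    double dir_upd   =  631e6;        /* UNC_H_DIRECTORY_UPDATE.ANY (approx.) */

    /* The extra writes in the remote case line up with the directory updates. */
    printf("extra memory writes: %.0fM\n", (wr_remote - wr_local) / 1e6);
    printf("directory updates:   %.0fM\n", dir_upd / 1e6);
    return 0;
}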
