Hi Roman and Fay,
I have a question about memory bandwidth. Using pcm-numa.x in PCM 2.8, I recorded the local and remote memory access counts while running the STREAM benchmark, and found a discrepancy between the peak bandwidth reported by STREAM and the maximum bandwidth derived from the PCM counts. I calculated the maximum bandwidth by taking the largest per-second memory access count in the records and multiplying it by 64 B (one cache line). For example, 100M accesses per second (Local DRAM accesses + Remote DRAM accesses) translates into 6.4 GB/s of throughput. Since this value is much smaller than the STREAM result (say, 24 GB/s), I am wondering whether I am missing something in my PCM-based measurement. Could you tell me whether the LLC-miss-based counts used in PCM include hardware memory prefetch events as well? If they do, do you have any thoughts on how I can interpret the difference?
Thanks in advance for your reply.
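For reference, the conversion described in the question can be written out as a minimal Python sketch. The function name and the 60M/40M local/remote split are hypothetical, chosen only to reproduce the 100M accesses per second example. It assumes each counted access moves one 64-byte cache line and that 1 GB = 10^9 bytes.

```python
CACHE_LINE_BYTES = 64  # assumption: each counted DRAM access moves one 64-byte line


def accesses_to_gb_per_s(local_per_s, remote_per_s, line_bytes=CACHE_LINE_BYTES):
    """Convert per-second local + remote DRAM access counts to GB/s (1 GB = 1e9 bytes)."""
    return (local_per_s + remote_per_s) * line_bytes / 1e9


# Illustrative numbers only: 100M accesses/s in total, split arbitrarily.
pcm_gb_s = accesses_to_gb_per_s(60_000_000, 40_000_000)
stream_gb_s = 24.0  # the STREAM figure quoted in the question

print(f"PCM-derived bandwidth: {pcm_gb_s:.1f} GB/s")
print(f"Fraction of the STREAM result: {pcm_gb_s / stream_gb_s:.0%}")
```

With these numbers the PCM-derived figure covers only about a quarter of what STREAM reports, which is the gap the question is asking about.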
If you have root access on the system, you can disable the hardware prefetchers (as described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processo...). This should improve the accuracy of the counts. I have not had a chance to check whether the bugs in these events on the Sandy Bridge processors have been fixed in Haswell EP. I seem to recall that some of the counts are still wrong even with the hardware prefetchers disabled, but I don't have the details at hand.
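As a rough sketch of what disabling the prefetchers could look like on Linux, the Python below sets the prefetcher-disable bits through the msr driver (`/dev/cpu/<n>/msr`, which needs root and `modprobe msr`). The MSR address 0x1A4 and the bit layout are my reading of the Intel article linked above and should be checked against it for your processor. The function and dictionary names are hypothetical.

```python
import os
import struct

# Assumption (verify against the Intel article for your CPU):
# MSR 0x1A4 controls the prefetchers, and setting a bit DISABLES that prefetcher.
MSR_PREFETCH_CONTROL = 0x1A4
PREFETCH_DISABLE_BITS = {
    "l2_hw": 0,             # L2 hardware prefetcher
    "l2_adjacent_line": 1,  # L2 adjacent cache line prefetcher
    "dcu_hw": 2,            # DCU (L1 data) hardware prefetcher
    "dcu_ip": 3,            # DCU IP prefetcher
}


def with_prefetchers_disabled(current, names=tuple(PREFETCH_DISABLE_BITS)):
    """Return the MSR value with the disable bits for `names` set, other bits unchanged."""
    value = current
    for name in names:
        value |= 1 << PREFETCH_DISABLE_BITS[name]
    return value


def disable_prefetchers(cpu):
    """Read-modify-write the prefetch control MSR on one logical CPU (requires root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDWR)
    try:
        # The msr device uses the file offset as the MSR address; values are 8 bytes.
        (current,) = struct.unpack("<Q", os.pread(fd, 8, MSR_PREFETCH_CONTROL))
        new_value = with_prefetchers_disabled(current)
        os.pwrite(fd, struct.pack("<Q", new_value), MSR_PREFETCH_CONTROL)
    finally:
        os.close(fd)
```

`disable_prefetchers` would need to be called once per logical CPU before the STREAM run. The original MSR values should be saved and restored afterwards, since the setting stays in effect until it is rewritten or the system is reset.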