Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Sandy/Ivy Bridge memory traffic

Max_R_
Beginner

Hello everybody.

Recently, I have run into the problem of measuring the traffic between the last-level cache and main memory in my project. Basically, I have to measure the number of cache lines transferred between the LLC and memory for _one_ core.

As far as I understand, on the Core 2 Duo architecture this problem can be solved simply by using the L2_LINES_IN:SELF (traffic from memory to the L2 cache) and L2_M_LINES_OUT:SELF (writebacks from L2 to memory) events. However, the Sandy/Ivy Bridge microarchitectures are quite different from Core 2, because the LLC is now the L3 cache, and I haven't found any similar events such as L3_LINES_IN:SELF.

Is there any way to measure the memory traffic properly on these architectures?

P.S. I am using Ubuntu Linux 11.10 (kernel 3.0.0.12) and libpfm4.

TimP
Honored Contributor III
According to http://perfmon2.sourceforge.net/, the uncore events were first supported in libpfm 4.3, so you can't expect to do this with earlier versions. I don't know how you'd go about reading the docs specifically for that version of libpfm. There's a little more published about how it's done with VTune. If you've read any of the docs, you may have seen that the uncore events can't be associated with a specific core.
Max_R_
Beginner
Hello, Tim. Thanks for your reply. I have been studying the performance event references, the libpfm/VTune documentation and a couple of Intel manuals, but I still feel a bit lost.

As far as I understand, if I want to measure all the memory traffic on SNB (for all cores, both reads and writes), I have to take the following into account:

Traffic from memory to LLC, bytes = 64 * (demand requests + L1D prefetcher's requests + L2 prefetcher's requests)
Traffic from LLC to memory, bytes = 64 * writebacks from LLC + 8 * non-temporal stores

where 64 is the cache line length in bytes.

Thus, for counting the number of cache lines read I am going to use the following events:

OFFCORE_RESPONSE_0.DMND_DATA_RD.LLC_MISS_LOCAL.SNP_MISS - demand requests + L1D prefetch requests
OFFCORE_RESPONSE_0.PF_DATA_RD.LLC_MISS_LOCAL.SNP_MISS - L2 prefetch requests

And for writes:

OFFCORE_RESPONSE_0.WB.ANY_RESPONSE - writebacks
OFFCORE_RESPONSE_0.STRM_ST.ANY_RESPONSE - streaming stores

Am I doing this right? Please confirm or correct me.
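For reference, a minimal sketch of requesting one of these events through libpfm4 and perf_event_open and reading back the raw count (the exact event-name spelling accepted by libpfm may differ slightly from the Intel documentation, and error handling is abbreviated):

/*
 * Minimal sketch (not a verified recipe): encode one of the offcore response
 * events above with libpfm4, open it with the perf_event syscall, and read
 * the raw count. Assumes libpfm >= 4.3. Build with: gcc sketch.c -lpfm
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <perfmon/pfmlib_perf_event.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    pfm_perf_encode_arg_t arg;
    uint64_t count = 0;
    int fd, ret;

    if (pfm_initialize() != PFM_SUCCESS)
        return 1;

    memset(&attr, 0, sizeof(attr));
    memset(&arg, 0, sizeof(arg));
    arg.attr = &attr;
    arg.size = sizeof(arg);

    /* demand + L1D prefetch reads that missed the LLC (event name from the post above) */
    ret = pfm_get_os_event_encoding("OFFCORE_RESPONSE_0:DMND_DATA_RD:LLC_MISS_LOCAL:SNP_MISS",
                                    PFM_PLM3, PFM_OS_PERF_EVENT, &arg);
    if (ret != PFM_SUCCESS) {
        fprintf(stderr, "cannot encode event: %s\n", pfm_strerror(ret));
        return 1;
    }

    fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the code to be measured here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("cache lines read: %llu (~%llu bytes)\n",
               (unsigned long long)count, (unsigned long long)count * 64);
    close(fd);
    return 0;
}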
McCalpinJohn
Honored Contributor III
(1) Since you are trying to measure the traffic associated with a single core, the offcore response counters you are using are the only possible approach. I don't think that they can provide all the data that you need, but I might be wrong....

(2) The best reference I have seen from inside Intel is http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf -- this was written for Nehalem, but the capability of the offcore response counters on Sandy Bridge is mostly a superset of the capability on Nehalem, so this should be helpful. I am not sure how many of the offcore response selection bits actually work. If you look at the predefined events in the Intel Arch SW Developer's Manual Volume 3 (document 325384, rev 042), Tables 19-4 and 19-5 list some specific offcore response events that should work. I have tried lots of other combinations that I thought should work, but have gotten back a lot of zeros. Either I don't understand how to build the right bit combinations, or some of them don't work. None of the predefined events include either Writebacks or Streaming Stores. Note that the "Writeback" selection in the offcore response event might be referring to L2 writebacks to L3, not L3 writebacks to memory. I can't tell for sure because I have not been able to get it to work.

(3) Streaming stores will usually (almost always) be collected into 64 Byte blocks, so your count of traffic to memory should probably be 64 Bytes per offcore transaction (assuming that you can get the event to work). The Nehalem performance analysis guide (link above) mentions (top of page 37) that streaming stores to local memory are counted with a different mask than one might expect.
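Regarding (2): when programming these events by hand through the perf_event interface, the request/response bit combination does not go into the normal event-select field. The event select is 0xB7 with umask 0x01 for OFFCORE_RESPONSE_0, and the selection bits are passed in the separate config1 field, which the kernel writes into the OFFCORE_RSP_0 MSR. A rough sketch follows; the bit pattern is left as a placeholder to be filled in from SDM Tables 19-4/19-5, and, as discussed elsewhere in this thread, older kernels may silently count zero for these events:

/*
 * Rough sketch (not from the original posts): request OFFCORE_RESPONSE_0 as a
 * raw perf event. The request/response selection bits go in attr.config1,
 * which the kernel loads into the OFFCORE_RSP_0 MSR.
 */
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int open_offcore_rsp0(uint64_t rsp_bits /* placeholder: fill in from the SDM tables */)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_RAW;
    attr.config         = 0x01B7;     /* umask 0x01, event 0xB7 = OFFCORE_RESPONSE_0 */
    attr.config1        = rsp_bits;   /* request/response bits -> MSR_OFFCORE_RSP_0 */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);  /* this process, any CPU */
}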
Max_R_
Beginner
Hi jdmccalpin, thanks for your reply.

1. Not quite, actually: now I am trying to measure the traffic generated by all the cores, because measuring it for one core only does not seem to be really possible.

2. I have the same issue with those offcore_response.* events: I always get zero, no matter what request/response combination I choose. I am using libpfm 4.3 as an interface to the counters; the OS is Ubuntu Linux 11.10 (kernel 3.0). Somewhere on the web I found a note saying that for these counters to work properly the kernel version must be >= 3.3. I will try to update, see what happens, and post the results here if I achieve something.

By the way, what setup are you using for your measurements (OS, kernel, etc.)?
McCalpinJohn
Honored Contributor III
Hi Max,

If you are looking for the memory traffic generated by all the cores, then there is support on some platforms -- specifically the Xeon E5-2600 series. (Probably the Xeon E5-4xxx too, but unfortunately not the Xeon E3 processors or the corresponding Core i7's.) Intel has written up an extensive guide called "Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide", document 327043. I am using revision 001 from March 2012.

The core performance counters cannot measure memory traffic, but there are performance counters in several different "boxes" in the uncore (described in the document). Lots of the memory traffic can be obtained (somewhat indirectly) from the L3 coherence boxes, but I usually just read directly from the memory controller performance counters.

(1) Some of these counters are accessed via MSRs -- you can read/write these (as root) using "rdmsr.c" and "wrmsr.c" from the "msr-tools-1.2" package. I usually program the counters in a shell script outside my program, and if I need to read the counters inside the program I just copy a few lines of code from "rdmsr.c" to open one of the MSR device driver files on each chip (/dev/cpu/*/msr) and read the corresponding MSR using "pread()". (A minimal sketch of that read is at the end of this post.)

(2) Other counters are accessed via PCI configuration space. These can be accessed (as root) using "setpci" and can be read using "lspci". I have not tried to make any of these work inline yet, but in principle it is possible. Right now I am only accessing these counters in a shell script before and after my code runs.

Both of these mechanisms are relatively independent of the kernel -- I do pretty much the same thing on CentOS 5 systems (2.6.18 kernels) and RHEL 6 systems (2.6.32 kernels), but it should also work on newer kernels. It only required a little bit of tweaking to make this work on Xeon Phi (aka "MIC", "KNC", etc.): the MSR device driver files are named differently, so you need to change the file name that is opened -- no big deal.

--- Caveat 1 ---
Some kernels hide the bits of the PCI configuration space that correspond to the processor configuration. If you run "lspci" you should see entries like:
...
3f:13.1 Performance counters: Intel Corporation Sandy Bridge Ring to PCI Express Performance Monitor (rev 07)
...
If you don't have lots of such entries (most include the string "System peripheral: Intel Corporation Sandy Bridge"), then you may still be able to access the PCI configuration space, but you will have to do it the hard way: reading and writing /dev/mem at the physical address offsets corresponding to the PCI configuration space bits that you need. This is a really unsafe way of working with the system, so it is only recommended if you have no other choice and you are willing to do lots of checking to make sure you don't ever write the wrong bits.

--- Caveat 2 ---
On some systems pieces of the PCI config space are hidden by the BIOS. In these cases even the /dev/mem hack does not work and you most likely need an updated BIOS. The only place I have run across this problem is the PCI config space area for the QPI link layer counters.
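As mentioned in (1), the inline MSR read amounts to only a few lines. A minimal sketch along the lines of rdmsr.c (requires root and the msr kernel module loaded; the MSR number is whichever counter register was programmed beforehand, e.g. from a shell script):

/*
 * Sketch of reading one uncore counter MSR through the msr device driver,
 * in the spirit of rdmsr.c from msr-tools.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* read one 64-bit MSR on a given logical CPU; returns 0 on success */
int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    int fd;

    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* the MSR number is simply the offset into the device file */
    if (pread(fd, value, sizeof(*value), msr) != sizeof(*value)) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}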
Max_R_
Beginner
Hello jdmccalpin, thanks for your reply. I have just updated the kernel and it didn't help: offcore_response.* still returns zero all the time. Apparently you are right and the kernel has nothing to do with those counters. lspci does not output anything resembling "Performance Monitor", so I guess I have to keep looking for a solution.
Patrick_F_Intel1
Employee
Hello Max R.,

On Sandy Bridge you can use the uncore events:

UNC_ARB_TRK_REQUESTS.WRITES # counts RFOs (reads for ownership) and non-temporal stores; event 0x81, umask 0x20, uncore unit = ARB
UNC_ARB_TRK_REQUESTS.EVICTIONS # counts writebacks; event 0x81, umask 0x80, uncore unit = ARB
UNC_CBO_CACHE_LOOKUP.ANY_I # counts reads, RFOs and non-temporal stores; event 0x34, umask 0x88, uncore unit = CBOX

These count full cache line transfers (so the number of bytes moved is 64 * event count). There is one CBOX unit per core, so you can get the memory reads per core. There is only one ARB unit per processor, so you don't get the writebacks per core, just a total for the processor.

The formula for the total memory bandwidth due to the cores is:

64 * (UNC_ARB_TRK_REQUESTS.EVICTIONS + UNC_CBO_CACHE_LOOKUP.ANY_I) / elapsed_time

Pat
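A rough sketch of counting the CBo lookup event above through the Linux perf_event interface is below (usually needs to be run as root). The assumptions here are that the kernel is new enough to expose the client uncore PMUs in sysfs, that the PMU name is uncore_cbox_0, and that the config layout is event | (umask << 8) as advertised in that PMU's format directory; check /sys/bus/event_source/devices/ on the actual machine:

/*
 * Rough sketch: count UNC_CBO_CACHE_LOOKUP.ANY_I (event 0x34, umask 0x88)
 * on CBo 0 via perf_event. Uncore events are system-wide, so pid = -1 and a
 * specific CPU are used.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int read_pmu_type(const char *pmu)   /* e.g. "uncore_cbox_0" */
{
    char path[128];
    int type = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", pmu);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}

int main(void)
{
    struct perf_event_attr attr;
    uint64_t lookups = 0;
    int type, fd;

    type = read_pmu_type("uncore_cbox_0");   /* assumed sysfs name */
    if (type < 0) {
        fprintf(stderr, "uncore_cbox_0 PMU not found in sysfs\n");
        return 1;
    }

    memset(&attr, 0, sizeof(attr));
    attr.size     = sizeof(attr);
    attr.type     = type;
    attr.config   = 0x34 | (0x88 << 8);      /* UNC_CBO_CACHE_LOOKUP.ANY_I */
    attr.disabled = 1;

    /* system-wide on CPU 0; uncore events cannot be tied to a single task */
    fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload to be measured ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &lookups, sizeof(lookups)) == sizeof(lookups))
        printf("LLC lookups (I state): %llu (~%llu bytes of reads)\n",
               (unsigned long long)lookups, (unsigned long long)lookups * 64);
    close(fd);
    return 0;
}

The ARB events should follow the same pattern with the corresponding uncore PMU type and the event/umask values given above (again, the sysfs names are assumptions; use whatever is actually present on the machine).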
Max_R_
Beginner
Hello Pat,

1. Thanks a lot for your response. Unfortunately, I don't have regular access to the Sandy Bridge machine, but recently I tried to use these counters and was not able to access them through libpfm 4.3. I tried it on Ubuntu Linux 11.10 (kernel 3.0) and I think that this kernel version doesn't support them. Do you have any idea starting from which kernel version the SNB uncore events are supported?

2. By the way, if somebody is interested in measuring the memory traffic on IVB, please read this. The read traffic can be measured using the following offcore events:

ivb::OFFCORE_RESPONSE_0:ANY_DATA:LLC_MISS_LOCAL - counts the number of cache lines brought in from RAM because of demand and prefetcher requests.
ivb::OFFCORE_RESPONSE_0:ANY_RFO:LLC_MISS_LOCAL - counts the number of cache lines brought in from RAM because of RFOs.

So the total read traffic is (ivb::OFFCORE_RESPONSE_0:ANY_DATA:LLC_MISS_LOCAL + ivb::OFFCORE_RESPONSE_0:ANY_RFO:LLC_MISS_LOCAL) * 64.

Important: if your kernel version is < 3.5, these counters will return zeros all the time, so if you want to use them you will have to upgrade the kernel to version 3.5 or newer (at least, this solved the problem in my case).

Unfortunately, I have no idea how to measure the write traffic, as IVB appears to support no uncore events. So if somebody knows the solution to this problem, please post it here.
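For completeness, the read-traffic arithmetic above as a trivial helper (the parameter names are just descriptive):

#include <stdint.h>

/* total IVB read traffic in bytes, per the formula above */
static inline uint64_t ivb_read_traffic_bytes(uint64_t any_data_llc_miss_local,
                                              uint64_t any_rfo_llc_miss_local)
{
    return 64 * (any_data_llc_miss_local + any_rfo_llc_miss_local);
}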
Shuja-ur-Rehman_B_

Hi Max R.

Did you find a way of calculating the write traffic?

 
