First of all, I am not trying to profile the application, find a bottleneck, and tune it, although the final goal is similar to that.
I am probably not the first one trying to solve the problem described below, so relevant links and references are appreciated (I could not find much with Google).
What I am trying to do is build a fairly precise model of application execution on the CPU and understand memory-subsystem utilization. The questions this model should answer are straightforward (and seem quite simple, but I don't know how to find them out):
1. What is the DATA and CODE traffic in the memory subsystem (L1I/L1D <-> L2; L2 <-> LLC; LLC <-> DRAM)?
2. What is the proportion of DATA and CODE in QPI traffic? (This is interesting, but has second priority.)
The first thing I tried was to reproduce the technique described in section B.2.3.6 of the Intel Architecture Optimization Manual.
That section gives events for measuring: the per-socket read bandwidth, the total read bandwidth for all sockets, the per-socket non-temporal store bandwidth, and the total non-temporal store bandwidth.
Unfortunately, there is no direct mapping between those offcore events and the offcore events available on Sandy Bridge.
The names are almost identical, but without good documentation it is difficult to tell what has changed in the semantics.
Loads are a good start. I chose the events that have LLC_MISS in their name (not all of them, as there seems to be some overlap). Data was collected for them at system granularity.
List of events:
Clearly, in such a case it is necessary to validate the result against data collected through an alternative monitoring unit.
The uncore memory bandwidth measured via pcm-memory is the best candidate.
So these two numbers should be close: (SUM_OF_COUNTER_VALUES * 64) and (MEMORY_READ_BW - MEMORY_READ_IO_BW).
Here MEMORY_READ_IO_BW is just the outbound PCIe traffic for the system, and 64 bytes is the size of one transfer (a cache line).
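The conversion from event rates to bandwidth can be sketched in a few lines of Python; the only assumption is that every counted event corresponds to one 64-byte cache-line transfer (function and constant names are mine):

```python
# Sketch: convert an offcore event rate (events/second) into bandwidth,
# assuming every counted event moves one 64-byte cache line.
CACHE_LINE_BYTES = 64

def events_to_mb_per_sec(events_per_sec: int) -> float:
    """Translate a per-second event count into MB/s."""
    return events_per_sec * CACHE_LINE_BYTES / 1024 / 1024

# The counter sum measured in the experiment below:
print(round(events_to_mb_per_sec(19_660_589)))  # -> 1200
```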
The data for the experiment is below (counter values are the number of events per second):
SUM = 19660589
BW = SUM * 64 / 1024 / 1024 = 1200 MB/s
Read Throughput(MB/s): 1681.88
Write Throughput(MB/s): 1761.37
Memory Throughput(MB/s): 3443.25
The IO BW is measured via an external tool:
Read IO BW(MB/s): 896.24
Validation step (the difference is about 50%):
1200 MB/s != 1682 MB/s - 897 MB/s = 785 MB/s
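A minimal sketch of this validation step, using the numbers above (the variable names are mine):

```python
# Compare the offcore-derived read bandwidth against pcm-memory's uncore
# read bandwidth with outbound PCIe (IO) reads subtracted. All values MB/s.
offcore_read = 1200.0      # SUM_OF_COUNTER_VALUES * 64 B
uncore_read  = 1681.88     # pcm-memory "Read Throughput"
io_read      = 896.24      # outbound PCIe read traffic (external tool)

expected = uncore_read - io_read          # what the offcore sum should match
rel_diff = (offcore_read - expected) / expected
print(f"expected {expected:.1f} MB/s, got {offcore_read:.1f} MB/s "
      f"({rel_diff:.0%} off)")
```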
Why? Probably the wrong set of counters is being used. But then which ones?
There are other counters that could potentially be used, but the question is what they really count.
In conclusion, my main questions are:
1. What events can be used to measure CODE and DATA load traffic on Sandy Bridge, per core and per socket?
2. What is wrong in the model described above?
3. How can store traffic for DATA be measured on Sandy Bridge? Since IO traffic is quite high in our system, it is preferable to use OFFCORE events, as UNCORE counts include IO.
4. With Haswell it might be easier to work with the uncore and filter out PCIe traffic, but what is the approach for offcore events?