OFFCORE events for CODE/DATA traffic measurements on Sandy Bridge

Alexander_Alexeev · ‎11-06-2014

Hello

First of all, I am not trying to profile an application, find a bottleneck, and tune it, although the final goal is similar.
I am probably not the first one trying to solve the problem described below, so relevant links and references are appreciated (I could not find much with Google).

What I am trying to do is build a fairly precise model of application execution on the CPU and understand memory subsystem utilization. The questions this model should answer are straightforward (and seem quite simple, but I don't know how to answer them):

1. What is the DATA and CODE traffic in the memory subsystem (L1I/L1D <-> L2; L2 <-> LLC; LLC <-> DRAM)?

2. What is the proportion of DATA and CODE in QPI traffic? (This is interesting, but has lower priority.)

The first thing I tried was to reproduce the technique described in section B.2.3.6 of the Intel architecture optimization manual:

The per-socket read bandwidth can be measured with the events:
OFFCORE_RESPONSE_0.DATA_IFETCH.L3_MISS_LOCAL_DRAM
OFFCORE_RESPONSE_0.DATA_IFETCH.L3_MISS_REMOTE_DRAM
The total read bandwidth for all sockets can be measured with the event:
OFFCORE_RESPONSE_0.DATA_IFETCH.ANY_DRAM
The per-socket non-temporal store bandwidth can be measured with the events:
OFFCORE_RESPONSE_0.OTHER.L3_MISS_LOCAL_CACHE_DRAM
OFFCORE_RESPONSE_0.OTHER.L3_MISS_REMOTE_DRAM
The total non-temporal store bandwidth can be measured with the event:
OFFCORE_RESPONSE_0.OTHER.ANY.CACHE_DRAM

Unfortunately, there is no direct mapping from those offcore events to the offcore events available on Sandy Bridge.
The names are similar, but without good documentation it is difficult to tell what has changed in their semantics.

Loads are a good place to start. I chose events that have LLC_MISS in their name (not all of them, as there seems to be some overlap) and collected data for them at system-wide granularity.

List of events:

OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_MISS.ANY_RESPONSE_0
OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.ANY_RESPONSE_1
OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_MISS.ANY_RESPONSE_0
OFFCORE_RESPONSE.PF_L2_DATA_RD.LLC_MISS.ANY_RESPONSE_1
OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_MISS.ANY_RESPONSE_0
OFFCORE_RESPONSE.PF_LLC_DATA_RD.LLC_MISS.ANY_RESPONSE_1

Clearly, in such a case the result has to be validated against data collected through an alternative monitoring unit.
Uncore memory bandwidth measured via pcm-memory is the best candidate.
So these two numbers should be close: (SUM_OF_COUNTERS_VALUE * 64) and (MEMORY_READ_BW - MEMORY_READ_IO_BW).
Here MEMORY_READ_IO_BW is just the outbound PCIe traffic for the system, and 64 bytes is the size of one transfer.

The data for the experiment is below (counter values are the number of events per second):

OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_MISS.ANY_RESPONSE_0   460014
OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.ANY_RESPONSE_1   14320430
OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_MISS.ANY_RESPONSE_0   540016
OFFCORE_RESPONSE.PF_L2_DATA_RD.LLC_MISS.ANY_RESPONSE_1   2020061
OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_MISS.ANY_RESPONSE_0   40001
OFFCORE_RESPONSE.PF_LLC_DATA_RD.LLC_MISS.ANY_RESPONSE_1   2280068

SUM = 19660589
BW = SUM * 64 / 1024 /1024 = 1200 MB/sec
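The arithmetic above can be reproduced with a short Python sketch. It also splits the total into CODE and DATA shares, which is the proportion the model is ultimately after. The names `counters`, `CACHE_LINE_BYTES`, and `to_mb_per_s` are made up for illustration, the classification by the `_CODE_RD` / `_DATA_RD` substring is an assumption about how the events should be grouped, and the 64 bytes per event is the same assumption as in the formula above.

```python
# Counter values (events per second) copied from the list above.
counters = {
    "OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_MISS.ANY_RESPONSE_0": 460014,
    "OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.ANY_RESPONSE_1": 14320430,
    "OFFCORE_RESPONSE.PF_L2_CODE_RD.LLC_MISS.ANY_RESPONSE_0": 540016,
    "OFFCORE_RESPONSE.PF_L2_DATA_RD.LLC_MISS.ANY_RESPONSE_1": 2020061,
    "OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_MISS.ANY_RESPONSE_0": 40001,
    "OFFCORE_RESPONSE.PF_LLC_DATA_RD.LLC_MISS.ANY_RESPONSE_1": 2280068,
}

# Assumption from the post: every counted event moves one 64-byte line.
CACHE_LINE_BYTES = 64


def to_mb_per_s(events_per_s):
    """Convert events/s to MB/s using the same 1024*1024 divisor as the post."""
    return events_per_s * CACHE_LINE_BYTES / 1024 / 1024


# Group events by request type, based on the event name (an assumption).
code_events = sum(v for k, v in counters.items() if "_CODE_RD" in k)
data_events = sum(v for k, v in counters.items() if "_DATA_RD" in k)
total_events = code_events + data_events

print(f"CODE:  {to_mb_per_s(code_events):8.2f} MB/s")
print(f"DATA:  {to_mb_per_s(data_events):8.2f} MB/s")
print(f"TOTAL: {to_mb_per_s(total_events):8.2f} MB/s")
print(f"CODE share of LLC-miss reads: {code_events / total_events:.1%}")
```

The six listed values add up to 19660590, one more than the SUM quoted above; the difference does not affect the 1200 MB/sec figure.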

PCM-Memory output
Read Throughput(MB/s):   1681.88
Write Throughput(MB/s):   1761.37
Memory Throughput(MB/s):   3443.25

IO BW is measured via an external tool
Read IO BW(MB/s): 896.24

Validation step (difference is about 50%)
1200 MB/s != 1682 MB/s - 896 MB/s = 786 MB/s
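To make the size of the mismatch explicit, here is a minimal Python sketch of the validation step. The variable names and the `relative_excess` helper are hypothetical; the numbers are the ones reported above, and the model (core read traffic = uncore read bandwidth minus IO read bandwidth) is the one proposed in this post, not a confirmed property of the counters.

```python
# Figures reported in the post, all in MB/s.
offcore_read_mb_s = 1200.0   # estimate from the summed OFFCORE counters
pcm_read_mb_s = 1681.88      # pcm-memory "Read Throughput"
io_read_mb_s = 896.24        # IO read bandwidth from the external tool


def relative_excess(estimate, reference):
    """Return how far `estimate` is above `reference`, as a fraction of `reference`."""
    return (estimate - reference) / reference


# Model from the post: DRAM reads caused by cores = uncore reads - IO reads.
expected_core_read_mb_s = pcm_read_mb_s - io_read_mb_s
excess = relative_excess(offcore_read_mb_s, expected_core_read_mb_s)

print(f"Expected core read BW:  {expected_core_read_mb_s:.2f} MB/s")
print(f"OFFCORE-based estimate: {offcore_read_mb_s:.2f} MB/s")
print(f"Estimate exceeds expectation by {excess:.1%}")
```

With these inputs the OFFCORE-based estimate is higher than the uncore-derived figure, not lower, which suggests the selected events count some requests more than once or count traffic that does not reach DRAM as reads.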

Why? Probably the wrong set of counters is used. If so, which ones should be used instead?

There are other counters that could potentially be used, but the question is what they really count:
OFFCORE_RESPONSE.ALL_DEMAND_MLC_PREF_READS.LLC_MISS.ANY_RESPONSE_0
OFFCORE_RESPONSE.ALL_DEMAND_MLC_PREF_READS.LLC_MISS.LOCAL_DRAM_0
OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_MISS.LOCAL_DRAM_1
OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.LOCAL_DRAM_0
OFFCORE_RESPONSE.PF_L2_DATA_RD.LLC_MISS.LOCAL_DRAM_1

In conclusion, the main questions are:
   1. Which events can be used to measure CODE and DATA load traffic on Sandy Bridge, per core and per socket?
   2. What is wrong with the model described above?
   3. How can store traffic for DATA be measured on Sandy Bridge? Since IO traffic is quite high in our system, it is preferable to use OFFCORE events, as UNCORE includes IO.
   4. With Haswell it might be easy to work with uncore and filter out PCIe traffic, but what about an approach for offcore?

Thanks,
Alexander