There have already been similar discussions, but some things are still not clear to me.
I would like to measure memory bandwidth on SandyBridge E5-2670, dual socket machine.
I'm familiar with the uncore counters and they work well, but I want to measure memory bandwidth per core, if that is possible at all.
I tried using OFFCORE_RESPONSE_x counters and I'm able to use and read them.
(I use PAPI library)
Similar to what was discussed in this thread: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...
I measured read traffic as (OFFCORE_RESPONSE_x:ANY_DATA:LLC_MISS_LOCAL:LLC_MISS_REMOTE + OFFCORE_RESPONSE_x:ANY_RFO:LLC_MISS_LOCAL:LLC_MISS_REMOTE) * 64
I also added the LLC_MISS_REMOTE mask, so I get slightly higher values than with LLC_MISS_LOCAL alone.
I summed the measured per-core values for several workloads and compared them with the uncore counters; the two are very close.
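For reference, the arithmetic behind the read formula above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the counts are assumed to have already been read (for example through PAPI), and 64 is the cache line size in bytes used in the formula.

```python
CACHE_LINE_BYTES = 64  # cache line size assumed by the formula in the post


def read_traffic_bytes(any_data_misses, any_rfo_misses):
    """Estimated bytes read from memory by one core.

    Both arguments are hypothetical names for the two OFFCORE_RESPONSE
    counts (ANY_DATA and ANY_RFO, each with the LLC_MISS_LOCAL and
    LLC_MISS_REMOTE masks) collected over the same interval.
    """
    return (any_data_misses + any_rfo_misses) * CACHE_LINE_BYTES


def bandwidth_gb_per_s(traffic_bytes, elapsed_seconds):
    """Convert a byte count over an interval to GB/s (1 GB = 1e9 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return traffic_bytes / elapsed_seconds / 1e9
```

Summing `read_traffic_bytes` over all cores gives the value that I compared against the uncore counters.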
But I'm having a problem with write traffic. I tried to measure write traffic as:
(OFFCORE_RESPONSE_x:STRM_ST:ANY_RESPONSE + OFFCORE_RESPONSE_x:WB:ANY_RESPONSE) * 64
I think the streaming-store component is useful, but the writeback component is not accurate. The write traffic that I derive this way is usually equal to or higher than the real write traffic to DRAM.
Can someone confirm what is measured in the writebacks, and whether the response setup has any influence?
Is it possible to derive something close to the real write traffic with the offcore_response counters?
I don't know whether the Offcore Response WriteBack event has ever worked on any processor -- or even whether it is theoretically possible to make it work....
The "offcore" in OFFCORE_RESPONSE refers to the interface between the L2 and the ring. For transactions that get a response, the event allows many filters, but a Writeback from L3 to memory has nothing to do with the interface between the core and the ring. It should be easy to measure Writebacks from L2 to the LLC, since those go through the agent, but there is no way to know whether those eventually turn into L3 to memory writebacks. It is possible that your formula overcounts write traffic because it counts all of the writebacks from L2 to L3 -- even if those never become L3 to memory writebacks.
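One rough way to see how large the overcount is would be to compare the summed per-core estimate with the write traffic reported by the uncore counters over the same interval. The Python sketch below is only an illustration under that assumption: the function names and the input layout are hypothetical, and the counts are assumed to have been collected already.

```python
CACHE_LINE_BYTES = 64  # cache line size used in the formula from the question


def offcore_write_estimate_bytes(per_core_counts):
    """Sum the per-core write estimate from the question's formula.

    per_core_counts is a hypothetical list of (strm_st, wb) tuples, one
    per core. The wb counts include L2-to-L3 writebacks that may never
    reach DRAM, so the result can exceed the real DRAM write traffic.
    """
    return sum((strm_st + wb) * CACHE_LINE_BYTES
               for strm_st, wb in per_core_counts)


def overcount_ratio(offcore_bytes, uncore_write_bytes):
    """Ratio of the offcore-based estimate to the uncore measurement.

    A value well above 1.0 suggests that many L2 writebacks were
    absorbed by the L3 rather than written to memory.
    """
    if uncore_write_bytes <= 0:
        raise ValueError("uncore_write_bytes must be positive")
    return offcore_bytes / uncore_write_bytes
```

The ratio says nothing about which core caused the DRAM writes; it only shows how far the summed estimate drifts from the uncore total.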
Hi John McCalpin,
Thank you for the reply. I think now I understand better how these counters work.
I guess the bottom line is that there is no way to measure writebacks to DRAM per core, so I will have to use the uncore counters for total bandwidth measurements.