Solved: Writeback in DRAM

md_K_ · ‎06-30-2014

Hi

I am looking for hardware performance counters to measure the number of writebacks to DRAM (dirty lines evicted from the LLC). I haven't found anything for Sandy Bridge/Ivy Bridge systems. OFFCORE_RESPONSE_0:WB measures writebacks to the LLC (dirty lines evicted from L2).

Is there any chance this can be measured on other architectures (Atom, etc.)? I would really appreciate any help. Thanks

McCalpinJohn · ‎06-30-2014

These events are not available from the core performance counters, but can be counted using the uncore counters on the "Sandy Bridge EP" and "Ivy Bridge EP" processors (Xeon E5-26xx and Xeon E5-26xx v2). The CBo counters should be able to measure this directly, or the iMC counters will provide a good approximation (if there are no streaming stores and only small amounts of IO DMA traffic).

These are discussed in Intel's documents 327043 (Xeon E5-2600 uncore performance monitoring guide) and 329468 (Xeon E5-2600 v2 uncore performance monitoring guide).

There are some references to uncore performance monitoring for the "client" versions of the Sandy Bridge (and presumably also Ivy Bridge) processors, but I can't find a particularly coherent summary. In Vol 3 of the SW developer's guide (325384-049) section 18.9.6 discusses uncore performance monitoring on these parts, and section 35.8.1 (Table 35-15) describes the MSRs used for the interface. Intel's VTune product uses CBO performance monitors on these parts and it looks like some of the events count LLC writebacks, but I don't have a system to test these on any more.

Patrick_F_Intel1 · ‎06-30-2014

Hello md K,

Can you see if you have the events mentioned in this thread? https://software.intel.com/en-us/forums/topic/389095

This should cover what you are looking for, but at the uncore level. You won't be able to say which core generated the traffic.

Pat

md_K_ · ‎06-30-2014

Thanks John and Patrick. I will check the Intel references.

Somehow, I don't see the events UNC_ARB_TRK_REQUESTS and UNC_CBO_CACHE_LOOKUP working. I have been checking on a few systems:

Xeon E5-2620 (Sandybridge), Linux 3.8.2 libpfm 4.x

Xeon E5-1607 (Sandybridge), Linux 3.13 libpfm 4.x

Xeon E5-2650 (Ivybridge), Linux 3.13 libpfm 4.x

Thanks again for help.

Patrick_F_Intel1 · ‎06-30-2014

I see that some systems name the UNC_ARB_TRK_REQUESTS as UNC_ARB_TRK_REQUEST instead.
I'm not sure if libpfm supports uncore counters. I'm guessing that Linux perf supports them but I don't really know.

md_K_ · ‎07-01-2014

So, what are the best options to get these uncore counters in Linux? It looks like libpfm doesn't do this. I guess we can do it using MSRs, but I can't find a good, compact reference on how to set up the registers.

Patrick_F_Intel1 · ‎07-01-2014

Have you looked at the Linux 'perf' utility? Or you could try downloading Intel VTune (there is a free trial).

McCalpinJohn · ‎07-01-2014

Linux perf support for uncore counters is incredibly difficult to track....

I think that the events mentioned above (e.g., UNC_ARB_TRK_REQUESTS) are for the "client" parts, not the Xeon E5-26xx series.

My Xeon E5-2680 systems are running RHEL 6.4, which has backported some of the uncore performance counters from much later (3.x) kernels. The perf subsystem knows about the uncore CBo counters -- the /sys/bus/event_source/devices/ directory contains 20 sub-directories starting with the names "uncore_", but only the "uncore_imc_*" and "uncore_qpi_*" have events defined by name (in their "events" subdirectories). So I can access the CBo counters using commands like
perf stat -e "uncore_cbox_0/event=0x00,umask=0x00/"
but I have to explicitly program each field.
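To see which uncore units on a given system have named events and which need raw event/umask programming, the sysfs directory mentioned above can be walked with a few lines of shell. This is only a sketch: the function name `list_uncore_events` is made up for illustration, and the directory layout is assumed to match the description in this post (newer kernels may also place `.scale` and `.unit` helper files next to the event files, which are skipped here).

```shell
#!/bin/sh
# List every uncore PMU known to perf and any events it defines by name.
# The sysfs root is a parameter (hypothetical helper, not part of perf) so the
# function can be pointed at a test tree; the default is the path from the post.
list_uncore_events() {
    root="${1:-/sys/bus/event_source/devices}"
    for pmu in "$root"/uncore_*; do
        [ -d "$pmu" ] || continue              # glob matched nothing
        name=$(basename "$pmu")
        found=0
        for ev in "$pmu"/events/*; do
            [ -f "$ev" ] || continue           # no events directory, or empty
            case "$ev" in *.scale|*.unit) continue ;; esac
            # Each event file holds the encoding, e.g. "event=0x04,umask=0x03"
            printf '%s/%s/ -> %s\n' "$name" "$(basename "$ev")" "$(cat "$ev")"
            found=1
        done
        [ "$found" -eq 1 ] || printf '%s (no named events; set event= and umask= by hand)\n' "$name"
    done
}

list_uncore_events "$@"
```

Units reported without named events are the ones that need the explicit `event=...,umask=...` form shown in the command above.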

I have not yet been able to work through the CBo counter documentation in enough detail to find which event(s) correspond to LLC writebacks.

The Home Agent (HA) provides the interface between the ring and the Memory Controllers. Event 0x01, Umask 0x0C REQUESTS.WRITES will count all incoming write requests to the Home Agent. This includes streaming stores, writebacks of dirty lines, intervention requests that hit modified data and force a writeback, and anything else that causes a write to memory. I have not compared these counts to the corresponding read and write values from the Memory Controller counters.

The iMC (integrated Memory Controller) counters are supported by "perf stat" on my system with a couple of pre-defined events -- including the critical ones "cas_count_read" (corresponding to iMC Event 0x04/Umask 0x03: CAS_COUNT.RD) and "cas_count_write" (corresponding to iMC Event 0x04/Umask 0x0C: CAS_COUNT.WR). I have looked at these pretty closely and they appear to be accurate. Since these counters are at the DRAMs they count all traffic, including streaming stores and IO traffic to system memory, so the values will always be higher than the L3 writeback traffic.
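To turn CAS counts into memory traffic, the per-channel counts can be summed and multiplied by the transfer size. The sketch below assumes each CAS moves one 64-byte cache line, which is the usual accounting on these DDR3 systems; the function names are made up for illustration, and the sample counts are arbitrary.

```shell
#!/bin/sh
# Convert a CAS count to bytes, assuming one 64-byte cache line per CAS.
cas_to_bytes() {
    echo $(( $1 * 64 ))
}

# Sum the counts from all channels (one argument per channel) and
# report the total in bytes.
total_bytes() {
    sum=0
    for c in "$@"; do
        sum=$(( sum + c ))
    done
    cas_to_bytes "$sum"
}

# Example with arbitrary per-channel counts for four channels.
total_bytes 1000 2000 3000 4000
```

Applied to cas_count_write this gives an upper bound on L3 writeback traffic, since the same counter also sees streaming stores and IO writes.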

md_K_ · ‎07-01-2014

Thanks a lot John. I haven't used perf a lot. In my systems, I do not see any IMC events (CAS_COUNT.RD, CAS_COUNT.WR, etc). When I do "perf list" it shows the common hardware, software, hardware cache, kernel PMU events. It looks like I can read the raw hardware event descriptor in the format of

rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]

I tried "perf stat -e rC04 " (Umask 0x0C, Event 0x04) to measure CAS_COUNT.WR. I get some numbers, but they don't look correct (less than what I expect). Am I missing something, or do I need something else to read the iMC counters?

thanks again.

McCalpinJohn · ‎07-02-2014

To get "perf" to access the counters in the uncore you need to specify the uncore unit in the descriptor.

There are a couple of gotchas, so I will include a complete working script. This includes both the correct programming of the iMC performance counter events and shows how to get "perf stat" to read the counters using only one core on each socket. (Note that the specific core numbers may need to be different on your system -- the important thing is to only use one core per socket, otherwise perf will add the numbers together, which is wrong since all the cores would be reading the same counters).

Obviously the script is set up to run a particular configuration of the STREAM benchmark, which you can replace with whatever code you are interested in. In this case I compiled a single-threaded version of STREAM with "-DSTREAM_ARRAY_SIZE=10000000 -DNTIMES=100" and computed the expected memory traffic based on the assumption that all the data would miss in the caches and that streaming stores would be generated by the compiler (icc, in this case). If you compile with gcc you will not get streaming stores, so the read traffic will go up by an amount equal to the write traffic (since each array will be read from DRAM into the cache before being overwritten).

#!/bin/bash

# define the Integrated Memory Controller performance counter event sets to 
# measure the four events:
#    All Read CAS operations				Event 0x04, Umask 0x03
#    All Write CAS operations				Event 0x04, Umask 0x0C
#    All Page Open (ACTIVATE) operations	Event 0x01, Umask 0x00
#    All Page Miss (i.e., conflict) events	Event 0x02, Umask 0x01
# Each of the four "SET" variables below includes one of these events on all 
# four memory controller channels.

# Each list is a plain string that is expanded unquoted ($SET1 ...) further down,
# so the shell splits it into separate "-e event" arguments for perf.
SET1="-e uncore_imc_0/event=0x04,umask=0x03/ -e uncore_imc_1/event=0x04,umask=0x03/ \
      -e uncore_imc_2/event=0x04,umask=0x03/ -e uncore_imc_3/event=0x04,umask=0x03/"
SET2="-e uncore_imc_0/event=0x04,umask=0x0c/ -e uncore_imc_1/event=0x04,umask=0x0c/ \
      -e uncore_imc_2/event=0x04,umask=0x0c/ -e uncore_imc_3/event=0x04,umask=0x0c/"
SET3="-e uncore_imc_0/event=0x01,umask=0x00/ -e uncore_imc_1/event=0x01,umask=0x00/ \
      -e uncore_imc_2/event=0x01,umask=0x00/ -e uncore_imc_3/event=0x01,umask=0x00/"
SET4="-e uncore_imc_0/event=0x02,umask=0x01/ -e uncore_imc_1/event=0x02,umask=0x01/ \
      -e uncore_imc_2/event=0x02,umask=0x01/ -e uncore_imc_3/event=0x02,umask=0x01/"

# Since the uncore counters are "per chip", I only need to read these on one core per chip.
# For all of TACC's systems, I know that cores 0 and 9 are on different chips, whichever
#   assignment scheme is used.
#
# "perf stat" flags:
#    "-a" counts for all processes (not just the process run under "perf stat")
#    "-A" tells perf to report results separately for each core, rather than summed
#    "-x ," tells perf to report results as a comma-separated list (easier to import 
#           into scripts or spreadsheets)
#    "-o file" directs the output of perf to a separate log file (rather than stdout)

# This script runs two tests - 
#     the first with memory allocated on chip 0 and the thread run on chip 0
#     the second run with memory allocated on chip 1 and the thread run on chip 0
#
# Expected values for cache line reads and writes for this binary are approximately:
#     100 iters * 6 reads/iter * 8 Bytes/element * 
#          10M elements / 64 Bytes/cacheline / 4 memory controllers = 187,500,000
#     100 iters * 4 writes/iter * 8 Bytes/element * 
#          10M elements / 64 Bytes/cacheline / 4 memory controllers = 125,000,000
#
# Read values are always a few percent higher due to TLB reload traffic, 
# while write values are typically pretty close
#  Typical values on a Xeon E5-2600 node at TACC are:
#     ~201,000,000 for reads (7.2% above expected)
#     ~132,000,000 for writes (5.6% above expected)
# With the memory binding used here, traffic on the other chip seldom exceeds 1% of the 
# nominal values, and is usually much lower

echo "----------------------------------------"
echo "Test 1: Memory on Node 0, Task on Core 0"
perf stat -o perf.out.imc.test1 -x , -a -A -C 0,9 $SET1 $SET2 $SET3 $SET4 \
        numactl --membind=0 --physcpubind=0 ./stream.snb.10M.100x

echo "----------------------------------------"
echo "Test 2: Memory on Node 1, Task on Core 0"
perf stat -o perf.out.imc.test2 -x , -a -A -C 0,9 $SET1 $SET2 $SET3 $SET4 \
        numactl --membind=1 --physcpubind=0 ./stream.snb.10M.100x
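The expected per-channel values quoted in the script comments can be reproduced with integer arithmetic. The sketch below assumes the standard STREAM kernels (Copy, Scale, Add, Triad), which together read 6 arrays and write 4 arrays per iteration, along with 8-byte elements, 64-byte cache lines, no cache hits, and traffic spread evenly over the channels. The function name `expected_cas` is made up for illustration.

```shell
#!/bin/sh
# expected_cas ITERS ACCESSES_PER_ITER ELEMENTS CHANNELS
# Cache-line transfers per memory channel, assuming 8-byte elements and
# 64-byte cache lines, with every access missing in the caches.
expected_cas() {
    echo $(( $1 * $2 * 8 * $3 / 64 / $4 ))
}

echo "reads per channel:  $(expected_cas 100 6 10000000 4)"
echo "writes per channel: $(expected_cas 100 4 10000000 4)"
```

With the parameters used here this gives 187,500,000 reads and 125,000,000 writes per channel, matching the figures in the comments. For a gcc build without streaming stores, the write-allocate reads would add 4 more array reads per iteration, so the read argument becomes 10.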

md_K_ · ‎07-02-2014

Thanks a lot John. It works!!!!! On my dual-socket system, I see there are 8 entries (/sys/bus/event_source/devices/uncore_imc_0 to uncore_imc_7). I used the first four (imc 0 to 3); the first two counters have positive values and the other two are zero. The numbers seem correct. If I use the last four counters (imc 4 to 7), perf gives an error:

<not supported> "uncore_imc_4/event=0x04,umask=0x0c/"
<not supported> "uncore_imc_5/event=0x04,umask=0x0c/"
<not supported> "uncore_imc_6/event=0x04,umask=0x0c/"
<not supported> "uncore_imc_7/event=0x04,umask=0x0c/"

A few clarifications:

1. If I use '-a -A' without '-C', it still reports results for only two cores (core 0 and core 8), one from each socket. Does it pick one core from each socket automatically? I thought it would report all 16 cores (with the same counter values).

2. Is there any correlation with whether the DIMM slots are filled or empty? I have 16 DIMM slots, and 4 of them are occupied (2 DRAM sticks per socket).

3. Is there any way to get these values from inside a program? For example, I want to measure it for a particular loop. I did it using libpfm by starting/stopping the counters. Is there a similar interface?

thanks

McCalpinJohn · ‎07-03-2014

I am guessing that the IMC 4/5/6/7 devices are showing up on your Ivy Bridge box? The uncore performance monitoring guide for the Xeon E5-2600 v2 mentions a second memory controller, but it does not actually seem to exist (at least not on these parts). Channels 0-3 are the ones that are supposed to be working, and they should be the only ones visible on the Sandy Bridge boxes.

I don't know enough about different versions of "perf" to know whether they have made it smart enough to understand that it only needs to read the chip-level counters once on each chip. The version on my systems is not that smart. If you don't need any extra options to get it to do the right thing, that is great.
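For perf versions that do not pick one core per chip on their own, the core list for "-C" can be derived from sysfs instead of being hard-coded. This is a minimal sketch, assuming the kernel exposes topology/physical_package_id for each CPU under /sys/devices/system/cpu; the function name `one_cpu_per_socket` is made up for illustration.

```shell
#!/bin/sh
# Print a comma-separated list holding the lowest-numbered CPU of each socket,
# in a form usable with "perf stat -C". The sysfs root is a parameter
# (hypothetical helper) so the function can be run against a test tree.
one_cpu_per_socket() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/topology/physical_package_id; do
        [ -r "$f" ] || continue
        cpu=${f#"$root"/cpu}       # "12/topology/physical_package_id"
        cpu=${cpu%%/*}             # "12"
        printf '%s %s\n' "$(cat "$f")" "$cpu"
    done | sort -n -k1,1 -k2,2 |
        awk '!seen[$1]++ { printf "%s%s", sep, $2; sep = "," } END { print "" }'
}

one_cpu_per_socket "$@"
```

The result can be passed straight to perf, for example `perf stat -a -A -C "$(one_cpu_per_socket)" ...`, in place of the fixed "0,9" used in the script above.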

The counters don't have the ability to filter accesses by rank (at least I don't think they do), so you will get counts for all DIMMs on any channel that has DIMMs installed.

I was able to read the counters inside the program by running as root and opening the appropriate device files in the /proc/bus/pci/ tree. The bus numbers corresponding to the two sockets are set by the BIOS and are not always consistent. I have seen systems using buses 1f & 3f, systems using 3f & 7f, and systems using 7f & ff. Once the files are open, you can use "pread" with the appropriate offset to read the upper and lower 32 bits of the counter values. You can, of course, also program the counters this way if you open the files with write permission. In my case I don't usually need to do this because our batch system automatically programs the counters to record the base set of events that I want to use, so I only need to read the counter values. I don't ever bother to stop the counters -- I just read values before and after the region of interest.