Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

QPI link counters

Jacob_K_
Beginner

I am testing performance on a two-node NUMA system (an Intel Xeon E5-2620 in each socket).
The test runs 12 threads (6 on each node) that access a shared memory region.
I run it once with all of the memory allocated on a specific node and once with the memory interleaved between the nodes.
The result is that the test runs faster in interleaved mode.
I thought I would check how much data is actually transferred between the nodes, since that might explain why interleaved is faster.
I used the OFFCORE_REQUESTS.DEMAND_DATA_RD event (r1B0), but it showed pretty much the same result for both policies.

Does anyone know which events I should check that might explain this result?
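
For reference, this is roughly how the two runs are launched (a sketch; ./shm_test stands in for my actual test binary):

# all memory bound to node 0, so the 6 threads on node 1 access it remotely
perf stat -e r1B0 numactl --membind=0 ./shm_test

# memory interleaved page-by-page across both nodes
perf stat -e r1B0 numactl --interleave=0,1 ./shm_test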

5 Replies
McCalpinJohn
Honored Contributor III

Interleaved data placement is expected to be faster than single-chip data placement for threaded workloads (unless the memory bandwidth requirement is near zero).

When you are running with interleaved data, the threads have twice as much DRAM bandwidth that they can draw from, which will help until the bandwidth demands run into the limits of what the QPI interface can transfer.

In both the single-socket-placement and interleaved-placement scenarios, half of the data traffic has to cross the QPI interface.  But in the interleaved case you are guaranteed that the traffic is symmetric (assuming the threads all behave the same way), which maximizes the available QPI bandwidth.  In the case of single-socket placement, the QPI interface is loaded asymmetrically, depending on the read/write ratios of the threads running on the "remote" socket.

A proper analysis requires that you understand both the data movement through the system *and* the ability of the code to tolerate load imbalance. 

If your system supports the QPI Link Layer counters, Event 0x02, Umask 0x08 "RxL_FLITS_G1.DRS_DATA" will provide a count of coherent data traffic *received* by each chip on each link.   If you have a recent Linux distribution, you can get whole-program measurements with a command like:

perf stat -a -A -C 0,11 -e "uncore_qpi_0/event=drs_data/" -e "uncore_qpi_1/event=drs_data/" a.out

"perf stat" flags:

  • "-a" counts for all processes (not just the process run under "perf stat")
  • "-A" tells perf to report results separately for each core, rather than summed
  • "-C 0,11" tells perf to only read the counters using core 0 and core 11. 
    • On a 2-socket system with 12 cores (assuming HyperThreading is disabled), cores 0 and 11 will be on different chips whether the cores are numbered in an alternating or contiguous order.
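
Before running the measurement, you can do a quick sanity check that the kernel actually exposes the QPI uncore PMUs and the "drs_data" event alias (paths shown for a Sandy Bridge EP system; the exact contents depend on the kernel version):

ls -d /sys/devices/uncore_qpi_*
cat /sys/devices/uncore_qpi_0/events/drs_data     # shows the raw event encoding that the alias expands to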

You will probably want memory controller counters on each chip as well to help complete the data flow map.   With "perf stat", you can get this data by adding the additional event requests:

-e "uncore_imc_0/event=0x04,umask=0x03/" -e "uncore_imc_1/event=0x04,umask=0x03/" -e "uncore_imc_2/event=0x04,umask=0x03/" -e "uncore_imc_3/event=0x04,umask=0x03/"

-e "uncore_imc_0/event=0x04,umask=0x0c/" -e "uncore_imc_1/event=0x04,umask=0x0c/" -e "uncore_imc_2/event=0x04,umask=0x0c/" -e "uncore_imc_3/event=0x04,umask=0x0c/"

The first set of events programs the memory controller counters to count cache line reads on each of the four DRAM channels.  The second set of events programs the memory controller counters to count cache line writes on each of the four DRAM channels.   If you add this to the "perf stat" command above, these counters will be read by core 0 on socket 0 and by core 11 on socket 1.
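
Putting the QPI and memory controller events together, the whole thing might look like the sketch below ("a.out" again stands in for your program).  Note that the IMC CAS counts are in units of 64-byte cache lines, while the QPI DRS_DATA counts are in units of 8-byte flits, so multiply by 64 or 8 respectively and divide by the elapsed time to estimate bandwidth.

perf stat -a -A -C 0,11 \
    -e "uncore_qpi_0/event=drs_data/" -e "uncore_qpi_1/event=drs_data/" \
    -e "uncore_imc_0/event=0x04,umask=0x03/" -e "uncore_imc_1/event=0x04,umask=0x03/" \
    -e "uncore_imc_2/event=0x04,umask=0x03/" -e "uncore_imc_3/event=0x04,umask=0x03/" \
    -e "uncore_imc_0/event=0x04,umask=0x0c/" -e "uncore_imc_1/event=0x04,umask=0x0c/" \
    -e "uncore_imc_2/event=0x04,umask=0x0c/" -e "uncore_imc_3/event=0x04,umask=0x0c/" \
    a.out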

If you can instrument your code to monitor spin-waiting overhead due to load imbalance, it makes the analysis easier.

TimP
Honored Contributor III

In order to take advantage of NUMA mode, you must arrange your application so that most memory accesses go to local memory, taking advantage of first-touch placement and thread affinity.

I think part of this depends on what you mean by "shared memory."  For NUMA mode to work, you would allow blocks of the shared memory to be allocated local to the threads that will access them; I take it from your description that you don't intend to do this.
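
For example, if the threads are OpenMP threads (an assumption; the post doesn't say what threading model is used), pinning them and having each thread initialize the data it will later work on gets you local placement via first touch:

# sketch only: pin one thread per core so that first-touch allocation lands on the node that uses the data
export OMP_NUM_THREADS=12
export OMP_PLACES=cores
export OMP_PROC_BIND=spread      # or KMP_AFFINITY=scatter with the Intel OpenMP runtime
./a.out                          # placeholder for the test binary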

McCalpinJohn
Honored Contributor III

Since you have the offcore response events working, you might try these four from Table 19-14 of Volume 3 of the Intel SW Developer's Manual:

OFFCORE_RESPONSE.PF_L2_DATA_RD.LLC_MISS.REMOTE_DRAM_N

OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.REMOTE_DRAM_N

OFFCORE_RESPONSE.PF_L2_DATA_RD.LLC_MISS.LOCAL_DRAM_N

OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.LOCAL_DRAM_N

These events will not capture writeback traffic, but they should capture all of the read traffic.
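
With perf these are requested through the offcore response facility.  A rough sketch follows; OFFCORE_RESPONSE_0 and OFFCORE_RESPONSE_1 are Event 0xB7 and 0xBB (Umask 0x01), and the <...> masks are placeholders that must be filled in from the encodings in Table 19-14, not real values:

perf stat -e "cpu/event=0xb7,umask=0x01,offcore_rsp=<LOCAL_DRAM_MASK>/" \
          -e "cpu/event=0xbb,umask=0x01,offcore_rsp=<REMOTE_DRAM_MASK>/" ./a.out
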
Jacob_K_
Beginner

My /sys/devices/ directory doesn't contain any uncore_* entries.

How can I install/add them? Is there a way to measure the QPI load without these devices?

McCalpinJohn
Honored Contributor III

If /sys/devices/ does not contain any uncore devices, then there is nothing you can do about it except upgrade the OS and hope that the BIOS does not prevent the OS from accessing and enumerating these devices.  This does not require a particularly new OS; on Xeon E5-26xx (Sandy Bridge EP) processors you can get this support in CentOS 6.4 or later.

On one of my Dell Xeon E5-2680 (Sandy Bridge EP) systems running CentOS 6.5 (kernel 2.6.32-431.17.1.el6.x86_64), the OS sets up an almost complete set of interfaces to the uncore devices:

$ ls -dl /sys/devices/uncore_*
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_0
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_1
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_2
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_3
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_4
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_5
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_6
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_cbox_7
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_ha
drwxr-xr-x 5 root root 0 Nov  9 12:25 /sys/devices/uncore_imc_0
drwxr-xr-x 5 root root 0 Nov  9 12:25 /sys/devices/uncore_imc_1
drwxr-xr-x 5 root root 0 Nov  9 12:25 /sys/devices/uncore_imc_2
drwxr-xr-x 5 root root 0 Nov  9 12:25 /sys/devices/uncore_imc_3
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_pcu
drwxr-xr-x 5 root root 0 Nov  9 12:25 /sys/devices/uncore_qpi_0
drwxr-xr-x 5 root root 0 Nov  9 12:25 /sys/devices/uncore_qpi_1
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_r2pcie
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_r3qpi_0
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_r3qpi_1
drwxr-xr-x 4 root root 0 Nov  9 12:25 /sys/devices/uncore_ubox

The QPI devices are a special case -- many systems have a BIOS that disables these by default, and a BIOS upgrade is required to enable them.

Unfortunately, some systems have misbehaved BIOS implementations that prevent the OS from finding any of these PCI configuration space devices.  I have heard rumors that some newer versions of Linux can work around this problem, but I have not tried this on any of my systems.  We are installing a new system from the problematic vendor right now, so I might be investigating this again soon.
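
As a rough check of whether the BIOS exposes these PCI devices at all, you can look for the QPI functions in the PCI listing (the exact device names vary from system to system):

lspci | grep -i -e qpi -e quickpath
# if the BIOS hides these devices, they will not appear here, and the kernel cannot
# create the corresponding /sys/devices/uncore_* entries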
