Hi John McCalpin,

Stephan_W_ · ‎01-19-2017

Hi,

I want to measure the main memory bandwidth for different scenarios on a large cluster with dual-socket E5-2680 v3 nodes:

1. I want to run a daemon that samples the main memory bandwidth roughly once per second over several weeks, to get a feeling for how memory-bound the applications on the cluster are.
2. I want to use EXTRAE and Score-P with the perf interface to get the main memory bandwidth for a specific application and function.

The problem is that the machines use an old kernel version with what is, from my point of view, an outdated perf interface. The uncore counters are exposed neither through perf nor in /sys/devices/. I was only able to locate the iMC performance counters through lspci.

I already read the following topics, but they don't answer my questions:

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/607524
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/684637
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/535072

1. I was not able to find the offsets for the different configuration and counter registers in the Intel Xeon Processor E5 and E7 v3 Family Uncore Performance Monitoring document. The document also states that the counters are 48 bits wide, not 32 bits as stated in the first topic.

2. In the third topic, using perf with raw counters is suggested as a valid solution, but to me it looks like this solution requires the uncore_imc_x folder, and therefore also the event and mask "files", to be present within /sys/devices. As far as I understand the kernel, this means that the kernel has not exported the iMCs as PCI devices correctly, even though they are present in lspci. My question is whether I have to write a kernel device driver to export the registers to the user within the /sys/devices path, or whether there is another way to get perf working.

3. If it is possible to get perf working with raw counter access, it would be great to get custom perf events. Does anybody know how to add new perf events? To me it looks like perf uses the /sys/devices/xy/this_is_my_custom_event/{event,uvent} structure to get the events. Is this right?

In the past I have already written a kernel driver module that implemented a virtual bus system so that a user was able to modify the registers of a hardware device, but for this I need the kernel sources and so on to build the module against. Another problem is that I don't know the physical addresses that I have to map to user space. I have found nothing about the location of the registers, and lspci does not show any addresses either.

So at the end of the day, I really have to know what has to be done to solve the problem, so that we can push our service provider to develop the required solution. From my point of view, it shouldn't be too big a task to get it working. I assume that it will not take more than one man-month.

Any idea?

McCalpinJohn · ‎01-19-2017

A script to configure the iMC counters on a Xeon E5-2680 v3 is included in my forum post at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/564369#comment-1834049. Most of my 2-socket systems use buses 7f and ff, but some of the systems use buses 3f and 7f -- you will have to look at the output of "lspci" to see which buses your system uses for the uncore devices.

NOTE that with this script I am not using perf -- perf only works with raw counters if it understands the "box" that you are using. I am bypassing all of that and programming and reading the counters using ordinary shell commands.

The Xeon E5 v3 uncore performance monitoring guide is not easy to read (and it has a few mistakes), but the offsets are definitely all there. The script I referred to above uses offsets 0xD8, 0xDC, 0xE0, and 0xE4 for the control registers on each of the 8 DRAM channels in a two-socket node.

The counters are 48 bits wide in a 64-bit-aligned field, but you can only read 32 bits at a time from PCI configuration space, so the counters are described as being composed of two 32-bit fields. So to read programmable counter 0 on each channel, you would modify my script to have lines like:

       # low-order 32-bits of counter 0
       setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xA0.l
       # high-order 32-bits of counter 0
       setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xA4.l

I typically take these as ASCII hex values, concatenate them, and then convert to decimal using the shell's "printf" function. This is not particularly fast, but if I am only doing it before and after program execution the overhead is negligible.
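As a minimal sketch of that conversion (the function name `combine_halves` is made up for illustration), the two halves can also be combined with shell arithmetic instead of string concatenation. This variant does not depend on the low word being zero-padded to eight hex digits, and it assumes a shell with 64-bit arithmetic:

```shell
# combine_halves HI LO
# Print, as decimal, the 64-bit value formed from two 32-bit hex words.
# HI and LO are given without a 0x prefix, as setpci prints them.
combine_halves() (
    hi=$1
    lo=$2
    printf '%d\n' "$(( (0x$hi << 32) | 0x$lo ))"
)

# Example with made-up register contents: high word 1, low word 2.
combine_halves 00000001 00000002
```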

If you read these fields using the /sys/devices/ driver, you can ignore the 32-bit read restriction and use "pread()" (or lseek() followed by read()) on larger sizes. Using this interface, I typically read all four counters with a single pread() (32 Bytes, i.e. 256 bits, starting at offset 0xA0).
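A rough shell equivalent of such a block read is sketched below. It assumes the sysfs `config` file of the PCI function is readable (reading beyond the first 64 bytes typically needs root) and that GNU `dd` and `od` are available; the function name `read_imc_counters` is hypothetical. The four programmable counters occupy four 8-byte fields from offset 0xA0 to 0xBF, so one 32-byte read covers them:

```shell
# read_imc_counters CONFIG_FILE
# Seek to offset 0xA0 (160 = 5 * 32) and fetch 32 bytes with a single read(),
# then print the four 8-byte fields as unsigned decimals, one per line.
# od decodes in host byte order, which is little-endian on x86.
read_imc_counters() (
    cfg=$1
    dd if="$cfg" bs=32 skip=5 count=1 2>/dev/null |
        od -An -v -tu8 |
        tr -s ' ' '\n' |
        sed '/^$/d'
)

# Example (as root; the bus:device.function is the one used in this thread):
# read_imc_counters /sys/bus/pci/devices/0000:7f:14.0/config
```

Pointing the function at an ordinary file with known bytes at offset 0xA0 is an easy way to check the decoding before touching real hardware.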

The physical addresses for these PCI configuration space functions are in a 256 MiB region that is usually referred to as "PCI MMCONFIG" in /proc/iomem. This region provides 4KiB of address space for every possible bus, device, and function, so it is pretty easy to compute the addresses once you know the bit field widths.
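As a worked example of that computation, here is a small sketch. The function name `mmconfig_addr` is invented for illustration, and the base address 0x80000000 is only an example value; the real base has to be taken from the "PCI MMCONFIG" line of /proc/iomem on the machine in question.

```shell
# mmconfig_addr BASE BUS DEVICE FUNCTION OFFSET
# Field widths: 12 offset bits (4 KiB per function), 3 function bits,
# 5 device bits and 8 bus bits, i.e. 28 bits = 256 MiB in total.
mmconfig_addr() (
    base=$1; bus=$2; dev=$3; fn=$4; off=$5
    printf '0x%x\n' "$(( $base + ($bus << 20) + ($dev << 15) + ($fn << 12) + $off ))"
)

# Physical address of offset 0xA0 on bus 7f, device 14, function 0,
# for an assumed MMCONFIG base of 0x80000000.
mmconfig_addr 0x80000000 0x7f 0x14 0 0xA0
```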

Stephan_W_ · ‎01-20-2017

Hi John McCalpin,

Thank you very much for this fast and great answer. This really improved my understanding of the counters. I will test your script as soon as possible. I think this is the way to go to get the required information about the applications on the cluster.

Just one more question on this topic, since I have never used setpci. As far as I understand your post and the post in the link, you use setpci for both writing and reading values, without any option? So let me make a small example with the values of my system:

These are the iMC0 registers for channels 0 and 1 on CPU0 (2fb4 and 2fb5). But there are also the devices 2fb6 and 2fb7, and I don't know what they are.

/sbin/\lspci -s 7f:15 -vv
7f:15.0 System peripheral: Intel Corporation Device 2fb4 (rev 02)
   Subsystem: Intel Corporation Device 2fb4
   Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

7f:15.1 System peripheral: Intel Corporation Device 2fb5 (rev 02)
   Subsystem: Intel Corporation Device 2fb5
   Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

7f:15.2 System peripheral: Intel Corporation Device 2fb6 (rev 02)
   Subsystem: Intel Corporation Device 2fb6
   Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

7f:15.3 System peripheral: Intel Corporation Device 2fb7 (rev 02)
   Subsystem: Intel Corporation Device 2fb7
   Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx

So if I just want to get the total amount of memory accesses (RD+WR), then I can do this in the following way, using Table 2-114, which shows how to compute the value I have to write into the control registers:

# We want to see the total RD+WR on the channel
# value is a 32-bit register, listed in 0:31 order
# 7:0       ev_sel      -> p. 113 0x04
# 15:8     umask      -> p. 116 b00001111 or 0x0F
#16         rsv             -> reserved, must be 0
#17         rst              -> reset, so 1 at start of the application
#18         edge_det -> I think this has to be 0
#19         ig               -> ignored, so I will choose 0
#20         ov_en      -> At the moment I choose 0, but perhaps this is a useful feature
#21         rsv            -> reserved, must be 0
#22         en             -> Local Counter Enable: 1
#23         invert        -> we don't want to invert the event, so we set it to 0
#24:31   thresh       -> I am not sure, but I think this is 0x00, as in the other topic

#So the final bit pattern (written bit 0 first) is b0010_0000_1111_0000_0100_0010_0000_0000, or 0x00420F04

#Set 
#Channel0
MASK="0x00420F04"
for FUNCTION in  0 1
do
      setpci -s 7f:15.${FUNCTION} 0xD8.L=${MASK} 
done

RUN APPLICATION #Overflow of the counter is only a problem if the application runs for more than 290 hours

#read the final values. Remember the reset of the counter!
for FUNCTION in 0 1
do
      #lower 32 bits
      setpci -s 7f:15.${FUNCTION} 0xA0.l
      #higher 32 bits
      setpci -s 7f:15.${FUNCTION} 0xA4.l
done
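The control value worked out in the comments above can be cross-checked with shell arithmetic. This is only a sketch: the function name `imc_ctl_value` is made up, it covers just the four fields that are non-zero or explicitly chosen here, and the bit positions are the ones listed in the comments above.

```shell
# imc_ctl_value EV_SEL UMASK RST EN
# ev_sel goes to bits 7:0, umask to bits 15:8, rst to bit 17, en to bit 22.
# All other fields (edge_det, ov_en, invert, thresh) are left at 0.
imc_ctl_value() (
    ev=$1; umask=$2; rst=$3; en=$4
    printf '0x%08X\n' "$(( ($ev & 0xFF) | (($umask & 0xFF) << 8) | (($rst & 1) << 17) | (($en & 1) << 22) ))"
)

# Event 0x04 with umask 0x0F, reset and enable both set.
imc_ctl_value 0x04 0x0F 1 1
```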

Please correct me if there is any mistake. If I understood you right, setpci is required because I have no interface under /sys/devices that is valid for perf. But I have seen the following for the corresponding PCI devices:

user$ ls /sys/devices/pci0000\:7f/0000\:7f\:15.0/

broken_parity_status      device         local_cpulist  numa_node  resource          uevent
class                     dma_mask_bits  local_cpus     power      subsystem         vendor
config                    enable         modalias       remove     subsystem_device
consistent_dma_mask_bits  irq            msi_bus        rescan     subsystem_vendor

And /proc/iomem shows me the following for the PCI config space:

80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]

This is not really like what I see when perf is working, but perhaps we can use this to get perf working, at least with raw events? Or is the provided interface insufficient, so that I have to write my own through a kernel module?

If I have to write a kernel module, then, if I have understood you right, I can calculate the addresses of the registers from /proc/iomem and update /sys/devices to get perf working. Right?

What is not really clear to me is how to calculate the start address for each iMC from the PCI MMCONFIG address. I assume that this is related to the bus:slot.func values, but I don't know the convention, so a hint would be great.

I am now also able to understand the Intel document with respect to the offsets. The important information was the 32-bit PCI limitation. This really solved it. The offsets are in Table 2-111 on page 109 in the "PCIFG Address" column.

Thank you very much for your help.

McCalpinJohn · ‎01-20-2017

There are many approaches to reading the uncore iMC counters:

Using "perf" with raw events
- This requires a kernel that understands the uncore devices (even if it does not understand the events in those devices).
- It requires that the /sys/bus/event_source/devices/ directory contain entries named "uncore_imc_" (the numbers should be 0,1,4,5 on the Xeon E5-2680 v3).
- I included the script that I use on Xeon E5 v1 systems in the forum discussion at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/517598#comment-1792608
  - On Xeon E5 v3, Event 0x01 (ACTivates) needs a different Umask -- use 0x0B on the newer parts (as I showed in my other post referenced above)
  - If the OS is old (e.g., 2.6.32), you will need the "-C" option to "perf stat" so that it will only read the counters on the specific cores you requested (otherwise "perf stat" will read the uncore counters using all of the cores and add the results together, giving a number that is much too large). Cores 0 and 9 (used in my example) are in different sockets on my system, but a better approach is to use 0 and "max core number" -- these are on different sockets for every core numbering scheme that I have seen.
- You need to run as root unless the value of /proc/sys/kernel/perf_event_paranoid is less than or equal to zero.
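A quick way to check the second requirement, and to build a raw-event list if it is met, is sketched below. The function name `imc_raw_events` is hypothetical, and the encoding event=0x04,umask=0x0f is simply the RD+WR selection discussed earlier in this thread; the field names accepted by a given kernel should be checked in the `format/` directory of each uncore_imc entry.

```shell
# imc_raw_events [DIR]
# Print a comma-separated raw-event list, one entry per uncore_imc_* PMU
# found under DIR (default: /sys/bus/event_source/devices).
# Exits non-zero if the kernel exposes no such PMU.
imc_raw_events() (
    dir=${1:-/sys/bus/event_source/devices}
    list=
    for pmu in "$dir"/uncore_imc_*; do
        [ -e "$pmu" ] || continue
        name=${pmu##*/}
        list="${list:+$list,}$name/event=0x04,umask=0x0f/"
    done
    if [ -z "$list" ]; then
        echo "no uncore_imc_* entries under $dir" >&2
        exit 1
    fi
    printf '%s\n' "$list"
)
```

If the function prints a list, an invocation might look like `perf stat -a -C 0 -e "$(imc_raw_events)" sleep 1`, with the "-C" argument adjusted as described above.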
Using "perf" with named events
- This requires a fairly new kernel (e.g., 3.10 or newer for Xeon E5 v3).
- Again, the /sys/bus/event_source/devices/ directory must contain entries named "uncore_imc_"
- In addition, the directories /sys/bus/event_source/devices/uncore_imc_*/events/ must contain files with the names of the events you are interested in.
- The format of the "events" directories has changed, and I have not tested this on the newer kernels.
- You need to run as root unless the value of /proc/sys/kernel/perf_event_paranoid is less than or equal to zero.
Using "setpci" to read/write PCI configuration space
- The example script I provided in the link referenced above uses "setpci" (as root) to program my favorite set of four events into the IMC counters of each channel on each controller on each socket.
- My bash script to read the counters is included below
Using the PCI device drivers to read/write PCI configuration space
- This approach is used by the tacc_stats project (https://github.com/TACC/tacc_stats)
  - The code to open and read the device driver is defined in the header file at https://github.com/TACC/tacc_stats/blob/master/monitor/src/intel_hsw_uncore.h
  - The functions defined in the header file are used in https://github.com/TACC/tacc_stats/blob/master/monitor/src/intel_hsw_imc.c to actually read the IMC counters.
  - The code is a bit hard to read due to the way the macro pre-processor is used, but there should be enough to figure out how to use this interface.
- An alternative is to look at the "pciutils" source distribution to see how "setpci.c" accesses these device drivers.
Using /dev/mem to read/write PCI configuration space
- I use this approach on Xeon Phi x200 processors (where there are lots and lots of counters and scripts are relatively slow).
- Obviously this requires root access.
  - Code should be checked very carefully before use -- it is quite easy to crash a system with incorrect accesses to /dev/mem
  - I usually program the counters externally using "setpci", and mmap() /dev/mem in a READONLY state for reading the counters.
- To make the logic easier, I mmap() the entire 256 MiB range of PCI configuration space, and associate it with a pointer to an array of unsigned 32-bit integers.
- I then use a simple piece of C code to translate from the bus:device:function:offset to an index into that mmap'd array.
  - Appended below (beneath the shell script).
- Reading the counters is then as simple as accessing array elements.
  - The accesses are 32-bits at a time, which then have to be concatenated.
  - This is similar to the core performance counters, which return their results in two 32-bit registers.
- Because the two 32-bit reads of each PCI configuration space counter cannot be made atomic, there is a small window in which the lower and upper reads will be inconsistent.
  - If the lower half is very close to rolling over (e.g., 2^32 - 100), then the counter may roll over and increment the upper half before you read it. Combining these two inconsistent parts would give a value that is approximately 2^32 larger than the correct answer. (Reading the counters in the opposite order does not help -- it just changes the sense of the error.)
  - So to be safe you can read each counter twice (low, high, low, high) and check to make sure that the concatenated results are consistent.
    - It may be safe to skip the re-read if the lower half is far away from its upper bound of 2^32-1, but it is difficult to be confident in any particular set of heuristics here.
  - I have not seen this error in practice, but it is bound to happen sooner or later....
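The low/high/low/high check can be sketched in shell as follows. This is an illustration, not the code I use: `read32` is a hypothetical hook (here wrapping setpci, which needs root), the retry limit of 5 is arbitrary, and the acceptance rule assumes the counter advances by less than 2^32 between the two pairs of reads.

```shell
# read32 BDF OFFSET: print one 32-bit register as hex without a 0x prefix.
read32() { setpci -s "$1" "$2.L"; }

# read_counter48 BDF LO_OFFSET HI_OFFSET
# Read (low, high) twice. Accept the second value only if it is not smaller
# than the first and less than 2^32 ahead of it; otherwise read again.
read_counter48() (
    bdf=$1; lo_off=$2; hi_off=$3
    tries=0
    while [ "$tries" -lt 5 ]; do
        lo1=$(read32 "$bdf" "$lo_off"); hi1=$(read32 "$bdf" "$hi_off")
        lo2=$(read32 "$bdf" "$lo_off"); hi2=$(read32 "$bdf" "$hi_off")
        v1=$(( (0x$hi1 << 32) | 0x$lo1 ))
        v2=$(( (0x$hi2 << 32) | 0x$lo2 ))
        diff=$(( $v2 - $v1 ))
        if [ "$diff" -ge 0 ] && [ "$diff" -lt 4294967296 ]; then
            printf '%d\n' "$v2"
            exit 0
        fi
        tries=$(( $tries + 1 ))
    done
    echo "read_counter48: no consistent pair after 5 tries" >&2
    exit 1
)

# Example (as root): read_counter48 7f:14.0 0xA0 0xA4
```

A rollover between the low and high read of either pair makes the two values differ by roughly 2^32 in one direction or the other, so both cases fail the range check and trigger a re-read.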
Other tools may support the uncore counters even if the underlying OS does not.
- The "likwid" project has a number of useful tools such as likwid-perfctr (https://github.com/RRZE-HPC/likwid)
(Added in edit): Build your own loadable kernel module
- This has the advantage of allowing you to return many performance counter values in a single call -- dramatically reducing the overhead of transitioning in and out of the OS for every counter read (or half counter read).

Code bits appended.....

-------------------------

#!/bin/bash

echo "Reading IMC CycleCounter and Programmable counters "
echo "   Programmable counters are set to:"
echo -n "IMC_PerfEvtSel_0 0x"
setpci -s 7f:14.0 0xd8.l
echo -n "IMC_PerfEvtSel_1 0x"
setpci -s 7f:14.0 0xdc.l
echo -n "IMC_PerfEvtSel_2 0x"
setpci -s 7f:14.0 0xe0.l
echo -n "IMC_PerfEvtSel_3 0x"
setpci -s 7f:14.0 0xe4.l

BUS_by_Socket[0]=7f
BUS_by_Socket[1]=ff
DEVICE_by_IMC[0]=14
DEVICE_by_IMC[1]=17
FUNCTION_by_CHANNEL[0]=0
FUNCTION_by_CHANNEL[1]=1

for SOCKET in 0 1
do
	BUS=${BUS_by_Socket[$SOCKET]}
	for IMC in 0 1
	do
		DEVICE=${DEVICE_by_IMC[$IMC]}
		for CHANNEL in 0 1
		do
			FUNCTION=${FUNCTION_by_CHANNEL[$CHANNEL]}
			hi=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xD4.L`
			lo=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xD0.L`
			echo "Bus $BUS, Device $DEVICE, Function $FUNCTION, Cycle Counter, hi $hi, lo $lo, decimal 64-bit $((0x$hi$lo))"
			hi=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xA4.L`
			lo=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xA0.L`
			echo "Bus $BUS, Device $DEVICE, Function $FUNCTION, Counter 0, hi $hi, lo $lo, decimal 64-bit $((0x$hi$lo))"
			hi=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xAC.L`
			lo=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xA8.L`
			echo "Bus $BUS, Device $DEVICE, Function $FUNCTION, Counter 1, hi $hi, lo $lo, decimal 64-bit $((0x$hi$lo))"
			hi=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xB4.L`
			lo=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xB0.L`
			echo "Bus $BUS, Device $DEVICE, Function $FUNCTION, Counter 2, hi $hi, lo $lo, decimal 64-bit $((0x$hi$lo))"
			hi=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xBC.L`
			lo=`setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xB8.L`
			echo "Bus $BUS, Device $DEVICE, Function $FUNCTION, Counter 3, hi $hi, lo $lo, decimal 64-bit $((0x$hi$lo))"
		done
	done
done

--------------------

#include <assert.h>
#include <stdio.h>

/* Translate bus:device.function plus a byte offset into an index into an
   array of 32-bit words that maps the whole 256 MiB PCI MMCONFIG region.
   Field widths: 8-bit bus, 5-bit device, 3-bit function, 12-bit offset. */
unsigned int PCI_cfg_index(unsigned int Bus, unsigned int Device, unsigned int Function, unsigned int Offset)
{
    unsigned int byteaddress;
    unsigned int index;
    assert (Bus < (1<<8));
    assert (Device < (1<<5));
    assert (Function < (1<<3));
    assert (Offset < (1<<12));
#ifdef DEBUG
    printf("Bus=%x,(Bus<<20)=%x\n",Bus,(Bus<<20));
    printf("Device=%x,(Device<<15)=%x\n",Device,(Device<<15));
    printf("Function=%x,(Function<<12)=%x\n",Function,(Function<<12));
    printf("Offset=%x,(Offset)=%x\n",Offset,Offset);
#endif
    byteaddress = (Bus<<20) | (Device<<15) | (Function<<12) | Offset;
    index = byteaddress / 4;    /* 4 bytes per 32-bit array element */
    return ( index );
}

Stephan_W_ · ‎02-09-2017

Thanks for the really long answer! I have discussed the options, but my boss is not really happy with any of them, because he worries about security issues as well as increased noise on the system and therefore reduced performance for the users.

McCalpin, John wrote:

(Added in edit): Build your own loadable kernel module

This has the advantage of allowing you to return many performance counter values in a single call -- dramatically reducing the overhead of transitioning in and out of the OS for every counter read (or half counter read).

Is the overhead reduction relative to using lspci, or relative to tools like likwid and perf?

I also thought about the possibility of reading the performance counters via RDMA accesses, but I have seen that at least the CPU performance counters are only accessible through the rdmsr instruction. I was not able to determine whether this is also true for counters that are mapped through the PCI config space.

It would be great if it were possible to get the iMC counters through an RDMA access.

Since we have diskless nodes, I have also thought about how to get the data off the node with as little impact as possible on node performance. My idea is to write a kernel module that gathers cycles, instructions, etc. per CPU and stores them in a small double buffer of 1 MB or so each, which is accessible through RDMA. Do you think this is possible, and would it have an overhead advantage compared to running perf or likwid as a daemon on each system that writes one file per node into a Lustre filesystem?

McCalpinJohn · ‎02-09-2017

I don't have any systematic collections of performance counter overhead measurements, but they tend to fall into categories:

Very Fast (<0.1 microsecond)
- Execution of RDPMC instruction to read a core counter on the current core.
- Execution of a RDMSR instruction on the current core in kernel space.
- Execution of a 32-bit uncached load from memory-mapped PCI configuration space.
Fairly fast (1-3 microseconds)
- Reading a core performance counter on the local processor via PAPI or the "perf events" API.
- Reading an MSR on the local chip via the /dev/cpu/*/msr device driver
  - This uses "pread()" to read a single 8-Byte MSR.
  - The /dev/cpu/*/msr device drivers don't support multiple MSR reads in one call.
- Read 32 bits from memory-mapped PCI configuration space on the local chip using the /proc/bus/pci/* device drivers.
- Note that these last two assume that the device driver files are already open -- this is just the overhead per read.
Medium (10 microseconds)
- Reading an MSR on a remote socket via the /dev/cpu/*/msr device driver.
- Read 32 bits from memory-mapped PCI configuration space on a remote socket using the /proc/bus/pci/* device drivers.
- Read multiple (contiguous) performance counter values from PCI configuration space on the local socket using the /proc/bus/pci/* device drivers.
  - The PCI device driver interfaces allow larger block reads, so you can get all 5 IMC counters for a channel (4 programmable plus one fixed-function cycle counter) in a single 40-Byte "pread()" call.
- Note that these assume that the device driver files are already open -- this is just the overhead per read.
Slow (500-1000 microseconds)
- Reading an MSR using any core by launching the "rdmsr" command line tool: e.g., "rdmsr -p 2 -d 0x10"
- This is probably a measurement of the shell fork/exec overhead more than anything else.
Very Slow (>7500 microseconds)
- Reading a 32-bit PCI configuration space counter by launching the "setpci" command line tool: e.g. "setpci -s 7f:14.0 0xD0.L"
- Read the entire 256-Byte PCI Configuration space for a device with text output: e.g., "lspci -xxxx -s 7f:14.0"
  - This gets all the performance counters from a device in one call, and costs less than twice as much as a single 32-bit read using "setpci".

These values are a combination of recollections and today's measurements on a Xeon E5-2690 v3 running Centos 7 (3.10.0-327) kernel.

These values are small enough that I don't worry about the overhead for monitoring whole-program execution using even the slowest of these approaches in a bash script, but this assumes that I am running before and after the workload and not running at the same time. My script to read 40 IMC counters (80 executions of "setpci" reading 32-bit values) takes slightly over 0.5 seconds to run. This is consistent with the 7.5 milliseconds mentioned above for the "setpci" command -- 7.5 milliseconds * 80 reads = 600 milliseconds -- quite close to the observed average of about 550 milliseconds.

On KNL, launching executables is a lot slower -- my script to read all the MCDRAM and DDR4 counters on the chip took 3 seconds of wall clock time. This was too high to tolerate, so I replaced it with a compiled C program that used the /dev/cpu/*/msr and /dev/mem interfaces to read all the counters in about 0.05 seconds.

I don't know if RDMA can be used for PCI Configuration Space accesses. If this works it would certainly reduce the local CPU overhead....

The "tacc stats" tool (https://github.com/TACC/tacc_stats) runs a daemon in the background that sleeps between counter reads. This avoids the overhead of process creation and opening the device drivers for each sample. It reads a lot of counters (generating over 10KiB of text output for each snapshot), so we only run it every 10 minutes, but it would not be difficult to disable many of these reads to make it a lighter-weight tool.
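The sleep-and-sample pattern itself is simple; a minimal sketch is below. It is not how tacc_stats is implemented. The names `delta48`, `cas_to_mib_per_s` and `sample_loop` are invented, the reader command is a hypothetical hook that prints the current counter value in decimal, and the conversion assumes the counted event is a CAS count where each CAS moves one 64-byte cache line.

```shell
MASK48=$(( (1 << 48) - 1 ))

# delta48 PREV CUR: difference of two 48-bit counter samples,
# correct across a single wrap-around of the counter.
delta48() ( printf '%d\n' "$(( ($2 - $1) & $MASK48 ))" )

# cas_to_mib_per_s DELTA SECONDS: DELTA CAS operations of 64 bytes each,
# spread over SECONDS whole seconds, as integer MiB/s.
cas_to_mib_per_s() ( printf '%d\n' "$(( $1 * 64 / $2 / 1048576 ))" )

# sample_loop INTERVAL READER: call READER every INTERVAL whole seconds and
# print a timestamp plus the bandwidth seen since the previous sample.
sample_loop() {
    interval=$1; reader=$2
    prev=$($reader)
    while sleep "$interval"; do
        cur=$($reader)
        echo "$(date +%s) $(cas_to_mib_per_s "$(delta48 "$prev" "$cur")" "$interval") MiB/s"
        prev=$cur
    done
}
```

Keeping one long-running process and sleeping between samples avoids paying process start-up on every sample, although a shell loop that calls external tools for each read still forks for every one of them.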

Configure and read iMC Performance Counters with Haswell E5 and missing /sys/devices/ export