Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Get RING_THRU_DN_BYTES and RING_THRU_UP_BYTES

GHui
Novice
2,430 Views

I'm reading xeon-e5-v3-uncore-performance-monitoring.pdf, and I want to collect RING_THRU_DN_BYTES and RING_THRU_UP_BYTES.

I don't know how to find the bus, device, and function numbers or the event codes, or which device file I should open.

 

 

0 Kudos
19 Replies
McCalpinJohn
Honored Contributor III
2,430 Views

These are listed as derived metrics for three different units in the processor uncore.

For the CBo units and the SBo units, the performance counters are accessed by MSRs, which you can read and write through the /dev/cpu/*/msr interface.

For the R2PCIe units, the performance counter interface is in Device 16 (decimal), Function 1.   On a properly configured Linux system with a properly configured BIOS, the output of "lspci" will include lines like:

7f:10.1 Performance counters: Intel Corporation Xeon E5 v3/Core i7 PCIe Ring Interface (rev 02)

  • The first two characters (before the ":") are the PCI bus number.  The uncore devices on each socket are mapped to different PCI buses.  My two-socket systems typically use "7f" and "ff", though some use "1f" and "7f".  If the devices are there, they will be easy to find.
  • The one or two characters between the ":" and the "." are the hexadecimal device number, so the "10" above is Device 16 (decimal).
  • The character after the "." is the Function, so this is Function 1.

Linux exposes this Bus:Device.Function as a pseudo-file at

/proc/bus/pci/7f/10.1

This file is used in the same way as the /dev/cpu/*/msr device files -- typically with pread() and pwrite() calls.

Because these interfaces are in PCI configuration space, they can also be accessed from the command line, using the "setpci" command to read or write byte/word/doubleword quantities, or using the "lspci" command with the "-xxx" option to dump the entire PCI configuration space for a particular device.

Everything here requires root access.  

Writing the wrong things to the wrong places in PCI configuration space or in MSR space could hose your system.
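A minimal, read-only sketch of the pread() access pattern described above, assuming a Linux system where the device shows up under /proc/bus/pci. The helper name pci_cfg_read32 is hypothetical, and the example path is the one from the lspci line above; the bus number will differ between systems.

```c
#define _XOPEN_SOURCE 700   /* for pread() under strict compiler modes */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: read one 32-bit register at byte offset `off`
 * from a PCI configuration-space pseudo-file such as
 * /proc/bus/pci/7f/10.1. Returns 0 on success, -1 on any failure
 * (file cannot be opened, or fewer than 4 bytes were read). */
static int pci_cfg_read32(const char *path, off_t off, uint32_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, out, sizeof(*out), off);
    close(fd);
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

Reading offset 0 this way returns the vendor and device IDs in one 32-bit value, which is a harmless first test before any register is written.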

0 Kudos
GHui
Novice
2,430 Views

Thanks for your help.

In my system it is

7f:10.1 Performance counters: Intel Corporation Xeon E5 v3/Core i7 PCIe Ring Interface (rev 02)
ff:12.1 Performance counters: Intel Corporation Xeon E5 v3/Core i7 Home Agent 0 (rev 02)

So I open it with the following code.

  int bus[2]={0x7f,0xff};
  int fd[2];
  for(i=0;i<2;i++)
  {
    sprintf(filename,"/proc/bus/pci/%x/10.1",bus[i]);
    fd[i]=open(filename,O_RDWR); // read-write, because the control registers are written later
  }

And I set the R2_PCI_PMON_BOX_CTL, R2_PCI_PMON_CTL2, and R2_PCI_PMON_CTL3.

  for(i=0;i<2;i++)
  {
    value=1|1<<1;
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xF4); // R2_PCI_PMON_BOX_CTL
    value=0x09|0xC<<8|0x1<<22; // RING_THRU_DN_BYTES CCW
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xE0); // R2_PCI_PMON_CTL2
    value=0x09|0x3<<8|0x1<<22; // RING_THRU_UP_BYTES CW
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xE4); // R2_PCI_PMON_CTL3
  }

Then I read R2_PCI_PMON_CTR2 and R2_PCI_PMON_CTR3.

  for(i=0;i<2;i++)
  {
    pread(fd[i],(void*)&val1,sizeof(uint32),0xB4);
    pread(fd[i],(void*)&val2,sizeof(uint32),0xB0);
    pread(fd[i],(void*)&val3,sizeof(uint32),0xBC);
    pread(fd[i],(void*)&val4,sizeof(uint32),0xB8);
    printf("[%d] B4=%u B0=%u BC=%u B8=%u\n",i,val1,val2,val3,val4);
  }

I get the following output.

[0] B4=0 B0=0 BC=0 B8=0
[1] B4=0 B0=0 BC=0 B8=0
[0] B4=0 B0=0 BC=0 B8=0
[1] B4=0 B0=0 BC=0 B8=0

I don't know how to test it, but I think my result is wrong. Maybe I set the wrong *CTL registers.

 

 

0 Kudos
McCalpinJohn
Honored Contributor III
2,430 Views

This looks OK, but there are a couple of other items that need to be checked...

  1. Make sure that the U_MSR_PMON_GLOBAL_CTL register (MSR 0x0700) is not set to "freeze" all of the uncore counters (as described in section 2.1 of the Xeon E5 v3 uncore performance monitoring guide).
  2. It is always a good idea to check the PCI device ID in the code to make sure you are in the right place.   The 16 bit field at offset 0 should be 0x8086 (indicating an Intel device), and the next 16-bit field should be 0x2f34 (the "DID" for this uncore device).

You can also use "lspci" after running your program as an independent check that you have written to the desired addresses.

To verify the company and device you can run this as a normal user, but to get beyond 64 Bytes you need to run as root.   On my Xeon E5-2667 v3 I see:

$ /sbin/lspci -xxx -s ff:10.1
ff:10.1 Performance counters: Intel Corporation Xeon E5 v3/Core i7 PCIe Ring Interface (rev 02)
00: 86 80 34 2f 00 00 00 00 02 00 01 11 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 34 2f
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Note that "lspci" prints by byte with the lowest address to the left, so the first 16 bits is 0x8086 and the next 16 bits is 0x2f34, as desired.  When I run the same command as root I get the rest of the first 256 Bytes of PCI configuration space, but nothing is in there since I have not attempted to program those counters.
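The byte-order point above can be sketched in code. This is only an illustration, assuming the first four configuration-space bytes have already been read into a buffer; the function names le16 and ids_match are hypothetical.

```c
#include <stdint.h>

/* Combine two configuration-space bytes, lowest address first (the order
 * in which lspci prints them), into the 16-bit little-endian value. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical check: return 1 if the first four bytes of a device's
 * configuration space hold the expected vendor ID and device ID. */
static int ids_match(const uint8_t cfg[4], uint16_t vendor, uint16_t device)
{
    return le16(cfg) == vendor && le16(cfg + 2) == device;
}
```

With the bytes 86 80 34 2f from the dump above, le16 yields 0x8086 for the vendor and 0x2f34 for the device.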

0 Kudos
GHui
Novice
2,430 Views

I am reading the R2PCIe events, so I open /proc/bus/pci/7f/10.1 and /proc/bus/pci/ff/10.1. But U_MSR_PMON_GLOBAL_CTL is accessed through /dev/cpu/*/msr, and there are more than two such files on my machine. I don't understand how they relate. Which msr file controls the PCI devices?

The document says:

OR (if box level freeze control preferred)
a) Freeze the box’s counters while setting up the monitoring session.
e.g., set Cn_MSR_PMON_BOX_CTL.frz to 1

So I set the freeze bit in R2_PCI_PMON_BOX_CTL and changed my setup code as follows.

  for(i=0;i<2;i++)
  {
    value=0x1<<8; // freeze the box while setting up
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xF4); // R2_PCI_PMON_BOX_CTL
    value=0x1<<22;
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xE0);
    value=0x09|0xC<<8|0x1<<22; // the unit mask belongs in bits 15:8
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xE0); // RING_THRU_DN_BYTES CCW
    value=0x1<<22;
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xE4);
    value=0x09|0x3<<8|0x1<<22;
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xE4); // RING_THRU_UP_BYTES CW
    value=0x1|0x1<<1|0x1<<8;
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xF4);
    value=0x1|0x1<<1;
    pwrite(fd[i],(void*)&value,sizeof(uint32),0xF4);
  }

This made no difference. I still get the same result.

[0] B4=0 B0=0 BC=0 B8=0
[1] B4=0 B0=0 BC=0 B8=0
[0] B4=0 B0=0 BC=0 B8=0
[1] B4=0 B0=0 BC=0 B8=0
[0] B4=0 B0=0 BC=0 B8=0
[1] B4=0 B0=0 BC=0 B8=0

 

 

0 Kudos
GHui
Novice
2,430 Views

I am also confused about how to tell whether the CPU bus is 3f, 7f, bf, or ff. None of 3f, 7f, bf, or ff shows up on the E5-4620.

0 Kudos
McCalpinJohn
Honored Contributor III
2,430 Views

(1) The U_MSR_PMON_GLOBAL_CTL MSR is accessible from every core, but it has "package scope", so all cores in a package are accessing the same single register in the UBox of the uncore. As described in Section 2.2 of the Xeon E5 v3 Uncore Performance Monitoring Guide, the UBox serves as the "system configuration controller" for the processor and is the "master for reading and writing the physically distributed registers across [the processor]". This means that the UBox manages access to both the MSRs (at least the ones outside the local core) and PCI configuration space, so it is no surprise that the global control register for the uncore performance counters is located here.

(2) As described in Chapter 1 of Volume 2 of the processor datasheet, the bus number used for the processor uncore device configuration space can be located using the CPUBUSNO register.  The CPUBUSNO register (described in section 6.6.33) is located on bus 0, device 5, function 0, offset 0x108, bits 15:8.  On one of my Xeon E5 v3 boxes I see the value of 0x7f in this bit field:

$ setpci -s 0:5.0 0x108.l
00017f00

All of my 2-socket systems use either [3f,7f] or [7f,ff], while the 4-socket boxes all use [3f,7f,bf,ff]. Although the procedure seems overly complex, Section 1.6.1 of the Xeon E5 v3 Uncore Performance Monitoring Guide provides explicit documentation for finding the bus numbers.
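The bit-field extraction from the setpci output above can be sketched as a one-line helper. The function name is hypothetical; the field position (bits 15:8) is the one quoted from the datasheet above.

```c
#include <stdint.h>

/* Extract the uncore bus number from the 32-bit CPUBUSNO register value
 * (bus 0, device 5, function 0, offset 0x108); the field is bits 15:8. */
static unsigned uncore_bus_from_cpubusno(uint32_t cpubusno)
{
    return (cpubusno >> 8) & 0xFFu;
}
```

For the value 00017f00 printed by setpci above, this returns 0x7f.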

It is important to realize that the BIOS must understand and properly configure these buses or the system would not work.   If the OS cannot see them, this appears to be because the BIOS refuses to grant the OS permission to control the buses.   They are still present and functional, but if the BIOS refuses to allow control during the PCI discovery process, the OS will not enumerate the buses and build the internal databases used by lspci, and the OS will not build the corresponding pseudo-files under the /proc, /sys, and /dev directories.

You can still access these functions, but you have to do it using the memory-mapped interface.  PCI configuration space is accessed as a contiguous 256 MiB region of memory-mapped IO space.  You can usually find it immediately by looking for "PCI MMCONFIG" in the output of "cat /proc/iomem".    Each of the possible buses, devices, and functions maps to a 4KiB block in this range.   If I recall correctly, the offset into this range is computed by concatenating the bits:

Bus number: bits 27:20

Device number: bits 19:15

Function: bits 14:12

Offset: bits 11:0

Then this is added to the PCI MMCONFIG base address and used as the physical address using the /dev/mem interface.  (Actually it makes more sense to "mmap()" a 256 MiB range starting at the PCI MMCONFIG base address and then using the computed address directly.)
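A small sketch of the offset calculation, assuming the standard memory-mapped configuration layout with the bus number in bits 27:20, the device in bits 19:15, the function in bits 14:12, and the register offset in bits 11:0. The function name is hypothetical, and the result still has to be added to the PCI MMCONFIG base address taken from /proc/iomem.

```c
#include <stdint.h>

/* Byte offset of a register inside the 256 MiB PCI MMCONFIG window.
 * Out-of-range inputs are masked to the width of their field. */
static uint32_t mmconfig_offset(unsigned bus, unsigned dev,
                                unsigned fn, unsigned off)
{
    return ((uint32_t)(bus & 0xFFu) << 20) |
           ((uint32_t)(dev & 0x1Fu) << 15) |
           ((uint32_t)(fn  & 0x07u) << 12) |
           (uint32_t)(off & 0xFFFu);
}
```

For example, bus 0x7f, device 0x10, function 1, offset 0xF4 maps to offset 0x07F810F4 within the window.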

Obviously you can get in a lot of trouble if you write the wrong things to the wrong addresses through the /dev/mem interface, so it definitely pays to do a lot of read-only testing first.   All of our codes that use this interface check the first two 16-bit fields of each 4KiB region -- the first 16-bit field should be 0x8086 (indicating an Intel device), while the second 16-bit field should be the Device ID listed in Volume 2 of the processor datasheet or in the corresponding section of the Uncore Performance Monitoring Guide.  If these don't match, you are either running on the wrong system or you have an addressing error.

0 Kudos
GHui
Novice
2,430 Views

My write to U_MSR_PMON_GLOBAL_CTL failed.

Here is my code

  uint32 value;
  sprintf(filename,"/dev/cpu/0/msr");
  msrfds=open(filename,O_RDWR);

  value=0x1<<31ULL;
  printf("%x\n",value);
  rs=pwrite(msrfds,(void*)&value,sizeof(uint32),0x0700); // U_MSR_PMON_GLOBAL_CTL
  perror("msr pwrite");
  printf("RETURN 0x0700 write: %ld %d\n",rs,msrfds);

Here is the output

80000000
msr pwrite: Invalid argument
RETURN 0x0700 write: -1 5

 

 

0 Kudos
GHui
Novice
2,430 Views

I use the following code to detect the CPU UNCORE bus.

  for(bus_no=0;bus_no<256;bus_no++)
  {
    // Device 5, function 0 is the "Address Map" device; CPUBUSNO is at offset 0x108
    sprintf(filename,"/proc/bus/pci/%02x/05.0",bus_no);
    fd=open(filename,O_RDONLY);
    if(fd>=0) // 0 is a valid file descriptor
    {
      pread(fd,(void*)&value,sizeof(value),0x0);
      printf("BUS: %x DID: %lx\n",bus_no,value);
      if(0x2F288086==value)
      {
        pread(fd,(void*)&value,sizeof(value),0x108);
        printf("CPU UNCORE BUS: %x\n",(value&0x0FF00)>>8);
      }
      close(fd);
    }
  }

 

0 Kudos
McCalpinJohn
Honored Contributor III
2,430 Views

The last paragraph of section 2.10.2.2 of the Xeon E5 v3 Uncore Performance Monitoring Guide mentions that these registers must be written twice in a row in order to work. I don't know what "twice in a row" means in this context -- when I use the "setpci" command to write these locations the values are updated the first time I write the data, but the performance counters don't start counting. Re-writing the counters from the shell using setpci does not change the behavior.

I don't have any trouble writing to MSR 0x700 using the "wrmsr" command-line tool, but I have not tested to see whether this actually does anything.

0 Kudos
GHui
Novice
2,430 Views

I have tried it. The write to 0x700 must be 64 bits wide, but the document shows the register as 32 bits.

0 Kudos
GHui
Novice
2,430 Views

I have written to 0x700, but it does not seem to take effect.

  value64=1ULL<<31; // frz_all
  rs=pwrite(msrfd[0],(void*)&value64,sizeof(value64),0x0700); // U_MSR_PMON_GLOBAL_CTL
  rs=pwrite(msrfd[0],(void*)&value64,sizeof(value64),0x0700); // U_MSR_PMON_GLOBAL_CTL
  pread(msrfd[0],(void*)&value64,sizeof(value64),0x700); // MSR reads must also be 64 bits wide
  printf("0x700 frz_all: %llx\n",(unsigned long long)value64);

 

And I also use wrmsr

[root@hsw-01 msr-tools-1.1.2]# ./wrmsr 0x700 0x80000000
[root@hsw-01 msr-tools-1.1.2]# ./rdmsr 0x700
0

 

0 Kudos
GHui
Novice
2,430 Views

Here's my code to collect RING*.

gcc pci.c -o pci

0 Kudos
McCalpinJohn
Honored Contributor III
2,430 Views

Bit 31 of MSR 0x700 is clearly documented as a "Write Only" field (Table 2-2), so you should not expect to see it change.

The right way to test it is to see if it actually freezes counting of other uncore performance counters.  You mentioned that you were able to program the IMC counters -- start them up and see if writing to bit 31 of MSR 0x700 freezes the counts, and if writing to bit 29 unfreezes the counters.
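A minimal sketch of the freeze/unfreeze test described above, assuming an already-open /dev/cpu/N/msr descriptor and root access. The helper name msr_write64 and the macro names are hypothetical; the bit positions (31 to freeze, 29 to unfreeze) are the ones mentioned above. The write is a full 8 bytes because the msr device rejects shorter accesses, as the failed 32-bit write earlier in this thread shows.

```c
#define _XOPEN_SOURCE 700   /* for pwrite() under strict compiler modes */
#include <stdint.h>
#include <unistd.h>

#define U_MSR_PMON_GLOBAL_CTL 0x700
#define FRZ_ALL   (1ULL << 31)  /* write-only: freeze all uncore counters */
#define UNFRZ_ALL (1ULL << 29)  /* write-only: unfreeze all uncore counters */

/* Hypothetical helper: write a full 64-bit value to MSR `reg` through an
 * open /dev/cpu/N/msr descriptor. Returns 0 on success, -1 on failure. */
static int msr_write64(int fd, uint32_t reg, uint64_t value)
{
    ssize_t n = pwrite(fd, &value, sizeof(value), (off_t)reg);
    return n == (ssize_t)sizeof(value) ? 0 : -1;
}
```

A test would then read a running IMC counter twice after msr_write64(fd, U_MSR_PMON_GLOBAL_CTL, FRZ_ALL) and check whether the value stops changing, and repeat the readings after writing UNFRZ_ALL.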

0 Kudos
GHui
Novice
2,430 Views

I have code for the IMC counters, and it did not take effect there either. The bit settings are quite complex. I have uploaded the code; it compiles with "gcc pcimem.c". Could you check whether it runs correctly on your platform?

0 Kudos
McCalpinJohn
Honored Contributor III
2,431 Views

When using the IMC counters, I can't get the "freeze" function to work via MSR 0x700.

I can get the counters to "freeze" by writing a "1" to bit 8 of MC_CHy_PCI_PMON_BOX_CTL, but I cannot get the counters to "unfreeze" by writing a "0" to that spot.  I can only get them to "unfreeze" by rewriting the corresponding MC_CHy_PMON_CTL location.

My script to set up the IMC counters on Xeon E5 v3 works fine -- the IMC counts are all reasonable:

#!/bin/bash

# IMC Performance Events
# Most of our nodes have 2 channels on each of 2 IMCs
#     Buses [7f,ff], Devices [0x14,0x17], Functions [0,1]
# Each of these has four programmable counters
# Counter   Offset    Value       Description
#    0      0xD8   0x00400B01     ACT.(READ+WRITE+BYPASS) -- Umasks are new with Haswell
#    1      0xDC   0x00400304     CAS_COUNT.READS
#    2      0xE0   0x00400C04     CAS_COUNT.WRITES
#    3      0xE4   0x00400102     PRE_COUNT.MISS -- page closes due to page conflicts

echo "Setting up IMC Performance Counters"
for BUS in 7f ff
do
	for DEVICE in 14 17
	do
		for FUNCTION in 0 1
		do
			lspci -s ${BUS}:${DEVICE}.$FUNCTION
			setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xD8.L=0x00400B01
			setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xDC.L=0x00400304
			setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xE0.L=0x00400C04
			setpci -s ${BUS}:${DEVICE}.${FUNCTION} 0xE4.L=0x00400102
		done
	done
done
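The control values in the script above all follow the same pattern: event select in bits 7:0, unit mask in bits 15:8, and the enable bit at position 22. A small sketch of that encoding, with a hypothetical function name:

```c
#include <stdint.h>

/* Build a PMON control-register value: event select in bits 7:0,
 * unit mask in bits 15:8, counter enable in bit 22. */
static uint32_t pmon_ctl(uint8_t event, uint8_t umask)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | (1u << 22);
}
```

For example, pmon_ctl(0x04, 0x03) gives 0x00400304, the CAS_COUNT.READS value in the table, and pmon_ctl(0x09, 0x0C) gives the R2PCIe value used earlier in this thread for the counterclockwise ring event.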

 

0 Kudos
McCalpinJohn
Honored Contributor III
2,430 Views

I guess I should note that this script only works on Xeon E5 v3 processors that have 2 "Home Agents", with one IMC per Home Agent and 2 DRAM channels per IMC.   Some of the processors have only one Home Agent, with only one IMC and 3 or 4 DRAM channels attached to that IMC.

The number of Home Agents in each Xeon E5 v3 is listed in Table 1 of the Xeon E5 v3 Specification Update (document 330785).

0 Kudos
GHui
Novice
2,430 Views

The document says that I must perform steps a, b, c, d, e, and f to collect an event, but your script only sets the *CTL registers. Is the document wrong?

0 Kudos
GHui
Novice
2,430 Views

It seems that the "Address Map" device ID is 2F28h on Xeon E5 v3. On the other generations it is different, or there is no such "Address Map" device. I can't find the "Address Map" for the first-generation Xeon E5 (SNB). Does it differ between generations?

I can find volume 2 of the datasheet for Xeon E5 v3 and Xeon E5 v2, but not for the first-generation Xeon E5.

0 Kudos
McCalpinJohn
Honored Contributor III
2,430 Views

Section 2.1.2 of the Xeon E5 v3 Uncore Performance Monitoring Guide (document 331051) does list steps a,b,c,d,e,f, but they are definitely not all needed.  

  • I never use the "freeze" function, so steps a & f are not needed.
  • Steps b & c can be combined into a single write of the control register (as I do in my script).
  • Step d is only necessary if you want the counters to start at zero.   All I need to know is that the counter cannot wrap around more than once during the measurement interval.  If this is true, then it is easy to correct for the case where the counter overflows exactly once.
  • Step e is only necessary if you want to use the "interrupt on overflow" feature, which I do not use.
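The single-wrap correction mentioned for step d can be sketched as follows. This assumes 48-bit counters, each read as two 32-bit halves (for example offsets 0xB4 and 0xB0 for one of the R2PCIe counters read earlier in this thread); the width and the function names are assumptions for illustration.

```c
#include <stdint.h>

#define CTR_BITS 48                        /* assumed counter width */
#define CTR_MASK ((1ULL << CTR_BITS) - 1)

/* Join the high and low 32-bit halves of one counter reading. */
static uint64_t ctr_join(uint32_t hi, uint32_t lo)
{
    return (((uint64_t)hi << 32) | lo) & CTR_MASK;
}

/* Events counted between two readings. The result is correct as long as
 * the counter wrapped around at most once during the interval. */
static uint64_t ctr_delta(uint64_t start, uint64_t end)
{
    return (end - start) & CTR_MASK;
}
```

Because the subtraction is masked to the counter width, a reading that wrapped past zero once still produces the right difference without any special case.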

The "Address Map" PCI Configuration Space function has a different device ID for each processor generation, but it is Bus 0, Device 5, Function 0 in all three Xeon E5 generations.   The Device IDs are:

  • Xeon E5 2600 (gen 1, Sandy Bridge): DID 0x3C28, datasheet volume 2 is document 326509
  • Xeon E5 2600 (gen 2, Ivy Bridge): DID 0x0E28, datasheet volume 2 is document 329188
  • Xeon E5 2600 (gen 3, Haswell): DID 0x2F28, datasheet volume 2 is document 330784
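As a sketch, code that checks the "Address Map" device ID could map it to the generation with a lookup like the one below. The function name is hypothetical and only the three device IDs listed above are covered.

```c
#include <stddef.h>
#include <stdint.h>

/* Map the "Address Map" device ID (bus 0, device 5, function 0) to the
 * Xeon E5 generation listed above; returns NULL for any other ID. */
static const char *xeon_e5_generation(uint16_t did)
{
    switch (did) {
    case 0x3C28: return "Sandy Bridge";
    case 0x0E28: return "Ivy Bridge";
    case 0x2F28: return "Haswell";
    default:     return NULL;
    }
}
```

Returning NULL for an unknown ID lets the caller stop before touching any registers on a system it does not recognize.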
0 Kudos