Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

[PCM] ERROR: QPI LL monitoring device (0:127:9:2) is missing

Jeongseob_A_
Beginner

Hi all,

I am currently using the Intel PCM tool v2.8 on my machine (Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz), which has two sockets. Basically, I would like to measure QPI traffic; specifically, the traffic caused by L3 cache misses. As I understand it, if an L3 miss occurs, the request is sent to the other socket because that is faster than going to memory. So when an application that generates many cache misses runs on one NUMA node, I expect the other NUMA node to receive a lot of QPI traffic, and I would like to know how much. Unfortunately, when I run pcm.x I get the QPI error shown below.

------------------------------------------------------------------------------------------------------------------------------------------------

ERROR: QPI LL monitoring device (0:127:9:2) is missing. The QPI statistics will be incomplete or missing.
Socket 0: 2 memory controllers detected with total number of 5 channels. 1 QPI ports detected.
ERROR: QPI LL monitoring device (0:255:9:2) is missing. The QPI statistics will be incomplete or missing.
Socket 1: 2 memory controllers detected with total number of 5 channels. 1 QPI ports detected.

------------------------------------------------------------------------------------------------------------------------------------------------

Even with this error I do get statistics like the ones below, but they do not seem correct, because the QPI traffic percentages and the 'QPI data traffic/Memory controller traffic' field are not updated. In this experiment I pinned the application to SKT1. Could you give me some advice about this error?

------------------------------------------------------------------------------------------------------------------------------------------------

Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):

               QPI0    |  QPI0  
----------------------------------------------------------------------------------------------
 SKT    0       92 K   |    0%   
 SKT    1     5081 K   |    0%   
----------------------------------------------------------------------------------------------
Total QPI incoming data traffic: 5173 K     QPI data traffic/Memory controller traffic: 0.00

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

               QPI0    |  QPI0  
----------------------------------------------------------------------------------------------
 SKT    0      632 M   |    3%   
 SKT    1      628 M   |    3%   
----------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic: 1261 M

------------------------------------------------------------------------------------------------------------------------------------------------
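For reference, this is roughly how the pinned run is set up (just a sketch; ./my_app is a placeholder for the actual workload):

# numactl --cpunodebind=1 ./my_app &
# ./pcm.x 1

i.e. the application runs on the cores of socket 1 while pcm.x samples once per second.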

 

Thomas_W_Intel
Employee

"As I understand it, if an L3 miss occurs, the request is sent to the other socket because that is faster than going to memory."

Please excuse me if I'm stating the obvious, but this sentence indicates a misunderstanding.

Unless you populated your system unevenly, there is memory on both sockets. If a program running on socket 1 is accessing memory on socket 0, the request needs to go over the QPI link from socket 1 to socket 0, then through the memory controller on socket 0 to the DIMMs that are attached to socket 0. The data is then sent back from the DIMMs, through the memory controller in socket 0, over the QPI link connecting the two sockets, to socket 1.
If the program running on socket 1 is accessing memory on socket 1, going over the QPI link is not necessary and the access is faster.

Intel Memory Latency Checker lets you measure the time it takes to access local and remote memory on your system.

Since local memory access is faster, the OS normally tries to provide local memory to a program. Ideally, you would therefore not see QPI traffic but mostly local memory access. Please have a look at the memory traffic that PCM is reporting. If you pinned the program to socket 1, you should see the memory traffic on socket 1.
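For example, a quick way to check this (just a sketch; ./my_app stands in for your program, and the tools are run from their respective install directories):

# ./mlc --latency_matrix
# numactl --cpunodebind=1 --membind=1 ./my_app &
# ./pcm-memory.x 1

The first command prints the idle latency from every socket to every socket's memory, so you can see the local/remote difference directly. The second pins the program and its allocations to socket 1; pcm-memory.x should then show the read/write traffic on the memory channels of socket 1, with little corresponding QPI traffic in pcm.x.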

Jeongseob_A_
Beginner

Hi Thomas,

Thanks for your comment. I have an additional question. Even if an application is running on socket 0 and its memory is allocated on socket 0, I think the requests caused by L3 cache misses should still be sent to socket 1 because of cache coherence. Am I wrong?

 

Jeongseob

 

 

Thomas_W_Intel
Employee

You are right. These are the so-called "snoops" that are sent to the other sockets to check whether a cache line resides in a cache on one of them. However, the Intel Xeon processor E5-2630 v3 has a directory, which is used to reduce the snoop traffic. As you might guess, the effectiveness of the directory depends on the workload.

Jeongseob_A_
Beginner

I appreciate your comments. I did not know the processor implements a directory to maintain cache coherence. I am just wondering who decides the "home" of a given address in these systems. If the home is a remote node, all the traffic caused by L3 cache misses will be sent to that home node.

 

McCalpinJohn
Honored Contributor III

For coherence purposes, the "home" of any memory address is the chip that is physically connected to the DRAM containing that memory address.

For *most* systems, a local memory access is faster than the probe of the remote socket(s), so for clean data all you need from the remote socket is the "clean" response -- then you can use the data that you have already received from the local memory controller.

If the access is to *remote* data and the data is available in the remote L3 cache, then it can be faster to provide that data from the remote L3 cache (a "clean intervention") than from the remote DRAM.  Whether this is supported or not depends on lots of complicated details of the coherence protocol that are not typically obvious (or publicly documented).  I think that Intel's "F" state supports this type of "Forwarding"?

Historically, some systems have had faster cache-to-cache interventions than local (or remote) memory accesses, so in those cases the clean intervention from the remote cache is preferred.  The IBM POWER4/5 systems are in this category.  (I don't know about the relative latencies of more recent POWER processors.)
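As a practical aside: on Linux you can see where a process's pages are actually homed by looking at its NUMA mapping, for example

# grep anon /proc/$(pgrep my_app)/numa_maps

(my_app being whatever the workload is called). The N0=.../N1=... page counts in the output show how many pages of each mapping live on each node, i.e. which socket is the "home" for that data.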

Ilya_M_1
Beginner

Hi John, Thomas,

I'd like to revisit the PCM error that the OP described. I have a similar architecture (Xeon E5-2620) in a dual-socket system, and I am also getting the following errors:

ERROR: QPI LL monitoring device (0:127:9:2) is missing. The QPI statistics will be incomplete or missing.
Socket 0: 2 memory controllers detected with total number of 5 channels. 1 QPI ports detected.
ERROR: QPI LL monitoring device (0:255:9:2) is missing. The QPI statistics will be incomplete or missing.
Socket 1: 2 memory controllers detected with total number of 5 channels. 1 QPI ports detected.

Running memoptest and looking at the output of PCM, the QPI traffic values look right. But can these readings be trusted? How is PCM able to read the traffic without the QPI LL monitoring devices? The output of lspci is posted below.

# lspci | grep Perf
7f:08.2 Performance counters: Intel Corporation Device 2f32 (rev 02)
7f:0b.1 Performance counters: Intel Corporation Device 2f36 (rev 02)
7f:0b.2 Performance counters: Intel Corporation Device 2f37 (rev 02)
7f:10.1 Performance counters: Intel Corporation Device 2f34 (rev 02)
7f:10.6 Performance counters: Intel Corporation Device 2f7d (rev 02)
7f:12.1 Performance counters: Intel Corporation Device 2f30 (rev 02)
ff:08.2 Performance counters: Intel Corporation Device 2f32 (rev 02)
ff:0b.1 Performance counters: Intel Corporation Device 2f36 (rev 02)
ff:0b.2 Performance counters: Intel Corporation Device 2f37 (rev 02)
ff:10.1 Performance counters: Intel Corporation Device 2f34 (rev 02)
ff:10.6 Performance counters: Intel Corporation Device 2f7d (rev 02)
ff:12.1 Performance counters: Intel Corporation Device 2f30 (rev 02)

Thanks in advance,
Ilya 

Roman_D_Intel
Employee

Hi Ilya,

PCM expects to find performance monitoring devices (PMUs) for both QPI links, which should be available on an Intel Xeon E5-2620. It finds one at 0:127:8:2 (lspci 7f:08.2) but none at 0:127:9:2 (lspci 7f:09.2) for the first processor, and the same for the second processor at bus 0xff. Do you see all of the expected traffic over the single QPI link PMU that is exposed?
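You can cross-check this directly with lspci (bus 127 is 0x7f, bus 255 is 0xff):

# lspci -s 7f:08.2
# lspci -s 7f:09.2

If the second command prints nothing, the PMU for the second QPI port is indeed not present in PCI configuration space, which is exactly what PCM is complaining about.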

Thanks,

Roman

Ilya_M_1
Beginner

Hi Roman,

Now that you mention it, pcm indeed reports only about half of the traffic (432 M) that memoptest reports (727 M). The BIOS on the system is AMI 5.009, so I guess it's a BIOS issue.

I'm probably misunderstanding something, but shouldn't the corresponding outgoing and incoming counters be equal? AFAIK, QPI is a point-to-point interconnect, so this should also hold for systems with more than 2 sockets. But in that case, having separate counters for outgoing and incoming traffic would be redundant, so I guess I'm missing something here.

# numactl --cpunodebind=1 --membind=0 ./memoptest.x 0
Elements data size: 203125 KB
Reading memory
Bandwidth: 727.59 MByte/sec

# pcm.x

...

 

Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):

               QPI0    |  QPI0
----------------------------------------------------------------------------------------------
 SKT    0      905 K   |    0%
 SKT    1      383 M   |    2%
----------------------------------------------------------------------------------------------
Total QPI incoming data traffic:  384 M     QPI data traffic/Memory controller traffic: 0.48

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

               QPI0    |  QPI0
----------------------------------------------------------------------------------------------
 SKT    0      432 M   |    2%
 SKT    1       49 M   |    0%
----------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic:  482 M

Thanks,
Ilya

 

McCalpinJohn
Honored Contributor III

I have seen a number of bugs in the QPI link-layer counters that affect the transmitted and received packet counts differently, so that they don't match. In some cases one of the values is correct, and in some cases neither is correct. The bugs vary by processor family, and unfortunately some of the events that appeared to be correct on Sandy Bridge don't appear to be correct on Haswell. I have not worked on this in a while, so I can't find a coherent set of notes right now...
 

Ilya_M_1
Beginner

John,

But why were counters for both directions implemented in the first place? For packets lost along the way? :)

Thanks,
Ilya

Thomas_W_Intel
Employee

Ilya,

Measuring incoming and outgoing traffic separately is important on systems where a QPI link is connected to something other than another processor. For example, large servers with more than 8 sockets contain third-party node controllers that "glue" together 2- or 4-socket building blocks.

Please note that in PCM there is a difference between the reported incoming and outgoing traffic: the incoming traffic reports only data, while the outgoing traffic reports data + snoops. Nevertheless, the numbers are obviously bogus if the counter is broken. We need to check this.

Kind regards

Thomas

Ilya_M_1
Beginner

Thomas,

Thanks for the clarification!
My understanding was that with multiple nodes a mesh topology is used.

Cheers,
Ilya

 

McCalpinJohn
Honored Contributor III

The most common 4-socket systems (using Xeon E5-46xx v1/v2/v3 parts) do use a directly connected mesh (which is also a ring in this simple case). In these cases the transmit traffic from one chip *should* match the received traffic on the "downstream" chip. There are a number of counter groups available, with several sub-masks each, so it is possible to set up the counters so that they count the same traffic in each direction. Since the traffic is supposed to be the same, this configuration would waste half of the available counters, so it makes sense to define different events for the transmit and receive directions (as Thomas Willhalm mentioned).

There are some less common systems that use "bridge" chips (for example the SGI UV systems), and there are some systems that have a non-processor chip (such as an FPGA) in one of the sockets.  In either of these cases you definitely want the flexibility of being able to count any of the events in either the inbound or outbound directions.
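(If you want to experiment with this on Linux without PCM, and assuming your kernel exposes the uncore QPI boxes with the drs_data/ncb_data event aliases, something along these lines counts the outgoing data flits on QPI port 0 of each socket:

# perf stat -a -e uncore_qpi_0/drs_data/,uncore_qpi_0/ncb_data/ sleep 10

Each DRS/NCB data flit carries 8 bytes, so multiplying the sum by 8 gives an estimate of the outgoing data traffic. The alias names and their availability vary by kernel version, so treat this only as a starting point.)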

Ilya_M_1
Beginner

I see, thanks John!
