PCM for QPI link utilization understanding

Piotr_T_ · ‎03-17-2016

Hi,
I'm looking for help interpreting counter values for the QPI links.
We are running a VNF system on a dual-socket server with Intel Xeon E5-2680 v3 CPUs @ 2500 MHz.
Currently the VNF uses PCI passthrough to access the 10G network interfaces. Unfortunately two of them are attached to CPU0 and the other two to CPU1, so traffic has to cross QPI. As traffic increases, we see QPI link utilization increase.
At some point we hit the limit of this solution, but we are not sure whether it comes from QPI link utilization or something else (it could also be our load generator).
Here is the output when everything is still OK:

# ./pcm.x 5 -i=1 -nc 2>/dev/null

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
 L3MISS: L3 cache misses
 L2MISS: L2 cache misses (including other core's L2 cache *hits*)
 L3HIT : L3 cache hit ratio (0.00-1.00)
 L2HIT : L2 cache hit ratio (0.00-1.00)
 L3MPI : number of L3 cache misses per instruction
 L2MPI : number of L2 cache misses per instruction
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)
 L3OCC : L3 occupancy (in KBytes)
 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
 energy: Energy in Joules


 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI |  L3OCC | TEMP

---------------------------------------------------------------------------------------------------------------
 SKT    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan    24576     14
 SKT    1     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     5280     42
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     N/A      N/A

 Instructions retired:    0   ; Active cycles:    0   ; Time (TSC):   12 Gticks ; C0 (active,non-halted) core residency: 0.00 %

 C1 core residency: 68.48 %; C3 core residency: 0.31 %; C6 core residency: 31.21 %; C7 core residency: 0.00 %;
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : -1.00 => corresponds to -25.00 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.00 => corresponds to 0.00 % core utilization over time interval
---------------------------------------------------------------------------------------------------------------

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

               QPI0     QPI1    |  QPI0   QPI1
---------------------------------------------------------------------------------------------------------------
 SKT    0       24 G     24 G   |   25%    25%
 SKT    1       25 G     25 G   |   26%    26%
---------------------------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic:  100 G

          |  READ |  WRITE | CPU energy | DIMM energy
---------------------------------------------------------------------------------------------------------------
 SKT   0    54.96    29.68     610.01      68.19
 SKT   1     0.10     0.08     221.02      23.23
---------------------------------------------------------------------------------------------------------------
       *    55.05    29.76     831.04      91.42

Here we are starting to see performance degradation:

# ./pcm.x 5 -i=1 -nc 2>/dev/null

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
 L3MISS: L3 cache misses
 L2MISS: L2 cache misses (including other core's L2 cache *hits*)
 L3HIT : L3 cache hit ratio (0.00-1.00)
 L2HIT : L2 cache hit ratio (0.00-1.00)
 L3MPI : number of L3 cache misses per instruction
 L2MPI : number of L2 cache misses per instruction
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)
 L3OCC : L3 occupancy (in KBytes)
 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
 energy: Energy in Joules


 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI |  L3OCC | TEMP

---------------------------------------------------------------------------------------------------------------
 SKT    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan    25488     15
 SKT    1     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     5376     42
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     N/A      N/A

 Instructions retired:    0   ; Active cycles:    0   ; Time (TSC):   12 Gticks ; C0 (active,non-halted) core residency: 0.00 %

 C1 core residency: 67.21 %; C3 core residency: 0.31 %; C6 core residency: 32.48 %; C7 core residency: 0.00 %;
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : -1.00 => corresponds to -25.00 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.00 => corresponds to 0.00 % core utilization over time interval
---------------------------------------------------------------------------------------------------------------

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

               QPI0     QPI1    |  QPI0   QPI1
---------------------------------------------------------------------------------------------------------------
 SKT    0       25 G     26 G   |   26%    26%
 SKT    1       19 G     19 G   |   19%    19%
---------------------------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic:   91 G

          |  READ |  WRITE | CPU energy | DIMM energy
---------------------------------------------------------------------------------------------------------------
 SKT   0    61.28    36.31     609.89      72.08
 SKT   1     0.10     0.08     215.69      22.61
---------------------------------------------------------------------------------------------------------------
       *    61.38    36.38     825.57      94.69

McCalpinJohn · ‎03-17-2016

It is a little surprising to see no activity on the CPUs -- I thought VNF involved a software component?

How are you measuring the performance degradation? How much is the performance degrading? Is the relative performance degradation approximately the same fraction as the change in the QPI or DRAM traffic?

One thing that is clear from the results is that all of the memory accesses are to data on socket 0. Given that 2 of the network adapters are attached to socket 1, it may be worthwhile to look into spreading the data across the sockets.
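
To see which socket each PCI device (including the passed-through adapters) hangs off, one option on a Linux host is to read the numa_node attribute that sysfs exposes per PCI device. The following is only a minimal sketch: the function name and the sysfs_root parameter are made up for illustration, and it assumes the kernel and firmware report NUMA affinity at all (a value of -1 means they do not).

```python
from pathlib import Path

def pci_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> NUMA node for every device that reports one.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real Linux host the default is the usual path.
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # not Linux, or sysfs not mounted
    for dev in sorted(root.iterdir()):
        try:
            nodes[dev.name] = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this device
    return nodes

if __name__ == "__main__":
    for addr, node in pci_numa_nodes().items():
        print(addr, node)  # -1 means the platform did not report a node
```

Comparing the node of each adapter with the node of the cores and memory the VNF is pinned to shows which flows are forced across QPI.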

The memory bandwidth required from socket 0 is not high -- in the first case you are moving about 85 GB in 5 seconds. The average of 17 GB/s is only about 25% of the 68.3 GB/s peak bandwidth of an optimally configured Xeon E5-2680 v3 (4 channels of DDR4/2133 -- with one or two single- or dual-rank DIMMs per channel). In the second case the traffic is about 15% higher (97 GB or just under 20 GB/s), but still comfortably below the sustainable DRAM bandwidth.
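
As a rough check of the arithmetic above, here is a small Python sketch. The READ and WRITE figures are copied from the socket 0 rows of the two PCM outputs, the 5 second interval comes from the pcm.x command line, and the peak assumes 4 populated channels of DDR4/2133 at 8 bytes per transfer. The function and variable names are illustrative only.

```python
# Rough check of the DRAM bandwidth arithmetic (illustrative names only).

INTERVAL_S = 5.0            # sampling interval passed to pcm.x
CHANNELS = 4                # assumption: all 4 DDR4 channels populated
TRANSFERS_PER_S = 2133e6    # DDR4/2133
BYTES_PER_TRANSFER = 8      # 64-bit data bus per channel

def peak_dram_gbs():
    """Theoretical peak DRAM bandwidth of one socket in GB/s."""
    return CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9

def average_dram_gbs(read_gb, write_gb, interval_s=INTERVAL_S):
    """Average DRAM bandwidth in GB/s over one PCM interval."""
    return (read_gb + write_gb) / interval_s

# Socket 0 READ / WRITE (GBytes) from the two PCM outputs above.
samples = {"first run": (54.96, 29.68), "second run": (61.28, 36.31)}

for name, (read_gb, write_gb) in samples.items():
    avg = average_dram_gbs(read_gb, write_gb)
    share = 100 * avg / peak_dram_gbs()
    print(f"{name}: {avg:.1f} GB/s, {share:.0f}% of peak")
```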

The maximum unidirectional QPI traffic is 50 GB in 5 seconds, or 10 GB/s (uniformly spread across the two links). The report annotates this with "25%", which appears to be based on a peak bandwidth of 9.6 GT/s*2channels*2Bytes/channel=38.4GB/s (peak). Unfortunately the maximum sustainable data bandwidth on the QPI depends on the "snoop mode" that the processor is configured with. In "early snoop" mode the Xeon E5 v3 has slightly lower local and remote memory latency, but much lower sustained QPI bandwidth (for CPU-initiated traffic). In "home snoop" mode, the Xeon E5 v3 has slightly higher local and remote memory latency, but much higher sustained QPI bandwidth (for CPU-initiated traffic). I don't have any results for IO-initiated traffic, so I don't know if the maximum sustainable QPI bandwidth for IO follows the same trends. For CPU-initiated traffic with a 1:1 Read:Write ratio, the Intel Memory Latency checker shows sustained cross-socket bandwidth of ~25 GB/s in "early snoop" mode and sustained cross-socket bandwidth of ~50 GB/s in "home snoop" mode -- so the difference is not small.
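
The percentage that PCM prints can be reproduced, at least approximately, from the peak figure guessed at above. This is a sketch under that assumption (9.6 GT/s, 2 bytes per transfer, 2 links), not a description of how PCM computes the value internally; all names are illustrative.

```python
# Rough check of the QPI utilization arithmetic (illustrative names only).

INTERVAL_S = 5.0         # sampling interval passed to pcm.x
QPI_GT_PER_S = 9.6       # assumed transfer rate per link
BYTES_PER_TRANSFER = 2   # assumed bytes per transfer, one direction

def qpi_peak_gbs(links=2):
    """Assumed peak unidirectional QPI bandwidth in GB/s over `links` links."""
    return QPI_GT_PER_S * BYTES_PER_TRANSFER * links

def qpi_utilization(traffic_gb_per_link, interval_s=INTERVAL_S):
    """Fraction of the assumed peak used by one socket's outgoing traffic."""
    rate_gbs = sum(traffic_gb_per_link) / interval_s
    return rate_gbs / qpi_peak_gbs(len(traffic_gb_per_link))

# Outgoing traffic per link (G) in the first PCM output.
for socket, per_link in {"SKT 0": [24, 24], "SKT 1": [25, 25]}.items():
    print(f"{socket}: {100 * qpi_utilization(per_link):.0f}% of assumed peak")
```

Note that this peak is a raw link rate; the sustainable data bandwidth is lower and depends on the snoop mode, as discussed above.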

The change in QPI traffic does not match the change in DRAM traffic, suggesting that the load balance is different in the two cases. In the first case there is 98 GB of QPI traffic for 85 GB of DRAM traffic, while the second case has 89 GB of QPI traffic for 97 GB of DRAM traffic. Neither of these are nice numbers -- one would hope for some locality to make the QPI traffic lower than the DRAM traffic. More detailed analysis of the QPI traffic will depend on the "snoop mode", since the transaction types are different in "early snoop" and "home snoop" modes.
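
The comparison can be written out as a short sketch. The per-link QPI values and the DRAM totals (the "*" rows) are copied from the two PCM outputs; the per-link values are rounded to whole G in the printout, which is presumably why their sums (98 and 89) differ slightly from the totals PCM reports (100 G and 91 G). The function name is illustrative only.

```python
# QPI traffic relative to DRAM traffic for the two runs (illustrative names).

def qpi_to_dram_ratio(qpi_gb_per_link, dram_read_gb, dram_write_gb):
    """Total outgoing QPI traffic divided by total DRAM traffic."""
    return sum(qpi_gb_per_link) / (dram_read_gb + dram_write_gb)

runs = {
    # name: (per-link QPI traffic in G, total DRAM READ, total DRAM WRITE)
    "first run": ([24, 24, 25, 25], 55.05, 29.76),
    "second run": ([25, 26, 19, 19], 61.38, 36.38),
}

for name, (qpi, read_gb, write_gb) in runs.items():
    ratio = qpi_to_dram_ratio(qpi, read_gb, write_gb)
    print(f"{name}: QPI/DRAM = {ratio:.2f}")
```

A ratio above 1 means more bytes crossed QPI than were read from or written to DRAM in the same interval.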

Piotr_T_ · ‎03-18-2016

No activity on CPU - this might be due to the fact that we are isolating 11 cores of CPU0 at the host level (and of CPU1 too), and at the KVM level CPU pinning is done for the VNF (but only on CPU0).

Performance degradation is seen at our load generators: when the sent traffic is increased, at some point the received traffic drops rapidly, but I wouldn't say it matches the drop in QPI link utilization.

Spreading the data across the sockets - this is something we are trying to avoid due to the nature of our VNF. It's a multithreaded system where each thread has its own CPU core (HT), so to speed up all memory-intensive operations we decided to run it on CPU0 only and leave QPI for IO, but we can look into that too.

Thanks for such a detailed answer. Can you suggest any reading about "snoop mode"?

The whole exercise is to check step by step where the bottleneck or potential issue is; it might also be our load generator.

McCalpinJohn · ‎03-18-2016

The Xeon E5-2680 v3 supports 3 different "snoop modes" as boot-time (BIOS) options. As I mentioned above, "Early Snoop" mode has slightly lower memory latency in a 2 socket system, while "Home Snoop" has much higher sustained QPI bandwidth for CPU-initiated transactions. The latter difference is probably due to Intel's decisions about allocation of QPI buffers, rather than being intrinsic to the protocol, but there is not much documentation on these issues. The third mode (available on Xeon E5-26xx v3 processors with 10 or more cores) is called "Cluster On Die" mode, which reconfigures the memory and L3 mapping to split each chip into two "NUMA nodes", each with 1/2 the cores, 1/2 the L3, and 1/2 the DRAM channels. "Cluster on Die" mode provides the lowest latency and highest local memory bandwidth, for applications that have high memory affinity. I don't recall seeing any guidance from Intel on the impact of this mode on IO traffic, and I have not tested this mode on any of our Xeon E5-26xx v3 systems.

The most cost-effective approach to this issue is to simply boot your system in each of the three modes and test performance in each. This is not an unreasonable use of time -- your results suggest that the QPI link utilization is high enough that it could be a performance issue.

Getting a detailed understanding of the underlying cause of performance variations is often extremely difficult because it is so hard to distinguish between causes and effects. In your case the analysis may be especially difficult because of the potentially subtle interactions between CPU, IO, memory, and QPI throughput.

The Xeon E7 and E5 v3 Uncore Performance Monitoring Reference Manual provides information on performance monitors in the various components of the processor Uncore. The documentation of the events that can be monitored provides a wealth of indirect information about the protocols used. Controlled experiments using these performance counters can provide even more information, provided that you don't run away screaming because you can't tell which counters are broken and which ones are correctly reporting behavior that you don't understand.

Of course at a higher level the question is usually not "what is causing these performance issues", but rather "what can I do to improve performance"? For applications that are likely to be limited by data motion issues, it may be helpful to experiment with various BIOS options that influence snooping behavior (as mentioned above), uncore frequency, and CPU Turbo frequency. In some of my Xeon E5-26xx v3 systems there is a BIOS option to set the uncore frequency to "maximum" (as opposed to "dynamic"), which definitely helps with uncore throughput at the cost of slightly higher idle power consumption. I also typically disable BIOS control of CPU frequencies and use the OS to set the CPU frequency to the desired value. This is often enough to eliminate confusing and unpredictable performance variability.
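
Before changing anything, it can help to record what the OS currently has configured. On Linux the cpufreq subsystem exposes this per core in sysfs; the sketch below only reads those files and assumes a cpufreq driver is loaded (if BIOS keeps control of the frequencies, the directories may not exist). The function name and the cpu_root parameter are illustrative.

```python
from pathlib import Path

def cpufreq_settings(cpu_root="/sys/devices/system/cpu"):
    """Map cpu name -> (governor, min_khz, max_khz) where cpufreq is exposed.

    cpu_root is a parameter only so the function can be pointed at a
    test directory; the default is the usual Linux sysfs location.
    """
    settings = {}
    for cpufreq in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        try:
            governor = (cpufreq / "scaling_governor").read_text().strip()
            min_khz = int((cpufreq / "scaling_min_freq").read_text())
            max_khz = int((cpufreq / "scaling_max_freq").read_text())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this core
        settings[cpufreq.parent.name] = (governor, min_khz, max_khz)
    return settings

if __name__ == "__main__":
    for cpu, (governor, min_khz, max_khz) in cpufreq_settings().items():
        print(f"{cpu}: {governor}, {min_khz}-{max_khz} kHz")
```

Cores where the minimum and maximum are equal are effectively pinned to one frequency, which is the situation described above for removing frequency-related variability.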

brown__steve · ‎07-25-2018

Is there any documentation on how to interpret UPI stats?

For instance if I have 10G of UPI incoming data traffic and all 3 lanes are 7% utilized how might I interpret that?

And as for outgoing traffic, if I have 23G outgoing and all 3 lanes are 16% utilized.

Documentation or a guide would be helpful in understanding what all of this means.

Thanks!

McCalpinJohn · ‎07-27-2018

There is a lot of discussion in sections 2.6 and 2.8 of the Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual (document 336274-001, July 2017), and there is a lot of information implied in the tables in section 3.1.2 ("Reference for Intel UPI LL Packet Matching").

I have not been able to make sense of the discussion in Section 2.6.3, Figure 2-11, but I have not spent a lot of time on the UPI counters yet....