Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

UPI Performance Metrics meaning in pcm.x

Inspur

Dear all,

 

I have used pcm.x to analyze the performance of STREAM benchmark on a 2-S Intel IceLake 8358 server.

I found there are two performance metrics in the output, including:

  1. data traffic coming to CPU/socket through UPI links
  2. data and non-data traffic outgoing from CPU/socket through UPI links

and the output shows the values of each link as:

Intel(r) UPI data traffic estimation in bytes (data traffic coming to CPU/socket through UPI links):

               UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2
---------------------------------------------------------------------------------------------------------------
 SKT    0       14 G     14 G     14 G   |   58%    58%    58%
 SKT    1     7398 M   7398 M   7398 M   |   29%    29%    29%
---------------------------------------------------------------------------------------------------------------
Total UPI incoming data traffic:   66 G     UPI data traffic/Memory controller traffic: 0.71

Intel(r) UPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through UPI links):

               UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2
---------------------------------------------------------------------------------------------------------------
 SKT    0       16 G     16 G     16 G   |   65%    65%    64%
 SKT    1       20 G     20 G     20 G   |   80%    80%    80%
---------------------------------------------------------------------------------------------------------------
Total UPI outgoing data and non-data traffic:  109 G
I want to know which metric reflects the true bandwidth of the data transferred, so I used numactl to pin STREAM to Socket 0 (cores 0-31) while allocating its memory on NUMA node 1 (there are only two NUMA nodes):
numactl -C 0-31 -m 1 ./stream.icc
The memory bandwidth reported by pcm-memory.x is about 122 GB/s, all of which passes through Socket 1:
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):    40.87 --||-- Mem Ch  0: Reads (MB/s):  5163.79 --|
|--            Writes(MB/s):    26.92 --||--            Writes(MB/s): 10000.51 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  1: Reads (MB/s):    40.41 --||-- Mem Ch  1: Reads (MB/s):  5164.08 --|
|--            Writes(MB/s):    26.66 --||--            Writes(MB/s): 10000.85 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  2: Reads (MB/s):    41.59 --||-- Mem Ch  2: Reads (MB/s):  5164.55 --|
|--            Writes(MB/s):    27.75 --||--            Writes(MB/s): 10000.67 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  3: Reads (MB/s):    40.39 --||-- Mem Ch  3: Reads (MB/s):  5164.20 --|
|--            Writes(MB/s):    26.73 --||--            Writes(MB/s): 10000.50 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  4: Reads (MB/s):    41.33 --||-- Mem Ch  4: Reads (MB/s):  5207.03 --|
|--            Writes(MB/s):    27.24 --||--            Writes(MB/s): 10000.66 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  5: Reads (MB/s):    41.46 --||-- Mem Ch  5: Reads (MB/s):  5206.82 --|
|--            Writes(MB/s):    27.04 --||--            Writes(MB/s): 10000.82 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  6: Reads (MB/s):    41.39 --||-- Mem Ch  6: Reads (MB/s):  5205.36 --|
|--            Writes(MB/s):    27.10 --||--            Writes(MB/s): 10000.74 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  7: Reads (MB/s):    41.22 --||-- Mem Ch  7: Reads (MB/s):  5204.88 --|
|--            Writes(MB/s):    26.62 --||--            Writes(MB/s): 10000.56 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- NODE 0 Mem Read (MB/s) :   328.65 --||-- NODE 1 Mem Read (MB/s) : 41480.72 --|
|-- NODE 0 Mem Write(MB/s) :   216.07 --||-- NODE 1 Mem Write(MB/s) : 80005.32 --|
|-- NODE 0 PMM Read (MB/s):      0.00 --||-- NODE 1 PMM Read (MB/s):      0.00 --|
|-- NODE 0 PMM Write(MB/s):      0.00 --||-- NODE 1 PMM Write(MB/s):      0.00 --|
|-- NODE 0 Memory (MB/s):      544.72 --||-- NODE 1 Memory (MB/s):   121486.04 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):      41809.37                --|
|--           System DRAM Write Throughput(MB/s):      80221.39                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):      41809.37                --|
|--                System Write Throughput(MB/s):      80221.39                --|
|--               System Memory Throughput(MB/s):     122030.77                --|
|---------------------------------------||---------------------------------------|
The UPI traffic reported by pcm.x is about 77 GB (incoming data) and 124 GB (outgoing data and non-data), as follows:
Intel(r) UPI data traffic estimation in bytes (data traffic coming to CPU/socket through UPI links):

               UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2  
---------------------------------------------------------------------------------------------------------------
 SKT    0       16 G     16 G     16 G   |   63%    63%    63%   
 SKT    1     9669 M   9669 M   9669 M   |   38%    38%    31%   
---------------------------------------------------------------------------------------------------------------
Total UPI incoming data traffic:   77 G     UPI data traffic/Memory controller traffic: 0.71

Intel(r) UPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through UPI links):

               UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2  
---------------------------------------------------------------------------------------------------------------
 SKT    0       19 G     19 G     18 G   |   76%    76%    74%   
 SKT    1       22 G     22 G     22 G   |   87%    87%    72%   
---------------------------------------------------------------------------------------------------------------
Total UPI outgoing data and non-data traffic:  124 G

 

It seems the second metric (outgoing data and non-data traffic, 124 G total) is the closer match to the measured memory bandwidth. Which of the two metrics represents the true bandwidth of the data transferred over the UPI links?

McCalpinJohn

You have picked an ugly and complicated test case....

 

Whenever doing experiments of this sort, I find it useful to look at the total amount of data moved, rather than the data rates.  I typically run STREAM twice, once configured for 10 iterations (NTIMES=10) and once configured for 20 iterations (NTIMES=20), then take the difference between the counts as the approximate traffic associated with 10 iterations.
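A minimal sketch of that differencing arithmetic (the array size and the two byte totals below are hypothetical placeholders, not measured values; in practice they come from the pcm / pcm-memory totals recorded for each full run):

```python
# Sketch of the "run twice and difference the counts" method.
# STREAM_ARRAY_SIZE and the measured totals are hypothetical example values.

STREAM_ARRAY_SIZE = 80_000_000   # elements per array (example build setting)
BYTES_PER_ELEMENT = 8            # double precision

def expected_bytes_per_iteration(arrays_moved: int = 10) -> int:
    """With streaming stores, one pass of the four STREAM kernels moves
    10 arrays (6 read + 4 written) through the memory controllers."""
    return arrays_moved * STREAM_ARRAY_SIZE * BYTES_PER_ELEMENT

# Hypothetical measured totals from two runs of the same binary:
total_bytes_ntimes_20 = 170e9    # run built with -DNTIMES=20
total_bytes_ntimes_10 = 106e9    # run built with -DNTIMES=10

# The difference isolates the traffic of exactly 10 iterations, cancelling
# out initialization, validation, and other one-time traffic common to both.
traffic_of_10_iterations = total_bytes_ntimes_20 - total_bytes_ntimes_10
per_iteration = traffic_of_10_iterations / 10
```

The per-iteration figure can then be compared directly against the expected bytes per iteration to see whether the counters are accounting for all of the traffic.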

 

Whenever one is working with performance measurements related to STREAM, it is critical to record & report whether the code was compiled to use streaming stores.  This has become even more confusing with Ice Lake -- the processor will convert "regular" stores to "streaming" stores if the load on the memory subsystem is high enough, while compiling with "-qopt-streaming-stores always" forces streaming-store behavior.

 

Starting with Skylake Xeon, Intel's 2-socket servers use "memory directory mode" for inter-socket cache coherence.  This reduces the average latency of loads under conditions of low utilization and reduces coherence traffic on the UPI links under conditions of high utilization, but it has the side effect of requiring an extra DRAM write on the home node whenever a remote read obtains data in Exclusive or Modified state.  This is discussed a little bit at https://community.intel.com/t5/Software-Tuning-Performance/SKL-strange-memory-behavior/td-p/1142144, and in other forum discussions that I can't find right now.

 

When configured with streaming stores, each iteration of the four STREAM kernels results in 6 arrays being read and 4 arrays being written.  When the reads are all remote (as in your case), the memory controller will *write* the 6 arrays that are read in order to update the memory directory bits.  This changes the DRAM read:write ratio from 6:4 (1.5R:1W) to 6:10 (0.6R:1W or 1.67W:1R).  Your results show an even higher ratio of writes to reads (about 1.9W:1R), which will eventually need explaining.

 

When configured with "ordinary" stores, each iteration of the four STREAM kernels results in 6 source arrays being read, 4 target arrays being read, and 4 target arrays being written back.  When data is all remote (as in your case), the "home" node memory controller will have to *write* the 6 arrays being read and the 4 arrays being written in order to update the memory directory bits.  This is in addition to the write back of the 4 target arrays (which comes much later -- after the data has percolated through the caches and been victimized from the last-level cache).    This gives a DRAM read:write ratio of (10R:14W), which is much lower than you are seeing.
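The accounting in the two paragraphs above can be checked with a few lines of arithmetic (units are "arrays moved per iteration"; the assumptions are exactly those stated: all data homed on the remote socket, memory-directory mode active):

```python
# Expected DRAM read/write accounting per STREAM iteration under
# memory-directory mode, with all arrays homed on the remote socket.

def dram_traffic(streaming_stores: bool) -> tuple[int, int]:
    """Return (reads, writes) in arrays per iteration at the home
    memory controller."""
    src_reads = 6        # source arrays read across the four kernels
    tgt_writes = 4       # target arrays written
    if streaming_stores:
        reads = src_reads                         # 6 remote reads
        # The home memory controller re-writes each remotely-read array
        # to update the directory bits, on top of the 4 data writes:
        writes = tgt_writes + src_reads           # 4 + 6 = 10
    else:
        # Target arrays are also read (for ownership) before being written:
        reads = src_reads + tgt_writes            # 6 + 4 = 10
        # Directory updates for all 10 remotely-read arrays, plus the
        # eventual write-back of the 4 target arrays from the caches:
        writes = (src_reads + tgt_writes) + tgt_writes   # 10 + 4 = 14
    return reads, writes
```

With streaming stores this gives 6 reads to 10 writes (0.6R:1W); with ordinary stores, 10 reads to 14 writes, matching the ratios quoted above.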

 

By looking at the total amount of traffic driven by an execution of the code, rather than just the ratios of the bandwidths, it should be possible to compare your observations to the total traffic counts expected.

 

Note that the re-writing of the cache lines to update the memory directories is going to result in an inconsistency between write data volumes on the UPI links (which should always carry 4 arrays of writes from socket 0 to socket 1) and write data volumes at the memory controller (which will include the "real" data writes plus additional writes to update the memory directories).
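That inconsistency can be made concrete with the same per-iteration units, for the streaming-store case (assumptions as above; this is an illustration of the accounting, not a measured result):

```python
# Per-iteration write accounting (streaming-store case), contrasting what
# the UPI links carry with what the home memory controller performs.
# Units are arrays per iteration.

SRC_ARRAYS_READ = 6      # remotely-read source arrays
TGT_ARRAYS_WRITTEN = 4   # target arrays written to the home node

# The UPI links carry only the real write data (socket 0 -> socket 1):
upi_write_arrays = TGT_ARRAYS_WRITTEN                   # 4

# The home memory controller additionally re-writes every remotely-read
# array to update its directory bits, inflating its write count:
mc_write_arrays = TGT_ARRAYS_WRITTEN + SRC_ARRAYS_READ  # 10
```

So the memory-controller write volume should exceed the UPI write-data volume by the size of the 6 source arrays per iteration.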

 

I have not done much testing on the UPI traffic in two-socket processors with all three links enabled, but your results show a remarkably large inconsistency between UPI 0/1 (which are very similar) and UPI 2.  If this is real, it will complicate the performance analysis, but should not impact the bulk data traffic accounting....

 
