Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

[PCM] QPI traffic reported all zeros

Thomas_W_Intel
Employee

Zheng L. posted:

Hello everyone, I am trying to get some data from an Intel Xeon E5-2687W by using PCM. Because of the nature of the project, we are mainly interested in finding out how multiple threads using QPI to read from the PCI card may affect the system. However, the incoming QPI data traffic is always reported as 0, and the outgoing data traffic is always 0 as well. I also get some other readings that look odd. Is it possible that the readings are wrong?

I also have two screenshots of this, but I find that the forum will not let me upload pictures. Is there any way I can upload them so someone can help me analyze the readings?

Thank you very much.

Could it be that you have a second instance of PCM running, or perhaps one that was not shut down cleanly?
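
In case it helps, here is a minimal sketch (assuming Linux and that the PCM binary's process name contains "pcm", e.g. pcm.x) that scans /proc for another PCM-like process. A stale instance can leave the uncore counters programmed, which would explain zero or nonsensical QPI readings:

# check_pcm_instances.py -- minimal sketch; assumes Linux /proc and that the PCM
# binary's process name contains "pcm" (e.g. pcm.x, pcm-numa.x). The name pattern
# is an assumption, not an official list.
import os

def find_pcm_processes():
    """Return (pid, name) pairs for running processes whose name contains 'pcm'."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join("/proc", entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the process exited while we were scanning
        if "pcm" in name.lower():
            matches.append((int(entry), name))
    return matches

if __name__ == "__main__":
    procs = find_pcm_processes()
    if procs:
        for pid, name in procs:
            print(f"possible PCM instance: pid={pid} name={name}")
    else:
        print("no other PCM-like process found")

If this finds a leftover instance, stopping it and then re-running PCM should let it reprogram the counters cleanly.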

 

Mrunal_G_
Beginner

Hi John,

Thank you very much for the detailed explanation. It is really very helpful, and I appreciate it a lot.

The snippet of remote and local accesses I showed was just a sample. The real accesses from my workload are shown in the table below (a back-of-the-envelope conversion sketch follows the table). With 48 threads executing on all 4 sockets, I get a bandwidth of around 10 GB/s for local accesses and 20 GB/s for remote accesses, with local accesses around 25 M and remote accesses around 50 M. The theoretical QPI bandwidth is 32 GB/s.

Core   IPC   Instructions   Cycles   Local DRAM Accesses   Remote DRAM Accesses
   0   1.30        518 M     400 M                 478 K                 1488 K
   1   1.57        600 M     382 M                 549 K                 1732 K
   2   1.36        444 M     326 M                 956 K                  737 K
   3   1.32        508 M     384 M                 473 K                 1460 K
   4   1.39        461 M     331 M                 944 K                  815 K
   5   1.30        487 M     373 M                 470 K                 1381 K
   6   1.13        342 M     303 M                 458 K                  847 K
   7   1.27        473 M     374 M                 387 K                 1412 K
   8   1.28        378 M     295 M                 847 K                  592 K
   9   1.48        558 M     376 M                 998 K                 1124 K
  10   1.47        542 M     370 M                 918 K                 1143 K
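
For reference, the back-of-the-envelope conversion from access counts to bandwidth looks roughly like this (the 64-byte cache line per access and the 0.16 s sampling interval are assumptions for illustration, not values read from the PCM output):

# dram_bw.py -- back-of-the-envelope conversion of DRAM access counts to bandwidth.
# Assumption: each counted access moves one 64-byte cache line; the interval is
# whatever time window the counts were accumulated over.
CACHE_LINE_BYTES = 64

def bandwidth_gb_per_s(accesses, interval_seconds):
    """Accesses counted over interval_seconds, reported in GB/s (1 GB = 1e9 bytes)."""
    return accesses * CACHE_LINE_BYTES / interval_seconds / 1e9

# Hypothetical example: 25 M local accesses accumulated over a 0.16 s interval
# would correspond to roughly 10 GB/s of local traffic.
print(f"{bandwidth_gb_per_s(25e6, 0.16):.1f} GB/s")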

 

McCalpinJohn
Honored Contributor III

Remote memory bandwidth on the Xeon E5-46xx (v1 "Sandy Bridge") is discussed in a forum thread at

https://software.intel.com/en-us/forums/topic/383121

I did not run cases with combinations of remote and local accesses, but I did see that remote bandwidths were in the range of ~4.4 GB/s for data on the nearest neighbor chip and ~3.8 GB/s for data on the chip in the opposite corner of the square topology.

Your result of 20 GB/s for remote accesses corresponds to ~5 GB/s per socket, which is pretty close to the ~4.4 GB/s socket-to-socket bandwidth that I saw.  This suggests that you are running into whatever "feature" is limiting sustained socket-to-socket bandwidth across the QPI interfaces, but as far as I know Intel has provided no explanation of why the sustained data bandwidth across QPI is such a low fraction of the theoretical peak data bandwidth.  Even the 2-socket Xeon E5 v1 boxes have relatively low QPI bandwidth efficiency, but this is much improved in Xeon E5 v3 when "home snoop" is enabled.

Unfortunately since I don't understand what is limiting sustained QPI data transfer bandwidth, I don't know how to look for a "signature" of this limiter using various uncore performance counters.  My guess is that there are not enough buffers for some class of transaction, but I have not been able to find anything specific for either the 2-socket or 4-socket boxes.
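
For a rough sense of scale, assuming each QPI link on these parts runs at 8 GT/s and carries 2 bytes of data per transfer per direction (about 16 GB/s of peak data bandwidth per link per direction, before any protocol overhead), the sustained numbers above work out to roughly:

# qpi_efficiency.py -- rough efficiency estimate for sustained remote bandwidth.
# Assumptions: 8 GT/s QPI links moving 2 bytes of data per transfer per direction,
# i.e. ~16 GB/s peak data bandwidth per link per direction (protocol/flit overhead
# is ignored, so the truly achievable data bandwidth is somewhat lower).
QPI_GT_PER_S = 8.0
BYTES_PER_TRANSFER = 2.0
peak_per_link = QPI_GT_PER_S * BYTES_PER_TRANSFER   # GB/s, per direction, per link

observed = {"nearest neighbor": 4.4, "opposite corner": 3.8}  # GB/s, from the measurements above
for label, gbps in observed.items():
    print(f"{label}: {gbps:.1f} GB/s = {100 * gbps / peak_per_link:.0f}% of one 16 GB/s link")

That is, roughly a quarter of the peak data rate of a single link, which is why I describe the sustained QPI data bandwidth as a low fraction of the theoretical peak.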

McCalpinJohn
Honored Contributor III

Quick follow-up on my note.

Using version 2.3 of the "Intel Memory Latency Checker", I ran a set of tests on a Xeon E5-4650 system to see how the remote memory access types influenced the remote memory bandwidth.

Access Type                    Local GB/s       1 hop Avg GB/s      2 hop Avg GB/s
----------------------------------------------------------------------------------
All Reads                        ~29.0              ~4.4                ~3.8
2 Reads + 1 Write                ~28.8              ~6.0                ~5.1
3 Reads + 1 Write                ~28.6              ~5.9                ~5.1
1 Read + 1 Write                 ~29.9              ~5.5                ~4.3
----------------------------------------------------------------------------------
2 Reads + 1 streaming Write      ~21.4              ~4.4                ~3.8
1 Read + 1 streaming Write       ~19.2              ~4.4                ~3.7
----------------------------------------------------------------------------------

This suggests that an aggregate of 20 GB/s of remote bandwidth across four sockets (5 GB/s per socket) is right in the middle of the range of expected sustainable values.

This system can sustain much higher bandwidths from local memory, so any modifications to increase local accesses should help performance.

The Xeon E5-46xx systems provide the ability to configure very high memory capacity, but I suspect that you would need to go to the Xeon E7 systems to get both very high capacity and high remote bandwidth.  (I have not tested this, since I don't have a recent Xeon E7 for testing, but they have more QPI links, so I expect the aggregate remote bandwidth to be higher.)

Thomas_W_Intel
Employee

On an Intel(R) Xeon(R) CPU E7-4890 v2, I measured:

Intel(R) Memory Latency Checker - v2.3

Command line parameters: --bandwidth_matrix

Using buffer size of 30.000MB
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using Read-only traffic type
        Memory node
 Socket      0       1       2       3
     0  35823.2 12480.5 12649.8 12643.5
     1  12455.3 35914.9 12661.3 12649.4
     2  12434.5 12629.5 35917.5 12657.9
     3  12441.2 12649.1 12658.5 35874.3

As always, your mileage might vary with DIMM type and population.
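
If anyone wants to replay the arithmetic, here is a small sketch that summarizes local versus remote bandwidth from the matrix above (the values are typed in by hand from the output, not parsed from a live MLC run):

# mlc_matrix_summary.py -- summarize local vs. remote bandwidth from an MLC
# --bandwidth_matrix result. The numbers below are the E7-4890 v2 values from
# the output above, entered by hand in MB/s (1 MB/s = 1,000,000 bytes/sec).
matrix = [
    [35823.2, 12480.5, 12649.8, 12643.5],
    [12455.3, 35914.9, 12661.3, 12649.4],
    [12434.5, 12629.5, 35917.5, 12657.9],
    [12441.2, 12649.1, 12658.5, 35874.3],
]

n = len(matrix)
local = [matrix[i][i] for i in range(n)]
remote = [matrix[i][j] for i in range(n) for j in range(n) if i != j]

print(f"average local  bandwidth: {sum(local) / len(local) / 1000:.1f} GB/s")    # ~35.9 GB/s
print(f"average remote bandwidth: {sum(remote) / len(remote) / 1000:.1f} GB/s")  # ~12.6 GB/s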

McCalpinJohn
Honored Contributor III

Thanks, Thomas!

The data I pulled was from one of the "largemem" nodes on the TACC Stampede system.  These are loaded with 1 TiB of RAM composed of 32 quad-rank 1.35V 32 GiB DIMMs.   Putting 2 quad-rank DIMMs on a channel is definitely going to reduce the DRAM channel frequency, so this is not a maximum bandwidth configuration even for the Xeon E5-4600.  (The DRAM configuration will only have a significant impact on the local bandwidth -- the remote bandwidth should be approximately the same for any reasonable DRAM configuration.)

I am glad to see that my prediction about the increased remote bandwidth on the Xeon E7 family is correct!

Mrunal_G_
Beginner

Thanks Thomas.

Thank you very much, John, for the details. I am citing a paper here on NUMA-aware parallelism in HyPer, a state-of-the-art database system: http://www-db.in.tum.de/~leis/papers/morsels.pdf

They use two 4-socket machines, one of which is a 4-socket Nehalem-EX (Intel Xeon X7560 at 2.3 GHz). In Table 1 of the paper they quote a peak local bandwidth of 82.6 GB/s for an analytical database workload (TPC-H benchmark, Query 1), against a theoretical maximum bandwidth of 100 GB/s.

I tried looking at the processor specs to check the theoretical bandwidth, but the processor is discontinued. In any case, as per your comments, I take the 100 GB/s to be the cumulative bandwidth across all four sockets, which comes to around 25 GB/s per socket; otherwise it would not be consistent with the per-socket numbers you and Thomas measured.

The paper also gives some details on the percentage of data accessed over QPI, etc., for all 22 analytical queries in the TPC-H benchmark.

I thought of bringing it to your notice in case you find something interesting in their experiments.

 

Thomas_W_Intel
Employee

Yes, you are right. If you divide the overall traffic by 4, you get approximately the memory traffic that you can get on a single socket (as measured by MLC).

The Intel Xeon X7560 processor ("Nehalem-EX") was followed by the Intel Xeon E7 processors ("Westmere-EX"), which were in turn followed by the Intel Xeon E7 v2 processors ("Ivy Bridge-EX"). The latter are the ones I was running MLC on.

Mrunal_G_
Beginner

Thank you, Thomas, for the quick reply. The processor I have is the Intel® Xeon® Processor E5-4657L v2.

The Intel ARK page states the following bandwidth figures. I was wondering: if the earlier processor had only ~25 GB/s of memory bandwidth per socket, does the processor I have then have ~60 GB/s of memory bandwidth per socket and 16 GB/s of QPI bandwidth per socket? Is this correct?

Max Memory Bandwidth: 59.7 GB/s

Intel® QPI Speed: 8 GT/s
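
The arithmetic I am basing this question on is roughly the following (the 4 DDR3-1866 channels per socket and the 2 bytes of data per QPI transfer per direction are my assumptions from reading around, not figures stated on the ARK page):

# bw_sanity_check.py -- arithmetic behind the two ARK numbers for the E5-4657L v2.
# Assumptions (mine, not from ARK): 4 DDR3 memory channels at 1866 MT/s with an
# 8-byte data bus each, and QPI carrying 2 bytes of data per transfer per direction.
ddr3_mt_per_s = 1866           # DDR3-1866 transfer rate
channels = 4                   # memory channels per socket (assumed)
bytes_per_ddr_transfer = 8     # 64-bit data bus per channel

mem_bw_gb_per_s = ddr3_mt_per_s * 1e6 * bytes_per_ddr_transfer * channels / 1e9
print(f"peak memory bandwidth per socket: {mem_bw_gb_per_s:.1f} GB/s")   # ~59.7 GB/s

qpi_gt_per_s = 8
bytes_per_qpi_transfer = 2     # assumed data payload per transfer, per direction
qpi_bw_gb_per_s = qpi_gt_per_s * bytes_per_qpi_transfer
print(f"peak QPI data bandwidth per link, per direction: {qpi_bw_gb_per_s} GB/s")  # 16 GB/s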