How to saturate the QPI link

Doru_Adrian_Thom_P_ · 09-19-2016

Hello.

I have a question regarding the QPI link.

The documentation says the processor's QPI runs at 9.6 GT/s. How does that translate to GB/s? Does it mean 9.6 GT/s x 2 bytes = 19.2 GB/s?

If the processor has 2 QPI links, does that mean the bandwidth to the other socket is 2 x 19.2 GB/s, or that I can transfer 19.2 GB/s in each direction at the same time?

I am assuming that 19.2 GB/s is the theoretical bandwidth of the QPI link...

How much can one actually achieve? 6 GB/s or what?

Thanks,
Thom

McCalpinJohn · 09-21-2016

The achievable bandwidth on QPI depends on (at least):

The processor generation (v1, v2, v3, v4)
The "snoop mode" of the processors (applicable to v2, v3, v4)
The traffic type (reads, writes, streaming stores, unidirectional, bidirectional, etc.)
The core, uncore, and DRAM frequencies active at the time of execution for each processor socket
The C-state configuration of the processors (especially C1E)

The "Intel Memory Latency Checker" will run local and remote bandwidth tests using a variety of traffic types. It does not report the "snoop mode" of the system, so you will have to figure that out another way.

For the default "bandwidth matrix" test (all reads), I see about 62-63 GB/s for local accesses (on each socket) for either "Early Snoop" or "Home Snoop" modes on all of the processors I tested (see list below) except for the Xeon E5-2667 v3. (The Xeon E5-2667 v3 has only one "Home Agent" to handle all four DRAM channels, which reduces the bandwidth to about 52.9 GB/s for local read accesses on this test.)

For "Home Snoop" mode, all of these systems deliver very close to 30.6 GB/s for remote accesses. This is for one socket doing only reads from the other socket, while the cores on the "remote" socket are not accessing memory. The peak bandwidth between sockets is 9.6 GT/s * 2 bytes per transfer per link * 2 links = 38.4 GB/s, but at least 1/9 of that is required for data headers, leaving a "peak payload bandwidth" of 34.1 GB/s. The observed 30.6 GB/s is just shy of 90% of the peak payload bandwidth, which is an excellent result.
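The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch of the calculation: the constants are the figures quoted in this post, and the function and variable names are made up for illustration.

```python
# Figures quoted in the post; names are illustrative, not from any Intel tool.
TRANSFER_RATE_GT_S = 9.6           # QPI transfer rate in giga-transfers per second
BYTES_PER_TRANSFER = 2             # data bytes moved per transfer on one link
NUM_LINKS = 2                      # QPI links connecting the two sockets
HEADER_FRACTION = 1 / 9            # minimum share of raw bandwidth used by headers
OBSERVED_REMOTE_READ_GB_S = 30.6   # measured remote read bandwidth in Home Snoop mode


def raw_bandwidth_gb_s(rate_gt_s, bytes_per_transfer, links):
    """Raw link bandwidth in GB/s: transfers/s times bytes/transfer times links."""
    return rate_gt_s * bytes_per_transfer * links


def payload_bandwidth_gb_s(raw_gb_s, header_fraction):
    """Bandwidth left for data once the header overhead is subtracted."""
    return raw_gb_s * (1 - header_fraction)


raw = raw_bandwidth_gb_s(TRANSFER_RATE_GT_S, BYTES_PER_TRANSFER, NUM_LINKS)
payload = payload_bandwidth_gb_s(raw, HEADER_FRACTION)
efficiency = OBSERVED_REMOTE_READ_GB_S / payload

print(f"raw peak:        {raw:.1f} GB/s")
print(f"payload peak:    {payload:.1f} GB/s")
print(f"observed / peak: {efficiency:.1%}")
```

Running it reproduces the 38.4 GB/s and 34.1 GB/s figures and puts the observed 30.6 GB/s at a little under 90% of the payload peak.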

For "Early Snoop" mode, the remote read performance drops significantly. We try not to run our systems in this mode, so I don't have as many results, but I see 19.0-19.7 GB/s on a Xeon E5-2680 v3 and 24.7-25.1 GB/s on a Xeon E5-2697 v3. Unlike the case with "Home Snoop", the performance here appears strongly dependent on the processor frequency. Since latency is fairly strongly dependent on processor core and uncore frequency in these configurations, I interpret this as an indication that not enough buffers are allocated for remote memory accesses in "Early Snoop" mode. (There are other possible explanations, but this gets to a level of detail that Intel does not typically discuss in public.)
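One way to see why a fixed buffer allocation would make bandwidth track latency is Little's Law: sustained bandwidth is roughly the number of bytes in flight divided by the latency. The sketch below illustrates that relationship only. The buffer count and latencies are hypothetical placeholders, not measured or documented values, and the function name is made up for this example.

```python
CACHE_LINE_BYTES = 64  # cache line size on these processors


def concurrency_limited_bandwidth_gb_s(outstanding_lines, latency_ns):
    """Little's Law: bytes in flight divided by latency.

    Bytes per nanosecond is numerically equal to GB/s.
    """
    return outstanding_lines * CACHE_LINE_BYTES / latency_ns


# HYPOTHETICAL inputs, chosen only to show the trend. They are not
# measurements and not documented buffer counts.
HYPOTHETICAL_BUFFERS = 48
HYPOTHETICAL_LATENCIES_NS = (120.0, 150.0)

for latency_ns in HYPOTHETICAL_LATENCIES_NS:
    bw = concurrency_limited_bandwidth_gb_s(HYPOTHETICAL_BUFFERS, latency_ns)
    print(f"{HYPOTHETICAL_BUFFERS} lines in flight at {latency_ns:.0f} ns -> {bw:.1f} GB/s")
```

With the number of buffers held constant, any increase in latency (for example from lower core or uncore frequency) lowers the sustainable bandwidth in proportion, which is the pattern described above for "Early Snoop".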

The performance characteristics for other combinations of read and write traffic all have their own stories, but in every case "Early Snoop" delivers much lower bandwidth than "Home Snoop".

Xeon E5 v1 supports only "Early Snoop". Xeon E5 v2 has some small improvements to "Early Snoop" performance, but also adds the "Home Snoop" option, which increases performance by a lot more. Xeon E5 v3 results are above. I don't have any Xeon E5 v4 nodes to test, but these come with some new snoop options, so the story will become more complex...

List of Xeon E5 v3 processors tested: Xeon E5-2667 v3 (8c, 3.2 GHz), Xeon E5-2660 v3 (10c, 2.6 GHz), Xeon E5-2680 v3 (12c, 2.5 GHz), Xeon E5-2690 v3 (12c, 2.6 GHz), Xeon E5-2697 v3 (14c, 2.6 GHz)