I'm not able to achieve more than 12 GB/s of throughput when accessing memory across the QPI link. My understanding is that a 20-lane 6.4 GT/s link should provide 12.8 GB/s per direction, for a total of 25.6 GB/s counting both directions. So I seem to be getting about half.
The scenario: threads on socket A access memory allocated on socket B. I've enforced this in two different ways, with the same result both times: 1) use numactl to bind threads to node-0 and memory allocations to node-1 (numactl --cpunodebind=0 --membind=1); 2) use KMP_AFFINITY to place 8 threads compactly (no HT), have all threads allocate memory, then have threads 4-7 access the memory allocated by threads 0-3 (while threads 0-3 wait for threads 4-7 at a barrier).
Each thread reads 2 arrays, adds them element-wise, and writes the result back to the same arrays, so there are equal numbers of loads and stores. Compiled with icc; OpenMP used for threading. I see the same throughput for arrays of 16 MB and larger.
Hardware: SuperMicro X8DTG-DF motherboard, dual X5550, QPI should be 6.4 GT/s, and each socket has two DIMM slots populated in each of its three channels (so 3x2 DIMMs per socket) at 1066 MHz.
What's even stranger is that when I modify case (2) above so that threads 0-3, instead of idling, simultaneously access memory allocated by threads 4-7, I see throughput nearly double to 19 GB/s (which, coincidentally, is very close to the single-socket throughput in my case).
Am I missing something, or are the 20 lanes partitioned between the two sockets (10 for each controller)?
Thanks for the reply. I should have been clearer in the description of my case - I cannot get more than 12 GB/s even when communication across QPI is happening in both directions. Here are the two cases I'm observing:
1) Socket A reads and writes memory allocated on socket B while socket B idles: at most 12 GB/s.
2) Both sockets read and write each other's memory (A accesses B's memory, B accesses A's): up to 19 GB/s.
I understand (2) behaves as expected - I'm getting ~75% of the theoretical bandwidth. It's case (1) that perplexes me, since I think I should be able to get 19 GB/s there as well - QPI sees traffic in both directions. In fact, both cases generate duplex QPI traffic; the difference is that communication is initiated by one socket in (1), whereas in (2) both sockets initiate communication.
You might try the Intel Performance Counter Monitor to estimate both memory controllers' bandwidth and the QPI link data throughput. The tool measures these metrics in real time while your program runs. Alternatively, you can instrument your program by calling its API to measure the metrics directly in your code.