Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

QPI and DRAM Bandwidth Mismatch

GGil
Beginner

Hello,

I want to measure the performance of a simple application I wrote (string matching) when 8 instances of it (8 processes, not threads, so they are independent of each other) run on CPU B (socket 1) while the data files reside in MEM A (NUMA node 0), i.e., all data has to move across QPI. I am using an Intel Xeon E5-2630 v3, with early snoop disabled in the BIOS to get higher QPI bandwidth, HyperThreading disabled, and Turbo Boost disabled.

I implemented the string matching with mmap, hoping to mitigate OS overheads (system calls) once the page cache is "hot" (all data is in the page cache and mapped). Memory is large enough to accommodate all the files, so no swapping occurs.
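For context, the core of the program is roughly along these lines (a simplified sketch, not the actual code; the search pattern here is just a placeholder):

/* Simplified sketch of the mmap-based scan described above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const char pattern[] = "needle";   /* placeholder pattern */

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; once the page cache is hot,
     * no read() system calls are needed to walk the data. */
    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Naive scan: count occurrences of the pattern. */
    size_t plen = sizeof(pattern) - 1;
    long matches = 0;
    for (off_t i = 0; i + (off_t)plen <= st.st_size; i++)
        if (memcmp(base + i, pattern, plen) == 0)
            matches++;

    printf("%s: %ld matches\n", argv[1], matches);
    munmap((void *)base, st.st_size);
    close(fd);
    return 0;
}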

To get to this scenario, I run the processes as follows:

numactl -C 8 -m 0 STR_MTCH file.1
numactl -C 9 -m 0 STR_MTCH file.2
...
numactl -C 15 -m 0 STR_MTCH file.8
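
(To verify the placement, one can check the per-node memory usage of one of the running processes, e.g. with

numastat -p <pid of one STR_MTCH instance>

which should show essentially all of its resident pages on node 0.)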

Using PCM I see that the DRAM bandwidth of socket 0 is ~23 GB/s, whereas the QPI bandwidth is ~13 GB/s. I can't understand how this makes sense: where does the bandwidth mismatch come from? The DRAM bandwidth appears to be almost 2x the QPI bandwidth. If I multiply the DRAM bandwidth by the execution time I get the file size, which makes sense; on the other hand, the QPI maximum is ~16 GB/s if I'm not mistaken.

pcm.x output while running 8 instances of my simple program:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI |  L3OCC | TEMP

   0    0     0.00   0.44   0.01    0.72    8889      119 K    0.93    0.23    0.00    0.01     1568     58
   1    0     0.00   0.47   0.01    0.87      11 K     62 K    0.82    0.30    0.00    0.01     2240     60
   2    0     0.00   0.40   0.00    0.85    4806       42 K    0.89    0.28    0.00    0.01     1376     60
   3    0     0.00   0.56   0.01    0.91    9567       63 K    0.85    0.35    0.00    0.01     5408     61
   4    0     0.00   0.56   0.01    0.92    7507       58 K    0.87    0.36    0.00    0.01     1888     58
   5    0     0.00   0.59   0.01    0.91    7476       54 K    0.86    0.40    0.00    0.01     1536     58
   6    0     0.00   0.38   0.00    0.81    1920       26 K    0.93    0.29    0.00    0.01     1056     56
   7    0     0.00   0.54   0.00    0.93      11 K     45 K    0.75    0.34    0.00    0.01     1152     60
   8    1     0.53   0.54   0.98    1.00      14 M     16 M    0.15    0.15    0.01    0.01     2464     53
   9    1     0.53   0.54   0.98    1.00      14 M     17 M    0.15    0.15    0.01    0.01     2528     54
  10    1     0.53   0.54   0.99    1.00      14 M     17 M    0.15    0.16    0.01    0.01     2336     54
  11    1     0.53   0.54   0.99    1.00      14 M     17 M    0.15    0.15    0.01    0.01     2752     54
  12    1     0.53   0.54   0.98    1.00      14 M     17 M    0.15    0.15    0.01    0.01     2592     56
  13    1     0.53   0.54   0.98    1.00      14 M     17 M    0.15    0.16    0.01    0.01     2720     54
  14    1     0.53   0.54   0.99    1.00      14 M     17 M    0.15    0.16    0.01    0.01     2720     54
  15    1     0.53   0.54   0.98    1.00      14 M     17 M    0.15    0.15    0.01    0.01     2912     53
---------------------------------------------------------------------------------------------------------------
 SKT    0     0.00   0.50   0.01    0.85      63 K    472 K    0.87    0.31    0.00    0.01    16224     47
 SKT    1     0.53   0.54   0.99    1.00     116 M    137 M    0.15    0.15    0.01    0.01    21024     48
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.27   0.54   0.50    1.00     116 M    137 M    0.15    0.15    0.01    0.01     N/A      N/A

 Instructions retired:   10 G ; Active cycles:   19 G ; Time (TSC): 2405 Mticks ; C0 (active,non-halted) core residency: 49.60 %

 C1 core residency: 0.41 %; C3 core residency: 0.03 %; C6 core residency: 49.96 %; C7 core residency: 0.00 %;
 C2 package residency: 47.37 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 0.54 => corresponds to 13.52 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.27 => corresponds to 6.70 % core utilization over time interval
---------------------------------------------------------------------------------------------------------------

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

               QPI0    |  QPI0
---------------------------------------------------------------------------------------------------------------
 SKT    0       13 G   |   82%
 SKT    1     1991 M   |   12%
---------------------------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic:   15 G

          |  READ |  WRITE | CPU energy | DIMM energy
---------------------------------------------------------------------------------------------------------------
 SKT   0    23.34     0.84      24.17      12.11
 SKT   1     0.07     0.02      44.75       7.32
---------------------------------------------------------------------------------------------------------------
       *    23.40     0.86      68.92      19.42

 

Thanks,
   Gil.

McCalpinJohn
Honored Contributor III

A lot of the QPI traffic counters on Intel processors are broken (i.e., give incorrect results), though I seem to recall that at least some of the QPI data traffic events on Xeon E5 v3 worked correctly.   I never tested these on a single-QPI-link processor, and it is possible that either the counter or the interpretation of the counter results is not correct here.

If you want to validate the counters (and their interpretation by the pcm.x tool), I would recommend running something with well-understood memory traffic (like STREAM) in a cross-socket configuration.

I would recommend trying STREAM compiled with and without streaming stores.  The default array size of 10,000,000 is perfect for running on a single socket of the Xeon E5-2630 v3 (20 MiB L3 cache), since it meets the criterion of each array being 4x the aggregate cache size.  I would increase the NTIMES parameter to 100 to reduce the uncertainty in traffic counts related to the initial instantiation of the pages.
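To mimic your setup (cores on socket 1, data on node 0), the cross-socket run can be launched the same way you launch STR_MTCH. The compile lines below are only an example and assume the Intel compiler, which lets you force streaming stores on or off:

icc -O3 -qopt-streaming-stores=always -DSTREAM_ARRAY_SIZE=10000000 -DNTIMES=100 stream.c -o stream_nt
icc -O3 -qopt-streaming-stores=never  -DSTREAM_ARRAY_SIZE=10000000 -DNTIMES=100 stream.c -o stream_rfo

numactl --cpunodebind=1 --membind=0 ./stream_nt
numactl --cpunodebind=1 --membind=0 ./stream_rfo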

When compiled with streaming stores:

  • Read Bytes = 8 * STREAM_ARRAY_SIZE * (NTIMES * 6 + ~4)   <-- the ~4 is overhead from initialization and validation
  • Write Bytes = 8 * STREAM_ARRAY_SIZE * (NTIMES * 4 + ~4)   <-- the ~4 is overhead from initialization

When compiled without streaming stores, the write bytes are the same, but the read bytes are increased by an amount equal to the write bytes:

  • Read Bytes = 8 * STREAM_ARRAY_SIZE * (NTIMES * 10 + ~4)
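
Plugging in the defaults (STREAM_ARRAY_SIZE = 10,000,000, i.e. 80 MB per array) with NTIMES = 100, these formulas predict roughly:

  • With streaming stores:    ~48.3 GB read,  ~32.3 GB written
  • Without streaming stores: ~80.3 GB read,  ~32.3 GB written

so the two builds are easy to tell apart when compared against the DRAM and QPI counters.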