Hello,
I want to measure the performance of a simple string-matching application I wrote when 8 instances of it (8 independent processes, not threads) run on CPU B while the data files reside in MEM A, i.e., all data moves across QPI. The system is an Intel Xeon E5-2630 v3; early snoop is disabled in the BIOS to get higher QPI bandwidth, and HyperThreading and Turbo Boost are disabled.
I implemented my string matching with mmap(), hoping to mitigate OS overhead (system calls) once the page cache is "hot" (all data is in the page cache and mapped). Memory is large enough to hold all the files, so no swapping occurs.
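For reference, the mapping part of the program looks roughly like this (a simplified sketch, not the exact code; the real matching kernel is more involved, and the naive scan below just stands in for it):

/* Simplified sketch of the mmap-based file access (not the full matcher). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <file> <pattern>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; once the pages are in the page cache,
       subsequent runs touch the data without read() system calls. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Naive scan standing in for the real string-matching kernel. */
    size_t plen = strlen(argv[2]);
    long matches = 0;
    for (size_t i = 0; i + plen <= (size_t)st.st_size; i++)
        if (memcmp(data + i, argv[2], plen) == 0)
            matches++;

    printf("%ld matches\n", matches);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}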
To get to this scenario, I launch the processes as follows:
numactl -C 8 -m 0 STR_MTCH file.1
numactl -C 9 -m 0 STR_MTCH file.2
...
numactl -C 15 -m 0 STR_MTCH file.8
Using PCM I see that the DRAM bandwidth of socket 0 is ~23 GB/s, whereas the QPI bandwidth is ~13 GB/s. I can't understand how this makes sense: where does the bandwidth mismatch come from? The DRAM bandwidth looks like roughly 2x the QPI bandwidth. If I multiply the DRAM bandwidth by the execution time I get the file size (which makes sense); on the other hand, the QPI maximum is ~16 GB/s if I'm not mistaken.
pcm.x output while running 8 instances of my simple program:
Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | TEMP
 0   0   0.00   0.44   0.01   0.72   8889    119 K   0.93   0.23   0.00   0.01   1568   58
 1   0   0.00   0.47   0.01   0.87   11 K     62 K   0.82   0.30   0.00   0.01   2240   60
 2   0   0.00   0.40   0.00   0.85   4806     42 K   0.89   0.28   0.00   0.01   1376   60
 3   0   0.00   0.56   0.01   0.91   9567     63 K   0.85   0.35   0.00   0.01   5408   61
 4   0   0.00   0.56   0.01   0.92   7507     58 K   0.87   0.36   0.00   0.01   1888   58
 5   0   0.00   0.59   0.01   0.91   7476     54 K   0.86   0.40   0.00   0.01   1536   58
 6   0   0.00   0.38   0.00   0.81   1920     26 K   0.93   0.29   0.00   0.01   1056   56
 7   0   0.00   0.54   0.00   0.93   11 K     45 K   0.75   0.34   0.00   0.01   1152   60
 8   1   0.53   0.54   0.98   1.00   14 M     16 M   0.15   0.15   0.01   0.01   2464   53
 9   1   0.53   0.54   0.98   1.00   14 M     17 M   0.15   0.15   0.01   0.01   2528   54
10   1   0.53   0.54   0.99   1.00   14 M     17 M   0.15   0.16   0.01   0.01   2336   54
11   1   0.53   0.54   0.99   1.00   14 M     17 M   0.15   0.15   0.01   0.01   2752   54
12   1   0.53   0.54   0.98   1.00   14 M     17 M   0.15   0.15   0.01   0.01   2592   56
13   1   0.53   0.54   0.98   1.00   14 M     17 M   0.15   0.16   0.01   0.01   2720   54
14   1   0.53   0.54   0.99   1.00   14 M     17 M   0.15   0.16   0.01   0.01   2720   54
15   1   0.53   0.54   0.98   1.00   14 M     17 M   0.15   0.15   0.01   0.01   2912   53
---------------------------------------------------------------------------------------------------------------
SKT 0    0.00   0.50   0.01   0.85   63 K    472 K   0.87   0.31   0.00   0.01   16224   47
SKT 1    0.53   0.54   0.99   1.00   116 M   137 M   0.15   0.15   0.01   0.01   21024   48
---------------------------------------------------------------------------------------------------------------
TOTAL *  0.27   0.54   0.50   1.00   116 M   137 M   0.15   0.15   0.01   0.01   N/A     N/A

Instructions retired: 10 G ; Active cycles: 19 G ; Time (TSC): 2405 Mticks ; C0 (active,non-halted) core residency: 49.60 %
C1 core residency: 0.41 %; C3 core residency: 0.03 %; C6 core residency: 49.96 %; C7 core residency: 0.00 %;
C2 package residency: 47.37 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;
PHYSICAL CORE IPC: 0.54 => corresponds to 13.52 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.27 => corresponds to 6.70 % core utilization over time interval
---------------------------------------------------------------------------------------------------------------
Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):
         QPI0   | QPI0
---------------------------------------------------------------------------------------------------------------
SKT 0    13 G   | 82%
SKT 1    1991 M | 12%
---------------------------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic: 15 G

         | READ | WRITE | CPU energy | DIMM energy
---------------------------------------------------------------------------------------------------------------
SKT 0      23.34   0.84    24.17       12.11
SKT 1       0.07   0.02    44.75        7.32
---------------------------------------------------------------------------------------------------------------
*          23.40   0.86    68.92       19.42
Thanks,
Gil.
A lot of the QPI traffic counters on Intel processors are broken (i.e., they give incorrect results), though I seem to recall that at least some of the QPI data traffic events on Xeon E5 v3 worked correctly. I have never tested these on a single-QPI-link processor, and it is possible that either the counter or the interpretation of the counter results is not correct here.
If you want to validate the counters (and their interpretation by the pcm.x tool), I would recommend running something with well-understood memory traffic (like STREAM) in a cross-socket configuration.
I would recommend trying STREAM compiled both with and without streaming stores. The default array size of 10,000,000 is well suited to a single socket of the Xeon E5-2630 v3 (20 MiB L3 cache), since it meets the criterion of each array being at least 4x the aggregate cache size. I would increase the NTIMES parameter to 100 to reduce the uncertainty in the traffic counts caused by the initial instantiation of the pages.
When compiled with streaming stores:
- Read Bytes = 8 * STREAM_ARRAY_SIZE * (NTIMES * 6 + ~4) <-- the ~4 is overhead from initialization and validation
- Write Bytes = 8 * STREAM_ARRAY_SIZE * (NTIMES * 4 + ~4) <-- the ~4 is overhead from initialization
When compiled without streaming stores the write bytes are the same, but the read bytes are increased by an amount equal to the write bytes:
- Read Bytes = 8 * STREAM_ARRAY_SIZE * (NTIMES * 10 + ~4)
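As a sanity check before comparing against the counters, the expected totals can be computed directly from those formulas. A minimal sketch (the array size and NTIMES values below are just the ones mentioned above; adjust them to whatever you actually compile STREAM with):

/* Expected STREAM memory traffic for comparison against the PCM counters.
   The "+4" terms are the approximate overhead of array initialization
   and results validation, as in the formulas above. */
#include <stdio.h>

int main(void)
{
    const double array_size     = 10000000.0;  /* STREAM_ARRAY_SIZE (assumed default) */
    const double ntimes         = 100.0;       /* NTIMES (assumed, as suggested above) */
    const double bytes_per_elem = 8.0;         /* double precision */

    /* With streaming stores: 6 array reads and 4 array writes per iteration. */
    double rd_nt = bytes_per_elem * array_size * (ntimes * 6.0 + 4.0);
    double wr_nt = bytes_per_elem * array_size * (ntimes * 4.0 + 4.0);

    /* Without streaming stores each store also reads the target line
       (read-for-ownership), so the read traffic grows by the write volume. */
    double rd_rfo = bytes_per_elem * array_size * (ntimes * 10.0 + 4.0);
    double wr_rfo = wr_nt;

    printf("streaming stores:    read %.2f GB, write %.2f GB\n",
           rd_nt * 1e-9, wr_nt * 1e-9);
    printf("no streaming stores: read %.2f GB, write %.2f GB\n",
           rd_rfo * 1e-9, wr_rfo * 1e-9);
    return 0;
}

If the totals reported by pcm.x for the cross-socket STREAM run are close to these numbers, the counters (and their interpretation) can be trusted for your string-matching experiment.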