I am trying to evaluate the memory performance of an HP Z8 G4 workstation equipped with 2x Intel Xeon Gold 6136. The system has 192GB of DDR4-2666 RAM populated as 12x16GB. So I ran the stream.c benchmark from https://www.cs.virginia.edu/stream/.
The results seem too low (see attachment; I don't know how to embed images here): 122467.4 MB/s on all 24 cores for the Triad kernel. Re-running yields consistent results, and based on the trend from 1 to 24 cores, pinning seems to work correctly.
I used gcc (4.8.5) as a compiler: gcc -fopenmp -O -DSTREAM_ARRAY_SIZE=100000000 -mcmodel=medium stream.c -o stream
Environment settings: OMP_PROC_BIND=close, OMP_NUM_THREADS=1...24
Operating system: OpenSUSE 42.3, Kernel 4.4.180-102-default
With the same setup, I get 83860.2 MB/s for the Triad kernel on a system with 2x Xeon E5-2687W v3 and 8x16GB DDR4-2133. Based on this result, I would have expected more from the Skylake system: its measured Triad rate is only about 1.46 times higher, although its theoretical memory bandwidth is higher by a factor of 1.87 (6 channels of DDR4-2666 versus 4 channels of DDR4-2133 per socket). I also found this paper, which reports even higher results for a dual-CPU Skylake SP system.
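The factor of 1.87 can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes 8 bytes per transfer per channel, 6 memory channels per socket on the Xeon Gold 6136, and 4 on the Xeon E5-2687W v3. The function name is made up for illustration.

```python
# Theoretical peak DRAM bandwidth of the two dual-socket systems in this post.
# Assumption: each channel moves 8 bytes (64 data bits) per memory transfer.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_mb_s(sockets, channels_per_socket, mega_transfers_per_s):
    """Return the theoretical peak bandwidth in MB/s (10^6 bytes per second)."""
    return sockets * channels_per_socket * mega_transfers_per_s * BYTES_PER_TRANSFER


skylake = peak_bandwidth_mb_s(2, 6, 2666)  # 2x Xeon Gold 6136, DDR4-2666
haswell = peak_bandwidth_mb_s(2, 4, 2133)  # 2x Xeon E5-2687W v3, DDR4-2133

peak_ratio = skylake / haswell
measured_ratio = 122467.4 / 83860.2  # Triad rates quoted in this post

print(f"Skylake peak: {skylake} MB/s, Haswell peak: {haswell} MB/s")
print(f"peak ratio: {peak_ratio:.2f}, measured Triad ratio: {measured_ratio:.2f}")
```

The gap between the two ratios is what the question is about: the hardware peak grows much more than the measured Triad rate does.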
Am I doing something wrong, or could there be something wrong with the system?
What I tried in addition to this:
Compiled with different optimisations (-O3, -Ofast) and target architectures (-march=native, which chooses core-avx2): low single-digit percentage improvements. I know AVX-512 would be better, but my compiler does not recognise it, and even with AVX2 my results seem too low compared to the linked publication.
Used OMP_PLACES=cores instead of relying on OMP_PROC_BIND=close: no significant difference.
Ran with numactl -l: no difference.
Compiled with much larger array sizes: no difference.
Enabled sub-NUMA clustering: low single-digit percentage improvements.
Checked CPU clock speeds during execution: hovering around 3.3-3.6 GHz
Used gcc version 8.2.1 with flags -O3 -march=skylake-avx512: no significant difference compared to gcc 4.8.5 with -O3 -march=native
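That larger arrays made no difference is consistent with the sizing guideline in stream.c, which asks for each array to be at least 4 times the size of the available cache. The sketch below checks this for the compile line in this post, using the cache sizes from the lscpu output. Counting L2 plus L3 as "available cache" is a conservative assumption of this sketch, not something stated in the thread.

```python
# Check the STREAM sizing guideline: each array >= 4x the available cache.
ARRAY_ELEMENTS = 100_000_000  # -DSTREAM_ARRAY_SIZE from the compile line
BYTES_PER_ELEMENT = 8         # STREAM_TYPE defaults to double

SOCKETS = 2
CORES = 24
L3_KIB_PER_SOCKET = 25344     # "L3 cache" from lscpu
L2_KIB_PER_CORE = 1024        # "L2 cache" from lscpu

array_bytes = ARRAY_ELEMENTS * BYTES_PER_ELEMENT

# Assumption: count every L2 and every L3 as cache the arrays must exceed.
cache_bytes = (SOCKETS * L3_KIB_PER_SOCKET + CORES * L2_KIB_PER_CORE) * 1024

large_enough = array_bytes >= 4 * cache_bytes

print(f"array: {array_bytes / 1e6:.0f} MB, cache: {cache_bytes / 1e6:.1f} MB")
print(f"array is {array_bytes / cache_bytes:.1f}x the cache, guideline met: {large_enough}")
```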
Another reason I ask: I have 2 systems with 2x Xeon E5-2687W v4 (8x16GB DDR4-2400) which score even lower than the Xeon v3 system.
lscpu output with sub-NUMA clustering disabled:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
Stepping: 4
CPU MHz: 1313.978
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 5985.95
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm ibrs flush_l1d md_clear constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm hwp hwp_act_window hwp_epp hwp_pkg_req intel_pt ssbd ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc pku ospke
dmidecode -t 17 output:
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
# SMBIOS implementations newer than version 3.0 are not
# fully supported by this version of dmidecode.

All 24 entries are DMI type 17, 40 bytes, with Error Information Handle: Not Provided, Form Factor: DIMM, Set: None. Array Handle is 0x000E and Bank Locator is CPU0 for the CPU0 slots; Array Handle is 0x0023 and Bank Locator is CPU1 for the CPU1 slots.

Each populated slot reports:
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2666 MHz
Manufacturer: Hynix
Asset Tag: Not Specified
Part Number: HMA82GR7CJR4N-VK
Rank: 1
Configured Clock Speed: 2666 MHz
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Each empty slot reports Size: No Module Installed, Type Detail: None, Manufacturer, Serial Number, Asset Tag and Part Number: Not Specified, and Unknown for all remaining fields.

Handle   Locator       Module      Serial Number
0x0011   CPU0-DIMM1    16384 MB    2E81AA42
0x0013   CPU0-DIMM2    empty       Not Specified
0x0014   CPU0-DIMM3    16384 MB    2E81AA33
0x0016   CPU0-DIMM4    empty       Not Specified
0x0017   CPU0-DIMM5    16384 MB    2E81AB74
0x0019   CPU0-DIMM6    empty       Not Specified
0x001A   CPU0-DIMM7    empty       Not Specified
0x001B   CPU0-DIMM8    16384 MB    2E81AA3E
0x001D   CPU0-DIMM9    empty       Not Specified
0x001E   CPU0-DIMM10   16384 MB    2E81AA52
0x0020   CPU0-DIMM11   empty       Not Specified
0x0021   CPU0-DIMM12   16384 MB    2E81AA53
0x0025   CPU1-DIMM1    16384 MB    2E81AA45
0x0027   CPU1-DIMM2    empty       Not Specified
0x0028   CPU1-DIMM3    16384 MB    2E81AA4F
0x002A   CPU1-DIMM4    empty       Not Specified
0x002B   CPU1-DIMM5    16384 MB    2E81AA4D
0x002D   CPU1-DIMM6    empty       Not Specified
0x002E   CPU1-DIMM7    empty       Not Specified
0x002F   CPU1-DIMM8    16384 MB    2E81AA43
0x0031   CPU1-DIMM9    empty       Not Specified
0x0032   CPU1-DIMM10   16384 MB    2E81AA46
0x0034   CPU1-DIMM11   empty       Not Specified
0x0035   CPU1-DIMM12   16384 MB    2E81AA4E
The biggest problem is using gcc: you won't get streaming stores. With "ordinary" stores, each target cache line has to be read from memory before it is overwritten (write-allocate), and STREAM does not count that extra read. This increases the memory traffic by 50% for the Copy and Scale kernels (3 transfers instead of the 2 that are counted) and by 33% for the Add and Triad kernels (4 instead of 3).
Single-rank DIMMs don't have enough banks to deliver full bandwidth for workloads with mixed reads and writes. On Xeon E5-2690 v3 (Haswell), I saw 10%-20% slowdowns with one single-rank DIMM per channel instead of one dual-rank DIMM per channel.
Disabling streaming stores with the Intel compiler gives ~145 GB/s using all 24 cores per socket on a two-socket Xeon Platinum 8160 system (with one dual-rank DIMM per channel). Your 122 GB/s is about 16% lower, which is in the expected range.
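The percentages above follow from counting cache-line transfers per kernel. The sketch below does that bookkeeping and applies it to the Triad number from the question. It assumes every store without streaming stores costs exactly one extra read; the function names are made up for illustration, and the peak value assumes 6 channels of DDR4-2666 per socket at 8 bytes per transfer.

```python
# (reads, writes) per loop iteration, as counted by STREAM for each kernel.
KERNELS = {"Copy": (1, 1), "Scale": (1, 1), "Add": (2, 1), "Triad": (2, 1)}


def extra_traffic_fraction(reads, writes):
    """Extra traffic from write-allocate, relative to what STREAM counts."""
    counted = reads + writes
    actual = counted + writes  # each written line is also read first
    return (actual - counted) / counted


def actual_traffic_mb_s(reported_mb_s, reads, writes):
    """Convert a reported STREAM rate into actual DRAM traffic (MB/s)."""
    return reported_mb_s * (reads + 2 * writes) / (reads + writes)


for name, (reads, writes) in KERNELS.items():
    print(f"{name}: +{extra_traffic_fraction(reads, writes):.0%} traffic")

reported_triad = 122467.4          # MB/s, from the question
peak = 2 * 6 * 2666 * 8            # MB/s, assumed theoretical peak
actual_triad = actual_traffic_mb_s(reported_triad, *KERNELS["Triad"])
shortfall = 1 - reported_triad / 145000.0  # versus the ~145 GB/s reference

print(f"actual Triad traffic: {actual_triad:.0f} MB/s "
      f"({actual_triad / peak:.0%} of the assumed peak)")
print(f"shortfall versus 145 GB/s: {shortfall:.0%}")
```

Under these assumptions the memory system is moving noticeably more data than the reported Triad rate suggests, which is why the result looks worse than it is.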
Ok, thank you for your input.
Now at least I know why I try to insist on dual-rank DIMMs when purchasing new systems.
So if I used the Fortran version of the benchmark with ifort I should be able to get streaming stores and thus higher measured values? Gfortran can't use streaming stores either?
I don't think that the GCC Fortran compiler will generate streaming stores, either, but it has been many years since I have used it....
At optimization level -O3 with gcc, you should get the Copy kernel replaced by a hand-optimized memcpy routine that uses streaming stores. If the Copy kernel is a lot faster than the Scale kernel when using all cores, this may be happening.
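A quick way to apply this check is to compare the Copy and Scale rates in the STREAM output. The sketch below parses the result table and flags a large gap. The sample numbers are invented for illustration and are not measurements from any system in this thread, and the 1.2 threshold is an arbitrary assumption.

```python
import re

# Matches result lines such as "Copy:          150000.0     0.010700 ..."
RATE_LINE = re.compile(r"^(Copy|Scale|Add|Triad):\s+([0-9.]+)", re.MULTILINE)


def parse_best_rates(stream_output):
    """Return {kernel name: best rate in MB/s} from STREAM's text output."""
    return {name: float(rate) for name, rate in RATE_LINE.findall(stream_output)}


def copy_looks_replaced(rates, threshold=1.2):
    """Hint that Copy may be using streaming stores while Scale is not.

    The threshold is a hypothetical cut-off, not a documented value.
    """
    return rates["Copy"] / rates["Scale"] > threshold


# Invented sample output, laid out like STREAM's result table.
SAMPLE = """\
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          150000.0     0.010700     0.010667     0.010750
Scale:         110000.0     0.014600     0.014545     0.014650
Add:           120000.0     0.020100     0.020000     0.020200
Triad:         120000.0     0.020100     0.020000     0.020200
"""

rates = parse_best_rates(SAMPLE)
print(rates)
print("Copy much faster than Scale:", copy_looks_replaced(rates))
```

To use it on a real run, pass the captured output of the stream binary to parse_best_rates instead of SAMPLE.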