Stream benchmark on Skylake SP Xeon - results too low?

Alexander_S_2
Novice

I am trying to evaluate the memory performance of a HP Z8 G4 workstation equipped with 2x Intel Xeon Gold 6136. The system has 192GB of DDR4-2666 RAM populated as 12x16GB. So I ran the stream.c benchmark from https://www.cs.virginia.edu/stream/

The results I get seem too low (see attachment): 122467.4 MB/s on all 24 cores for the Triad benchmark. Re-running yields consistent results, and based on the trend from 1 to 24 cores, pinning seems to work correctly.

I used gcc (4.8.5) as a compiler: gcc -fopenmp -O -DSTREAM_ARRAY_SIZE=100000000 -mcmodel=medium stream.c -o stream

Environment settings: OMP_PROC_BIND=close, OMP_NUM_THREADS=1...24
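
For reference, a thread-count sweep with the settings above can be scripted. The following is a minimal Python sketch, not the script used for the posted results. The function names and the default path `./stream` are hypothetical; the regular expression assumes the standard result lines that stream.c prints ("Copy:", "Scale:", "Add:", "Triad:" followed by the best rate in MB/s).

```python
import os
import re
import subprocess

# Matches the result lines printed by stream.c, e.g. "Triad:   122467.4   0.0196 ..."
RATE_LINE = re.compile(r"^(Copy|Scale|Add|Triad):\s+([0-9.]+)", re.MULTILINE)


def parse_stream_rates(output):
    """Return {kernel name: best rate in MB/s} from STREAM's text output."""
    return {name: float(rate) for name, rate in RATE_LINE.findall(output)}


def run_stream(threads, binary="./stream"):
    """Run STREAM once with the environment settings from the post.

    `binary` is a hypothetical path to the compiled benchmark.
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(threads), OMP_PROC_BIND="close")
    result = subprocess.run(
        [binary], env=env, capture_output=True, text=True, check=True
    )
    return parse_stream_rates(result.stdout)
```

Calling `run_stream(n)` for n from 1 to 24 and collecting the "Triad" entry gives the scaling trend mentioned above.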

Operating system: OpenSUSE 42.3, Kernel 4.4.180-102-default

With the same setup, I get 83860.2 MB/s for the Triad benchmark on a system with 2x Xeon E5-2687W v3 and 8x16GB DDR4-2133. Based on this result, I would have expected a higher result for the Skylake system, since its theoretical memory bandwidth is higher by a factor of 1.87. And I found this paper which reports even higher results for a dual-CPU Skylake SP system.
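
The factor of 1.87 can be checked with a short calculation. This is a sketch under stated assumptions: each populated DIMM sits on its own memory channel (6 channels per socket on the Xeon Gold 6136, 4 per socket on the E5-2687W v3), each transfer carries 8 bytes (the 64-bit data width reported by dmidecode), and the nominal DDR4 transfer rates apply. The function name is made up for illustration.

```python
def peak_bandwidth_mb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in MB/s (10^6 bytes/s, as STREAM reports)."""
    return channels * mega_transfers_per_s * bytes_per_transfer


# Assumption: one DIMM per channel, all channels populated on both sockets.
skylake = peak_bandwidth_mb_s(channels=12, mega_transfers_per_s=2666)
haswell = peak_bandwidth_mb_s(channels=8, mega_transfers_per_s=2133)

print(f"Skylake peak: {skylake} MB/s, Haswell peak: {haswell} MB/s")
print(f"Ratio of peaks: {skylake / haswell:.2f}")

# Fraction of the theoretical peak reached by the Triad results quoted above.
print(f"Skylake Triad efficiency: {122467.4 / skylake:.1%}")
print(f"Haswell Triad efficiency: {83860.2 / haswell:.1%}")
```

Under these assumptions the Skylake system reaches a noticeably smaller fraction of its theoretical peak (roughly 48%) than the Haswell system (roughly 61%), which is the gap being asked about.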

Am I doing something wrong or could there be something wrong with the system?

 

What I tried in addition to this:

Compile with different optimisations (-O3, -Ofast) and target architectures (-march=native, which chooses core-avx2): low single-digit percentage improvements. I know AVX-512 would be better, but my compiler does not recognise it, and even with AVX2 my results seem too low compared to the linked publication.

Used OMP_PLACES=cores instead of relying on OMP_PROC_BIND=close: no significant difference.

Run with numactl -l: no difference

Compile with much larger array sizes: no difference

Enabled sub-NUMA clustering: low single-digit percentage improvements.

Checked CPU clock speeds during execution: hovering around 3.3-3.6 GHz

Used gcc version 8.2.1 with flags -O3 -march=skylake-avx512: no significant difference compared to gcc 4.8.5 with -O3 -march=native

Another reason I ask: I have 2 systems with 2x Xeon E5-2687W v4 (8x16GB DDR4-2400) which score even lower than the Xeon v3 system.

 

lscpu output with sub-NUMA clustering disabled:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
Stepping:              4
CPU MHz:               1313.978
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              5985.95
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-11
NUMA node1 CPU(s):     12-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm ibrs flush_l1d md_clear constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm hwp hwp_act_window hwp_epp hwp_pkg_req intel_pt ssbd ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc pku ospke

dmidecode -t 17 output:

# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
# SMBIOS implementations newer than version 3.0 are not
# fully supported by this version of dmidecode.

Handle 0x0011, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM1
        Bank Locator: CPU0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA42
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0013, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM2
        Bank Locator: CPU0
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0014, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM3
        Bank Locator: CPU0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA33
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0016, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM4
        Bank Locator: CPU0
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0017, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM5
        Bank Locator: CPU0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AB74
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0019, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM6
        Bank Locator: CPU0
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x001A, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM7
        Bank Locator: CPU0
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x001B, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM8
        Bank Locator: CPU0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA3E
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x001D, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM9
        Bank Locator: CPU0
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x001E, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM10
        Bank Locator: CPU0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA52
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0020, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM11
        Bank Locator: CPU0
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0021, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x000E
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU0-DIMM12
        Bank Locator: CPU0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA53
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0025, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM1
        Bank Locator: CPU1
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA45
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0027, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM2
        Bank Locator: CPU1
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0028, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM3
        Bank Locator: CPU1
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA4F
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x002A, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM4
        Bank Locator: CPU1
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x002B, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM5
        Bank Locator: CPU1
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA4D
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x002D, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM6
        Bank Locator: CPU1
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x002E, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM7
        Bank Locator: CPU1
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x002F, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM8
        Bank Locator: CPU1
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA43
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0031, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM9
        Bank Locator: CPU1
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0032, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM10
        Bank Locator: CPU1
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA46
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0034, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM11
        Bank Locator: CPU1
        Type: Unknown
        Type Detail: None
        Speed: Unknown
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0035, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0023
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1-DIMM12
        Bank Locator: CPU1
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MHz
        Manufacturer: Hynix
        Serial Number: 2E81AA4E
        Asset Tag: Not Specified
        Part Number: HMA82GR7CJR4N-VK
        Rank: 1
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

 

McCalpinJohn
Honored Contributor III

The biggest problem is using gcc -- you won't get streaming stores.  Using "ordinary" stores increases the memory traffic by 50% for the Copy and Scale kernels and by 33% for the Add and Triad kernels.

Single-rank DIMMs don't have enough banks to deliver full bandwidth for workloads with mixed reads and writes.  On Xeon E5-2690 v3 (Haswell), I saw 10%-20% slowdowns with one single-rank DIMM per channel instead of one dual-rank DIMM per channel.

Disabling streaming stores with the Intel compiler gives ~145 GB/s using all 24 cores per socket on a 2s Xeon Platinum 8160 system (with one dual-rank DIMM per channel).  Your 122 GB/s is about 16% lower, which is in the expected range.
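
The 50% and 33% figures follow from counting arrays. STREAM counts one transfer per array read or written, but with ordinary (write-allocate) stores each destination cache line is read from memory before it is overwritten, so every written array costs two transfers. The sketch below works this out; the function and dictionary names are made up for illustration, and it assumes every store misses the cache.

```python
# Arrays touched per loop iteration, as STREAM counts them: (reads, writes).
KERNELS = {"Copy": (1, 1), "Scale": (1, 1), "Add": (2, 1), "Triad": (2, 1)}


def write_allocate_overhead(reads, writes):
    """Extra memory traffic, relative to what STREAM counts, caused by write-allocate."""
    counted = reads + writes
    actual = reads + 2 * writes  # each written line is read first, then written back
    return actual / counted - 1.0


for name, (reads, writes) in KERNELS.items():
    print(f"{name}: +{write_allocate_overhead(reads, writes):.0%} traffic")

# DRAM traffic implied by the reported Triad rate if no streaming stores are used.
reported_triad = 122467.4  # MB/s, from the original post
implied = reported_triad * (1 + write_allocate_overhead(*KERNELS["Triad"]))
print(f"Implied Triad traffic: {implied:.0f} MB/s")

# Gap to the ~145 GB/s reference measured without streaming stores.
print(f"Below reference by: {1 - reported_triad / 145000:.1%}")
```

So the reported 122 GB/s corresponds to roughly 163 GB/s of actual memory traffic, and the gap to the 145 GB/s reference comes out at about 16%, matching the estimate above.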

Alexander_S_2
Novice

Ok, thank you for your input.

Now at least I know why I try to insist on dual-rank DIMMs when purchasing new systems.

So if I used the Fortran version of the benchmark with ifort, I should be able to get streaming stores and thus higher measured values? And gfortran can't use streaming stores either?

McCalpinJohn
Honored Contributor III

I don't think that the GCC Fortran compiler will generate streaming stores, either, but it has been many years since I have used it....

At optimization level -O3, gcc should replace the Copy kernel with a call to a hand-optimized memcpy routine that uses streaming stores.  If the Copy kernel is a lot faster than the Scale kernel when using all cores, this may be happening.
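
One way to apply this check to the numbers STREAM reports is sketched below. If Copy uses streaming stores it moves 2 arrays per iteration, while Scale with ordinary stores moves 3, so in a purely bandwidth-bound run the reported Copy rate could approach 1.5x the Scale rate. The function names and the 1.2 threshold are arbitrary assumptions for illustration, not values from this thread.

```python
def copy_to_scale_ratio(copy_rate, scale_rate):
    """Ratio of the reported Copy rate to the reported Scale rate (same units)."""
    return copy_rate / scale_rate


def copy_probably_uses_streaming_stores(copy_rate, scale_rate, threshold=1.2):
    """Heuristic: Copy much faster than Scale hints at a streaming-store memcpy.

    The threshold is an assumed cut-off; the upper bound for a purely
    bandwidth-bound run would be 1.5 (3 arrays moved vs. 2).
    """
    return copy_to_scale_ratio(copy_rate, scale_rate) >= threshold
```

Both rates should come from the same all-cores run, since the ratio only means something when both kernels are limited by memory bandwidth.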
