Re: Comparison of STREAM benchmark over different Intel Xeon Servers

sunilk · ‎07-10-2024

I have conducted some experiments with STREAM benchmarks on the following servers and observed unexpected trends:

Characteristics of Machine A:

It is a dual socket machine.

Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Codename: Skylake)

Cores per socket = 12; L3 cache = 19.25 MB; TDP=125W; Powercap range: [65W to 125W]

Total RAM at machine A = 192GB (96 GB per socket)

Characteristics of Machine B:

It is a quad socket machine.

Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz (Codename: Cooperlake)

Cores per socket = 18; L3 cache =24.75 MB; TDP=150W; Powercap range: [82W to 150W]

Total RAM at machine B = 256GB (64 GB per socket).

Characteristics of Machine C:

It is a dual socket machine.

Intel(R) Xeon(R) Gold 5418Y CPU @ 2.0GHz (Codename: Sapphire)

Cores per socket = 24; L3 cache = 45 MB; TDP=185W; Powercap range: [115W to 185W]

Total RAM at machine C = 192GB (96 GB per socket)

I ran the STREAM benchmark, where the total memory usage is 2.2 GB. I observed that the execution time on machine A decreases when we increase the power cap. However, the execution time remains constant on machines B and C when we increase the power cap from MIN to MAX.

I ran another version of STREAM, where the total memory usage is quite low (22 MB). The execution time decreases on machines A, B, and C when we increase the power cap. This suggests that we can observe certain trends when DRAM access is negligible or when pages fit into the cache.

I am unable to understand why machines B and C produce the lowest performance for the STREAM benchmark, even though B and C are of a newer generation compared to machine A.

sunilk · ‎07-11-2024

Hi @McCalpinJohn,

It would be a great help if you could share your thoughts on this.

Thanks!

sunilk · ‎07-16-2024

It would be great if anyone could explain this behaviour reported in the above post.

@connorimes, @McCalpinJohn

Thanks

McCalpinJohn · ‎07-18-2024

Performance analysis requires detailed specific information about the configurations of the hardware, software, and execution environment of the measurements.... For these systems, required information includes:

Hardware configuration -- the information above plus
- DIMM configuration: DIMM size, number of ranks per DIMM, DRAM max frequency, and location of the installed DIMMs in the various channels
  - The Xeon Gold 5318H system has 6 DRAM channels/socket, but the reported memory is not divisible by 3. Why?
  - The Xeon Gold 5418Y processor has 8 DRAM channels/socket, but the reported memory is divisible by 3. Why?
- Details on how power capping was controlled
- HyperThreading enabled/disabled?
Software
- Benchmark code
  - any changes from the baseline stream.c revision 5.10
  - any changes from the default size of 80,000,000 "double" elements per array
- Compiler name/version and compiler options used
Execution environment
- OS type & version
- OS configuration with respect to HugePages, Automatic NUMA Page Migration
- Number of OpenMP threads requested
- Execution environment with respect to binding of memory and binding of process threads (sometimes controlled by environment variables, sometimes by "launcher" programs)
Results
- Problem size and number of iterations output from the code
- Performance output of the code (Copy, Scale, Add, Triad output lines)
- Report on the number of executions of each configuration and (if run more than once) on the variability in the performance results
- For cases with non-standard configurations (such as power-capping), externally measured values for average frequency sustained during code execution and actual power consumption during code execution are essential for understanding what is happening.

In server-class multicore processors with 4 or more DRAM channels, multiple cores are always required to reach maximum memory bandwidth, but the number of cores needed depends on the processor generation, the total DRAM bandwidth available, and whether or not streaming store instructions are used. (Also note that in some processor generations, streaming stores are silently converted to ordinary stores in four-socket or larger systems.)

STREAM performance depends on number of cores used and the average frequency of those cores in non-trivial ways. Depending on the processor generation, I have seen significant (>20%) reductions in power consumption by running STREAM with fewer cores running at frequencies below the default (nominal) value.

sunilk · ‎07-20-2024

Thanks, John, for the detailed answer. I will surely provide all the details you asked. Give me some time; I will reply soon with all the necessary details.

sunilk · ‎07-23-2024

Hi @McCalpinJohn,
I have included all the details you need.

Additional Hardware configuration of Xeon Gold 5318H
- DIMM configuration: (Part Number: MTA18ASF4G72PDZ-3G2F1)
  - DIMM size: 32GB DDR4 SDRAM
  - Number of ranks per DIMM: 2 (2Rx8)
  - DRAM max frequency: 3200 MHz
  - Location of all DIMMs
    - Every socket has six channels (Channel code: 0 to 5) (each channel has two DIMM slots (DIMM code: 0 to 1))
    - Location of two installed DIMM chips at each socket (same location on all four sockets)
      - The first chip at Channel0_DIMM0
      - The second chip at Channel3_DIMM0
  - The Xeon Gold 5318H system has 6 DRAM channels/sockets, but the reported memory is not divisible by 3. Why?
    - Every socket has two chips of 32GB RAM (64GB per socket). It is the default configuration provided by the vendor. In your opinion, what could be the optimal number of chips of 32GB that have to be installed?
- DRAM TDP per Socket: 9W (combined TDP of DIMM chips installed on each socket)
- Details on how power capping was controlled
  - Power capping is applied by using RAPL_POWER_LIMIT_MSR (0x610)
- HyperThreading is turned on/off: Enabled, but all threads are pinned to physical cores.
- Turbo Frequency: Enable
Additional Hardware configuration of Xeon Gold 5418Y
- DIMM configuration: (Part Number: M321R4GA0BB0-CQKET)
  - DIMM size: 32GB DDR5 SDRAM
  - Number of ranks per DIMM: 1 (1Rx4)
  - DRAM max frequency: 4800 MHz
  - Location of all DIMMs
    - Every socket has eight channels (Channel code: 0 to 7) (each channel has one DIMM slot (DIMM code: 0))
    - Location of three installed DIMM chips at each socket
      - The first chip at Channel0_DIMM0 (same location at both sockets)
      - The second chip at Channel2_DIMM0 (same location at both sockets)
      - The third chip at Channel4_DIMM0 (Channel6DIMM0 is used in another socket, i.e.,
  - The Xeon Gold 5418Y processor has 8 DRAM channels/sockets, but the reported memory is divisible by 3. Why?
    - Every socket has three chips of 32GB RAM (96GB per socket). It is the default configuration provided by the vendor. In your opinion, what could be the optimal number of chips of 32GB that have to be installed?
- DRAM TDP per Socket: 33W (combined TDP of DIMM chips installed on each socket)
- Details on how power capping was controlled
  - Power capping is applied by using RAPL_POWER_LIMIT_MSR (0x610)
- HyperThreading is turned on/off: Enable, but all threads are pinned to physical cores
- Turbo Frequency: Enable

Further questions have been answered in the below reply as the text limit has been reached.

sunilk · ‎07-23-2024

Software
- Benchmark code
  - any changes from the baseline stream.c revision 5.10
    - I ran all the experiments on stream.c revision
  - any changes from the default size of 80,000,000 'double' elements per array
    - I changed the size from default to 100M and Iterations from 10 to 100.
- Compiler name/version and compiler options used
  - GCC 10.5.0 and compile with -O3 -fopenmp
Execution env
- OS type & version
  - Ubuntu 20.04 LTS
- OS configuration with respect to HugePages, Automatic NUMA Page Migration
- Hugepagesize: 2048 kB
- Automatic Numa Balancing: Enabled, but we are using numactl -i all
- Number of OpenMP threads requested
  - 72 threads in Xeon Gold 5318H system (4-socket machine)
  - 48 threads in Xeon Gold 5418Y system (2-socket machine)
- Execution environment with respect to binding of memory and binding of process threads (sometimes controlled by environment variables, sometimes by 'launcher' programs)
  - I have launched each experiment using all physical cores available in the machine. So, I have used GOMP_CPU_AFFINITY to bind all threads to physical cores and numactl -i all to enable interleaved page allocation.

Further questions have been answered in the below reply as the text limit has been reached.

sunilk · ‎07-23-2024

Results

Problem size and number of iterations output from the code
- /*********Output from code********/
- Array size = 100000000 (elements), Offset = 0 (elements)
- Memory per array = 762.9 MiB (= 0.7 GiB).
- Total memory required = 2288.8 MiB (= 2.2 GiB).
- Each kernel will be executed 100 times.
- /************************************/
Performance output of the code (Copy, Scale, Add, Triad output lines)
- Output on Xeon Gold 5318H system
  -------------------------------------------------------------
  Function Best Rate MB/s Avg time Min time Max time
  Copy: 115099.7 0.014286 0.013901 0.016175
  Scale: 92615.0 0.017340 0.017276 0.017871
  Add: 101803.5 0.023628 0.023575 0.024091
  Triad: 102617.1 0.023490 0.023388 0.024139
  -------------------------------------------------------------
- Output on Xeon Gold 5418Y system
  -------------------------------------------------------------
  Function Best Rate MB/s Avg time Min time Max time
  Copy: 34175.6 0.047251 0.046817 0.047635
  Scale: 29490.9 0.054459 0.054254 0.054700
  Add: 34862.4 0.068916 0.068842 0.069170
  Triad: 34842.7 0.068970 0.068881 0.069075
  -------------------------------------------------------------
- Report on the number of executions of each configuration and (if run more than once) on the variability in the performance results
  - I ran 10 invocations of each configurations on both systems, but there is negligible deviation in results.
- For cases with non-standard configurations (such as power-capping), externally measured values for average frequency sustained during code execution and actual power consumption during code execution are essential for understanding what is happening.
  - I am attaching three timeline graphs (core frequency, uncore frequency, and power consumption) for both systems. Legends in graphs represent different powercap settings (all values are w.r.t system TDP).
  - Avg Core Frequency during stream execution on 5318H system
  - Avg Unore Frequency during stream execution on 5318H system
  - Avg power consumption during stream execution on 5318H system
  - Avg Core Frequency during stream execution on 5418Y system
  - Avg Uncore Frequency during stream execution on 5418Y system
  - Avg power consumption during stream execution on 5418Y system
I have tried to add all the details, whichever you asked. Please let me know If you need more information.

Regards,

sunilk · ‎07-24-2024

.....

sunilk · ‎07-31-2024

Hi @McCalpinJohn,

Additionally, I would like to add the following points to my previous question:

The systems mentioned above do not have Intel@Optane Memory; however, they do support Optane memory. Is it necessary to populate the Intel@Optane chip also?
In the 5318H system, RAM is populated only in two DRAM channels out of six per socket, and in the 5418Y system, RAM is populated only in three DRAM channels out of eight per socket. As far as I understand, this indicates limited memory bandwidth utilization in both systems. Due to this limited bandwidth, I do not observe any change in performance for memory-bound applications when altering the power cap of the systems. Could you please advise on the number of RAM sticks needed to optimize these systems so that they work for memory-bound applications as well?
Could you also explain the difference between the two measurement units of DRAM speed: megahertz (MHz) and mega transfers per second (MT/s)?
What are the benefits of using dual-rank memory over single-rank memory (1R vs. 2R)?
What are the criteria for choosing a suitable data width of memory for our system (2Rx4 vs. 2Rx8 or 1Rx4 vs 1Rx8)?

Additionally, please let me know if points 4 and 5 depend on the processor model. If so, what would be suitable for the systems mentioned above?

Thanks!

McCalpinJohn · ‎08-06-2024

The primary performance limiter here is the UPI interconnect between the chips. STREAM is normally run with each thread accessing local memory, but your run configuration forces each thread's memory to be distributed across all of the sockets.

Analysis: Xeon Gold 5318H

The 4-socket Xeon Gold 5318H processor supports up to 6 UPI links, which allows two links to each of the other three sockets in a four-socket configuration. If your system is actually wired to use all of the links in this way, then half of the traffic has to cross the 8 links in any bisection of the interconnect.

Using the gcc compiler, the STREAM Copy kernel will be replaced with a call to an optimized memcpy() routine using streaming stores, while the other three kernels will be compiled using ordinary allocating stores. With allocating stores the actual data traffic is higher than the amount that STREAM assumes when computing "bandwidth" values.

For the Copy and Scale kernels the write-allocate increases the memory traffic from 2 words per iteration (1 Read + 1 Write) to 3 words per iteration (1 Read + 1 Write Allocate + 1 Writeback).

For the Add and Triad kernels, using allocating writes increases the memory traffic from 3 words per iteration (2 Read + 1 Write) to 4 words per iteration (2 Reads + 1 Write Allocate + 1 Writeback).

I assume that the UPI links are running at 10.4 GHz, but determining the maximum sustainable transfer speed on these links is complicated (mostly due to minimal documentation on how the UPI protocol packs commands and data elements in to "flits" to be transmitted over the 20 data wires of each interface).

For this use case, each direction of each UPI link is carrying both command/response and data packets, so we can't assume that all of the bits are available for data.

As a ballpark values, I typically assume each interface can average about 80% data utilization when there is no interference between commands and data (e.g., for a core in socket 0 reading from memory in socket 1, the read request commands are all going outbound from s0 to s1, while all the data being returned is going in the opposite direction from s1 to s0.)

With interference between commands and data competing for the same link direction(s) I expect lower average data rates -- something near 2/3 efficiency, or about 17.3 GB/s per direction per UPI link (34.7 GB/s per direction for the pair of links between any two sockets).

For the STREAM Scale, Add, and Triad kernels gcc will not generate streaming stores, so I assume that extra write allocate traffic will be required -- one extra Read For Ownership in each kernel. I can then count how much traffic needs to go over each link for this interleaved memory case and compare that to the available link BW. In every case the aggregate bandwidth for a four-socket run is limited to four times the bandwidth of each socket-to-socket connection (2 UPI links). Assuming 2/3 data payload efficiency, this projects to 4*2*17.333 GB/s = 138.666 GB/s for the *actual* traffic. Since the STREAM benchmark kernels do not count write allocate traffic, the expected reported values are

kernelBW countedExp 4s BWRep 4s BWshortfall

Copy_str 2/2 138.667 115.0997 -17.0%

Scale 2/3 92.444 92.615 +0.2%

Add 3/4 104.000 101.8035 -2.1%

Triad 3/4 104.000 102.6171 -1.3%

As you can see, for the Copy_str entry (assuming streaming stores), there are no allocates, so all of the bandwidth is counted. The reported performance falls well short of the predicted value, while predictions are nearly perfect for the Scale, Add, and Triad kernels.

Why is the Copy kernels with streaming stores so much slower? A significant difference in this case comes from the details of the coherence protocol. For the case of streaming stores, the memory directory in the processor cannot be consulted, so all streaming stores have to generate cross-socket snoops. The cross-socket snoop commands and their responses compete with the data transfers for bandwidth on the UPI interface, with a net loss of about 17% in this case. This seems reasonable for a system with 64-byte cache lines -- a 64-bit address embedded in a snoop command takes 1/8 the space of a data cache line, so an overhead of 12% or more seems easy to generate.

For the Xeon Gold 5418Y system all the details are different, but the non-local memory accesses forced by using "numactl -i all" are likely to make the UPI links the primary limiter in that case as well.

sunilk · ‎08-07-2024

Thanks @McCalpinJohn for your detailed answers,

I was also assuming the same reason you specified behind this behaviour. However, I have two other configurations to run memory-intensive applications (including STREAM), and I have observed the same performance issue with different powercap limits. The following are the two other configurations I have tried:

I ran an application on one socket with all physical cores of the same socket and bound memory allocation to local DRAM using numactl -m $socket_id.
I ran an application with all cores available on all four sockets without numactl (default policy used by Linux).

I noticed that the performance of the application remains constant when we change the powercap from the minimum to the maximum range.

Regards,

McCalpinJohn · ‎08-09-2024

For microbenchmark codes designed to have their performance limited by a single component of the hardware, why would you expect performance to vary with power consumption. At very very low power (CPU frequency) levels there may not be enough processor concurrency to saturate the UPI or DRAM interfaces, but once the performance reaches the limit, adding more power (frequency) does not change the bottleneck -- it just causes the cores to spend more cycles stalled while waiting for the UPI or DRAM transfers.

sunilk · ‎08-15-2024

Hi @McCalpinJohn,

Can performance increase by adding more DRAM chips to populate all empty memory channels?