I have conducted some experiments with STREAM benchmarks on the following servers and observed unexpected trends:
Characteristics of Machine A:
It is a dual socket machine.
Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Codename: Skylake)
Cores per socket = 12; L3 cache = 19.25 MB; TDP=125W; Powercap range: [65W to 125W]
Total RAM at machine A = 192GB (96 GB per socket)
Characteristics of Machine B:
It is a quad socket machine.
Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz (Codename: Cooper Lake)
Cores per socket = 18; L3 cache = 24.75 MB; TDP=150W; Powercap range: [82W to 150W]
Total RAM at machine B = 256GB (64 GB per socket).
Characteristics of Machine C:
It is a dual socket machine.
Intel(R) Xeon(R) Gold 5418Y CPU @ 2.0GHz (Codename: Sapphire Rapids)
Cores per socket = 24; L3 cache = 45 MB; TDP=185W; Powercap range: [115W to 185W]
Total RAM at machine C = 192GB (96 GB per socket)
I ran the STREAM benchmark, where the total memory usage is 2.2 GB. I observed that the execution time on machine A decreases when we increase the power cap. However, the execution time remains constant on machines B and C when we increase the power cap from MIN to MAX.
I ran another version of STREAM, where the total memory usage is quite low (22 MB). In this case the execution time decreases on machines A, B, and C when we increase the power cap. This suggests that the power cap affects execution time on all three machines only when DRAM access is negligible, i.e., when the working set fits into the cache.
I am unable to understand why machines B and C produce the lowest performance for the STREAM benchmark, even though B and C are of a newer generation compared to machine A.
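As a rough check on the cache-fit reasoning above, the sketch below computes the STREAM footprint for two array sizes and compares it with the L3 capacities listed for machines A, B, and C. The 1,000,000-element size for the small run is an assumption (the post only says 22 MB), and the comparison ignores L2 capacity and how the data is actually spread across sockets.

```python
BYTES_PER_DOUBLE = 8
STREAM_ARRAYS = 3  # STREAM works on three arrays: a, b and c

# (L3 per socket, number of sockets) as listed in the post.
# The quoted "MB" values are treated as MiB here (assumption).
MACHINES = {"A": (19.25, 2), "B": (24.75, 4), "C": (45.0, 2)}


def stream_footprint_mib(elements_per_array: int) -> float:
    """Total size of the three STREAM arrays in MiB."""
    return STREAM_ARRAYS * elements_per_array * BYTES_PER_DOUBLE / 2**20


def l3_fit_report(footprint_mib: float) -> dict:
    """For each machine, report whether the footprint is below the L3 size
    of one socket and below the summed L3 size of all sockets."""
    report = {}
    for name, (l3_mib, sockets) in MACHINES.items():
        report[name] = {
            "fits_one_socket": footprint_mib <= l3_mib,
            "fits_all_sockets": footprint_mib <= l3_mib * sockets,
        }
    return report


if __name__ == "__main__":
    # 1_000_000 is a hypothetical size for the "22 MB" run;
    # 100_000_000 is the size reported later in the thread.
    for n in (1_000_000, 100_000_000):
        size = stream_footprint_mib(n)
        print(f"N={n}: {size:.1f} MiB", l3_fit_report(size))
```

Under these assumptions the small run is at or below the combined L3 of every machine, while the 100M-element run is far larger than any cache and must stream from DRAM.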
It would be great if anyone could explain the behaviour reported in the post above.
@connorimes, @McCalpinJohn 
Thanks
Performance analysis requires detailed, specific information about the configurations of the hardware, software, and execution environment of the measurements. For these systems, required information includes:
- Hardware configuration -- the information above plus
  - DIMM configuration: DIMM size, number of ranks per DIMM, DRAM max frequency, and location of the installed DIMMs in the various channels
    - The Xeon Gold 5318H system has 6 DRAM channels/socket, but the reported memory is not divisible by 3. Why?
    - The Xeon Gold 5418Y processor has 8 DRAM channels/socket, but the reported memory is divisible by 3. Why?
  - Details on how power capping was controlled
  - HyperThreading enabled/disabled?
- Software
  - Benchmark code
    - any changes from the baseline stream.c revision 5.10
    - any changes from the default size of 80,000,000 "double" elements per array
  - Compiler name/version and compiler options used
- Execution environment
  - OS type & version
  - OS configuration with respect to HugePages, Automatic NUMA Page Migration
  - Number of OpenMP threads requested
  - Execution environment with respect to binding of memory and binding of process threads (sometimes controlled by environment variables, sometimes by "launcher" programs)
- Results
  - Problem size and number of iterations output from the code
  - Performance output of the code (Copy, Scale, Add, Triad output lines)
  - Report on the number of executions of each configuration and (if run more than once) on the variability in the performance results
  - For cases with non-standard configurations (such as power-capping), externally measured values for average frequency sustained during code execution and actual power consumption during code execution are essential for understanding what is happening.
In server-class multicore processors with 4 or more DRAM channels, multiple cores are always required to reach maximum memory bandwidth, but the number of cores needed depends on the processor generation, the total DRAM bandwidth available, and whether or not streaming store instructions are used. (Also note that in some processor generations, streaming stores are silently converted to ordinary stores in four-socket or larger systems.)
STREAM performance depends on the number of cores used and the average frequency of those cores in non-trivial ways. Depending on the processor generation, I have seen significant (>20%) reductions in power consumption by running STREAM with fewer cores running at frequencies below the default (nominal) value.
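One common way to reason about why several cores are needed is Little's Law: sustained bandwidth equals the bytes in flight divided by the memory latency. The sketch below is only an illustration of that relationship. The latency, the number of outstanding cache-line misses per core, and the peak bandwidth are assumed placeholder values, not measurements from these systems, and hardware prefetching can raise the effective concurrency well above the simple per-core figure.

```python
import math

CACHE_LINE_BYTES = 64


def per_core_bandwidth_gbs(outstanding_misses: int, latency_ns: float) -> float:
    """Little's Law: bandwidth = bytes in flight / latency.
    Bytes per nanosecond is numerically equal to GB/s."""
    return outstanding_misses * CACHE_LINE_BYTES / latency_ns


def cores_to_saturate(peak_gbs: float, outstanding_misses: int,
                      latency_ns: float) -> int:
    """Smallest number of cores whose combined concurrency covers peak_gbs."""
    per_core = per_core_bandwidth_gbs(outstanding_misses, latency_ns)
    return math.ceil(peak_gbs / per_core)


if __name__ == "__main__":
    # Assumed values: 10 outstanding misses per core, 80 ns latency,
    # 128 GB/s peak DRAM bandwidth per socket.
    print(per_core_bandwidth_gbs(10, 80.0))    # GB/s one core can sustain
    print(cores_to_saturate(128.0, 10, 80.0))  # cores needed at that rate
```

With these assumed numbers a single core sustains 8 GB/s, so 16 cores would be needed to reach 128 GB/s; real systems differ in all three inputs.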
Thanks, John, for the detailed answer. I will provide all the details you asked for. Give me some time; I will reply soon with all the necessary details.
Hi @McCalpinJohn,
I have included all the details you need.
- Additional hardware configuration of the Xeon Gold 5318H system
  - DIMM configuration (Part Number: MTA18ASF4G72PDZ-3G2F1)
    - DIMM size: 32GB DDR4 SDRAM
    - Number of ranks per DIMM: 2 (2Rx8)
    - DRAM max frequency: 3200 MHz
    - Location of all DIMMs
      - Every socket has six channels (Channel code: 0 to 5), and each channel has two DIMM slots (DIMM code: 0 to 1)
      - Location of the two installed DIMMs at each socket (same location on all four sockets)
        - The first DIMM at Channel0_DIMM0
        - The second DIMM at Channel3_DIMM0
    - The Xeon Gold 5318H system has 6 DRAM channels/socket, but the reported memory is not divisible by 3. Why?
      - Every socket has two 32GB DIMMs (64GB per socket). It is the default configuration provided by the vendor. In your opinion, what would be the optimal number of 32GB DIMMs to install?
    - DRAM TDP per socket: 9W (combined TDP of the DIMMs installed on each socket)
  - Details on how power capping was controlled
    - Power capping is applied by using RAPL_POWER_LIMIT_MSR (0x610)
  - HyperThreading enabled/disabled: Enabled, but all threads are pinned to physical cores.
  - Turbo Frequency: Enabled
- Additional hardware configuration of the Xeon Gold 5418Y system
  - DIMM configuration (Part Number: M321R4GA0BB0-CQKET)
    - DIMM size: 32GB DDR5 SDRAM
    - Number of ranks per DIMM: 1 (1Rx4)
    - DRAM max frequency: 4800 MHz
    - Location of all DIMMs
      - Every socket has eight channels (Channel code: 0 to 7), and each channel has one DIMM slot (DIMM code: 0)
      - Location of the three installed DIMMs at each socket
        - The first DIMM at Channel0_DIMM0 (same location at both sockets)
        - The second DIMM at Channel2_DIMM0 (same location at both sockets)
        - The third DIMM at Channel4_DIMM0 (Channel6DIMM0 is used in another socket, i.e.,
    - The Xeon Gold 5418Y processor has 8 DRAM channels/socket, but the reported memory is divisible by 3. Why?
      - Every socket has three 32GB DIMMs (96GB per socket). It is the default configuration provided by the vendor. In your opinion, what would be the optimal number of 32GB DIMMs to install?
    - DRAM TDP per socket: 33W (combined TDP of the DIMMs installed on each socket)
  - Details on how power capping was controlled
    - Power capping is applied by using RAPL_POWER_LIMIT_MSR (0x610)
  - HyperThreading enabled/disabled: Enabled, but all threads are pinned to physical cores.
  - Turbo Frequency: Enabled
Further questions have been answered in the below reply as the text limit has been reached.
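For readers unfamiliar with the power-capping register mentioned above, here is a minimal sketch of how a raw 64-bit value of MSR 0x610 (the package RAPL power limit) can be decoded, following the bit layout in the Intel Software Developer's Manual. It does not touch hardware. The power-unit exponent is normally read from bits 3:0 of MSR 0x606; the default of 3 (1/8 W) used here is an assumption, as is the example raw value.

```python
def decode_pkg_power_limit(raw: int, power_unit_field: int = 3) -> dict:
    """Decode a raw MSR 0x610 value.

    power_unit_field is bits 3:0 of MSR 0x606; one power unit is
    1 / 2**power_unit_field watts (3 is an assumed, commonly seen value).
    """
    watts_per_unit = 1.0 / (1 << power_unit_field)
    return {
        # Power limit #1 (long-term): bits 14:0, enable bit 15, clamp bit 16
        "pl1_watts": (raw & 0x7FFF) * watts_per_unit,
        "pl1_enabled": bool((raw >> 15) & 1),
        "pl1_clamp": bool((raw >> 16) & 1),
        # Power limit #2 (short-term): bits 46:32, enable bit 47, clamp bit 48
        "pl2_watts": ((raw >> 32) & 0x7FFF) * watts_per_unit,
        "pl2_enabled": bool((raw >> 47) & 1),
        "pl2_clamp": bool((raw >> 48) & 1),
        # Bit 63 locks the register until the next reset
        "locked": bool((raw >> 63) & 1),
    }


if __name__ == "__main__":
    # Hypothetical raw value: PL1 = 125 W enabled, PL2 = 150 W enabled.
    example_raw = (0x84B0 << 32) | 0x83E8
    print(decode_pkg_power_limit(example_raw))
```

Checking the decoded limits and the lock bit before and after writing the register is one way to confirm that a requested cap was actually accepted.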
- Software
  - Benchmark code
    - any changes from the baseline stream.c revision 5.10
      - I ran all the experiments on stream.c revision
    - any changes from the default size of 80,000,000 'double' elements per array
      - I changed the size from the default to 100M and the iterations from 10 to 100.
  - Compiler name/version and compiler options used
    - GCC 10.5.0, compiled with -O3 -fopenmp
- Execution environment
  - OS type & version
    - Ubuntu 20.04 LTS
  - OS configuration with respect to HugePages, Automatic NUMA Page Migration
    - Hugepagesize: 2048 kB
    - Automatic NUMA Balancing: Enabled, but we are using numactl -i all
  - Number of OpenMP threads requested
    - 72 threads on the Xeon Gold 5318H system (4-socket machine)
    - 48 threads on the Xeon Gold 5418Y system (2-socket machine)
  - Execution environment with respect to binding of memory and binding of process threads (sometimes controlled by environment variables, sometimes by 'launcher' programs)
    - I launched each experiment using all physical cores available in the machine. I used GOMP_CPU_AFFINITY to bind all threads to physical cores and numactl -i all to enable interleaved page allocation.
Further questions have been answered in the reply below as the text limit has been reached.
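The binding setup described above can be sketched as a small helper that builds the environment and command line for one run. This is an illustration, not the poster's actual script: the binary path `./stream` is hypothetical, and the CPU list assumes that logical CPUs 0 to N-1 are one hardware thread per physical core, which is common on Linux but should be checked with lscpu on each machine.

```python
import shlex


def stream_launch(binary: str, physical_cores: int):
    """Return (env, argv) for an interleaved, core-pinned STREAM run."""
    env = {
        "OMP_NUM_THREADS": str(physical_cores),
        # libgomp accepts ranges such as "0-71"; thread i is bound to the
        # i-th CPU in the list (assumes CPUs 0..N-1 are distinct cores).
        "GOMP_CPU_AFFINITY": f"0-{physical_cores - 1}",
    }
    # "-i all" interleaves page allocation across all NUMA nodes.
    argv = ["numactl", "-i", "all", binary]
    return env, argv


def as_shell_line(env: dict, argv: list) -> str:
    """Render the launch as a single shell command for logging."""
    assignments = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"{assignments} {shlex.join(argv)}"


if __name__ == "__main__":
    # 72 physical cores on the 4-socket system, 48 on the 2-socket system.
    for cores in (72, 48):
        print(as_shell_line(*stream_launch("./stream", cores)))
```

Logging the exact command line with each result makes it easier to compare runs made under different binding policies.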
Results
- Problem size and number of iterations output from the code
  - /*********Output from code********/
  - Array size = 100000000 (elements), Offset = 0 (elements)
  - Memory per array = 762.9 MiB (= 0.7 GiB).
  - Total memory required = 2288.8 MiB (= 2.2 GiB).
  - Each kernel will be executed 100 times.
  - /************************************/
- Performance output of the code (Copy, Scale, Add, Triad output lines)
  - Output on Xeon Gold 5318H system

    -------------------------------------------------------------
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:          115099.7     0.014286     0.013901     0.016175
    Scale:          92615.0     0.017340     0.017276     0.017871
    Add:           101803.5     0.023628     0.023575     0.024091
    Triad:         102617.1     0.023490     0.023388     0.024139
    -------------------------------------------------------------

  - Output on Xeon Gold 5418Y system

    -------------------------------------------------------------
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:           34175.6     0.047251     0.046817     0.047635
    Scale:          29490.9     0.054459     0.054254     0.054700
    Add:            34862.4     0.068916     0.068842     0.069170
    Triad:          34842.7     0.068970     0.068881     0.069075
    -------------------------------------------------------------

- Report on the number of executions of each configuration and (if run more than once) on the variability in the performance results
  - I ran 10 invocations of each configuration on both systems, and the deviation in the results is negligible.
- For cases with non-standard configurations (such as power-capping), externally measured values for average frequency sustained during code execution and actual power consumption during code execution are essential for understanding what is happening.
  - I am attaching three timeline graphs (core frequency, uncore frequency, and power consumption) for both systems. Legends in graphs represent different powercap settings (all values are w.r.t. system TDP).
    - Avg core frequency during STREAM execution on the 5318H system
    - Avg uncore frequency during STREAM execution on the 5318H system
    - Avg power consumption during STREAM execution on the 5318H system
    - Avg core frequency during STREAM execution on the 5418Y system
    - Avg uncore frequency during STREAM execution on the 5418Y system
    - Avg power consumption during STREAM execution on the 5418Y system
I have tried to add all the details you asked for. Please let me know if you need more information.
Regards,
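As a sanity check on the tables above, the "Best Rate" column can be recomputed from the array size and the minimum time, using the byte counts STREAM assigns to each kernel (two arrays for Copy and Scale, three for Add and Triad). The sketch below is a small illustration using the reported numbers. Note that STREAM counts only the explicit reads and writes, so any extra write-allocate traffic is not included in these rates.

```python
BYTES_PER_DOUBLE = 8
# Number of arrays each kernel reads or writes, as counted by STREAM.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mbs(kernel: str, elements: int, min_time_s: float) -> float:
    """Best rate in MB/s (1 MB = 10**6 bytes), as STREAM reports it."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_DOUBLE * elements
    return 1e-6 * bytes_moved / min_time_s


if __name__ == "__main__":
    n = 100_000_000
    # Minimum times taken from the two result tables in the post.
    reported = {
        "5318H": {"Copy": 0.013901, "Scale": 0.017276,
                  "Add": 0.023575, "Triad": 0.023388},
        "5418Y": {"Copy": 0.046817, "Scale": 0.054254,
                  "Add": 0.068842, "Triad": 0.068881},
    }
    for system, times in reported.items():
        for kernel, t in times.items():
            print(f"{system} {kernel}: {stream_rate_mbs(kernel, n, t):.1f} MB/s")
```

The recomputed values agree with the reported rates to within the rounding of the printed minimum times.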
Hi @McCalpinJohn,
Additionally, I would like to add the following points to my previous question:
1. The systems mentioned above do not have Intel Optane Memory; however, they do support Optane memory. Is it necessary to populate the Intel Optane modules also?
2. In the 5318H system, RAM is populated only in two DRAM channels out of six per socket, and in the 5418Y system, RAM is populated only in three DRAM channels out of eight per socket. As far as I understand, this indicates limited memory bandwidth utilization in both systems. Due to this limited bandwidth, I do not observe any change in performance for memory-bound applications when altering the power cap of the systems. Could you please advise on the number of RAM sticks needed to optimize these systems so that they work for memory-bound applications as well?
3. Could you also explain the difference between the two measurement units of DRAM speed: megahertz (MHz) and mega transfers per second (MT/s)?
4. What are the benefits of using dual-rank memory over single-rank memory (1R vs. 2R)?
5. What are the criteria for choosing a suitable data width of memory for our system (2Rx4 vs. 2Rx8 or 1Rx4 vs 1Rx8)?
Additionally, please let me know if points 4 and 5 depend on the processor model. If so, what would be suitable for the systems mentioned above?
Thanks!
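To make the channel-population and MHz versus MT/s questions above concrete, the sketch below computes the theoretical peak DRAM bandwidth from the number of populated channels and the transfer rate. DDR memory transfers data on both clock edges, so the I/O clock in MHz is half the MT/s figure, and each channel moves 8 bytes per transfer. The calculation assumes the DIMMs actually run at their rated speed (3200 and 4800 MT/s); the platform may run them slower, so the real speed should be checked in the BIOS or with dmidecode.

```python
BYTES_PER_TRANSFER = 8  # 64 data bits per channel


def io_clock_mhz(mega_transfers_per_s: float) -> float:
    """DDR transfers twice per clock, so clock (MHz) = MT/s / 2."""
    return mega_transfers_per_s / 2


def peak_bandwidth_gbs(sockets: int, channels_per_socket: int,
                       mega_transfers_per_s: float) -> float:
    """Theoretical peak in GB/s (10**9 bytes/s) over all populated channels."""
    channels = sockets * channels_per_socket
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000


if __name__ == "__main__":
    # (sockets, populated channels, total channels, assumed MT/s)
    systems = {"5318H": (4, 2, 6, 3200), "5418Y": (2, 3, 8, 4800)}
    for name, (sockets, populated, total, rate) in systems.items():
        now = peak_bandwidth_gbs(sockets, populated, rate)
        full = peak_bandwidth_gbs(sockets, total, rate)
        print(f"{name}: I/O clock {io_clock_mhz(rate):.0f} MHz, "
              f"populated {now:.1f} GB/s, fully populated {full:.1f} GB/s")
```

Under these assumptions, populating every channel would raise the theoretical peak by 3x on the 5318H system and by about 2.7x on the 5418Y system; sustained STREAM rates are always lower than the theoretical peak.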
Thanks @McCalpinJohn for your detailed answers,
I had also assumed the cause you described for this behaviour. However, I have run memory-intensive applications (including STREAM) in two other configurations and observed the same behaviour across different powercap limits. The two other configurations are:
- I ran the application on one socket, using all physical cores of that socket, and bound memory allocation to the local DRAM using numactl -m $socket_id.
- I ran the application on all cores of all four sockets without numactl (the default policy used by Linux).
In both cases, the performance of the application remains constant when we change the powercap from the minimum to the maximum of its range.
Regards,
For microbenchmark codes designed to have their performance limited by a single component of the hardware, why would you expect performance to vary with power consumption? At very low power (CPU frequency) levels there may not be enough processor concurrency to saturate the UPI or DRAM interfaces, but once the performance reaches the limit, adding more power (frequency) does not change the bottleneck -- it just causes the cores to spend more cycles stalled while waiting for the UPI or DRAM transfers.
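The saturation argument above can be illustrated with a toy model in which the sustained bandwidth is the smaller of a core-limited rate, which grows with frequency, and a fixed memory-interface limit. This is a deliberately simplified sketch: the core count, the bytes each core can request per cycle, and the 100 GB/s interface limit are hypothetical values chosen only to show the plateau.

```python
def sustained_bandwidth_gbs(freq_ghz: float, cores: int,
                            bytes_per_cycle_per_core: float,
                            memory_limit_gbs: float) -> float:
    """Toy model: the slower of the cores and the memory interface wins."""
    core_limited = freq_ghz * cores * bytes_per_cycle_per_core
    return min(core_limited, memory_limit_gbs)


if __name__ == "__main__":
    # Hypothetical machine: 24 cores, 4 bytes/cycle/core, 100 GB/s limit.
    for freq in (0.8, 1.2, 1.6, 2.0, 2.4):
        bw = sustained_bandwidth_gbs(freq, 24, 4.0, 100.0)
        print(f"{freq:.1f} GHz -> {bw:.1f} GB/s")
```

In this model the bandwidth rises only at the lowest frequency and is flat afterwards, which matches the qualitative behaviour described: once the interface is saturated, a higher power cap buys more stalled cycles rather than more throughput.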
Hi @McCalpinJohn,
Would performance increase if more DIMMs were added to populate all the empty memory channels?
 
					
				
				
			
		