Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Intel® Memory Latency Checker v3.5 test

Jonas_T_1
Beginner

Hi Sir:

We have a Purley platform project that runs the MLC test tool.
Our customer's test criterion is: [ Run MLC Benchmark to measure Memory B/W - Total B/W should be at least 90% of theoretical max. B/W ]
The test result is below.
We would like to know whether the memory bandwidth result is really too low.
What is a reasonable value for this test?
Thanks.
--------------------SYS config------------------------------------------------
CPU : Intel Skylake 26C 2.6GHz 205W QN5E LGA x 2 pcs
Memory : Samsung   M393A8K40B22-CWD    Speed: 2666 MHz    Size: 64 GB  x 24 pcs .  
PCH :    C627       
---------------------MLC Test result -----------------------------------------                
 [root@localhost ~]# ./mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --bandwidth_matrix 

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Numa node
Numa node         0         1    
       0    103029.6    34396.5    
       1    34412.8        102846.6    

-----------------Test result -----------------------------------------------------
(2666 * 6 * 8)/1024 = 124.97 GB/s
103.029/124.97 =  82.44 % < 90 %
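(For reference, the same arithmetic can be reproduced with bc in the style used later in this thread. Note that MLC reports bandwidth in MB/sec with 1 MB = 1,000,000 bytes, so the peak figure above mixes decimal and binary prefixes; these one-liners simply reproduce the calculation as written.)

# reproduce the check above with bc
echo "2666*8*6/1024" | bc -l          # ~124.97, the per-socket peak figure used above
echo "103029.6/1000/124.97" | bc -l   # ~0.8244 -> 82.44% of that peak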

McCalpinJohn
Honored Contributor III

Several DRAM stall conditions are associated with the "shared bus" architecture used by DRAMs. 

It looks like your system has two DIMMs per channel.  This configuration incurs a stall when switching between DIMMs (to allow the bus to settle and to ensure clean separation in time between signals driven from different physical locations on the bus).  

The magnitude of this effect is hard to estimate because Intel provides very little information about how much the memory controller is able to reorder the memory accesses to minimize stalls, and Intel no longer publishes information about the memory controller timing parameters in the processor datasheets.

It is usually possible to reach 90% DRAM bus utilization with an Intel processor, a read-only test pattern, and one DIMM per channel.  Adding the second DIMM to each channel reduces the efficiency because of the "DIMM to DIMM turnaround stalls".  This effect tends to be strongest in the read-only case (since many of the other stall conditions don't occur).   

You may be able to get a small improvement by using MLC options to switch to the AVX-512 instruction set and to switch to using no more than 1 thread per core, but it probably won't increase the value from 82.4% to 90%.

For a mainstream Intel processor, you should be able to get >=80% DRAM bus utilization for most memory access patterns on most memory configurations (excluding the case with one single-rank DIMM per channel), though you may need to fiddle with the instruction set and the number of cores used, and I would not be surprised to see some cases delivering slightly below 80%. 

On the Purley platform with one dual-rank DIMM per channel, I usually see the best bandwidth using about 16 cores (one thread per core) and with the AVX-512 instruction set, but this is something you should probably test in your environment.
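(A rough sketch of those suggestions, using the option letters described later in this thread -- the AVX-512 case needs the separate mlc_avx512 binary, and exact spellings should be confirmed with ./mlc --help on your version:)

# one thread per core
./mlc --bandwidth_matrix -X
# one thread per core, AVX2 (256-bit) loads
./mlc --bandwidth_matrix -X -Y
# one thread per core, AVX-512 loads (requires the mlc_avx512 binary)
./mlc_avx512 --bandwidth_matrix -X -Z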

 

 

 

 

Jonas_T_1
Beginner

Hi Sir:

In our customer's test plan, we also need to run the MLC test tool as a stress test.

Test criteria : Run Memory stress tests and look at /var/log/messages, mcelogs for any MCE errors. MLC total Memory B/W should be 85% or above theoretical B/W.

Test config :

CPU : Intel Skylake 26C 2.6GHz 205W QN5E LGA x 2 pcs

Memory : Samsung   M393A8K40B22-CWD    Speed: 2666 MHz    Size: 64 GB  x 24 pcs

PCH :    C627 

The test result only reaches 74%.

Is this test result normal?

Thanks.

 

MLC Test result :

Intel(R) Memory Latency Checker - v3.5

Command line parameters: --bandwidth_matrix -t1000

 

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes

Measuring Memory Bandwidths between nodes within system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type

        Numa node
Numa node         0         1    
       0    94760.9    34385.6    
       1    34405.2    94896.0    

McCalpinJohn
Honored Contributor III

The 94.8 GB/s values are lower than I would expect, but I don't have any SKX nodes with 2 DIMMs per channel for comparison.

It is certainly not implausible that this is the "correct" level of performance for 2 DIMMs per channel -- it all depends on how effectively the memory controller is able to avoid the DIMM-to-DIMM stalls.  As far as I can tell, the specific duration of a DIMM-to-DIMM stall is not published (it is probably tuned by the BIOS during the boot process), but historically these values have been in the range of 4-5 ns.  For DDR4/2666, the major clock is 1.333 GHz, giving a 0.75 ns cycle time.  A cache line transfer takes 4 major cycles, or 3.00 ns.  So the stall time for switching ranks is greater than the cache line transfer time, meaning that switching DIMMs on every read would give less than 50% performance (e.g., for a 4 ns DIMM-to-DIMM stall with stalls before every read, the sustained bandwidth would be 3/(3+4)=42.8% of peak).    If the DIMM select was completely random, one would expect a DIMM switch approximately half of the time, so the average overhead would be about half, giving 3/(3+4/2)=60% of peak.   To get to 74% of peak, the memory controller has to rearrange accesses to eliminate 1/2 of the DIMM-to-DIMM stalls that would occur with random DIMM selection.   Is it reasonable to expect more than this?   I suspect that this is not a trivial question to answer even for members of the memory controller design team....
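(The arithmetic in that paragraph can be checked quickly with bc; the 4 ns DIMM-to-DIMM stall is the assumed value discussed above.)

# cache line transfer: 4 major cycles at 0.75 ns
echo "4*0.75" | bc -l       # 3.00 ns
# stall before every read, assuming a 4 ns DIMM-to-DIMM stall
echo "3/(3+4)" | bc -l      # ~0.43 of peak
# random DIMM selection -> a stall about half of the time
echo "3/(3+4/2)" | bc -l    # 0.60 of peak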

Have you run this test on more than one node?  (If they all get the same result, it is probably the right answer....)

On a 24-core Xeon Platinum 8160 with 1 dual-rank DDR4/2666 DIMM per channel, I get the following results with various combinations of the "-X" (use only one thread per core), "-Y" (use AVX-256 instructions), and "-Z" (use AVX-512 instructions -- requires the "mlc_avx512" binary):

log.8160.R.bandwidth_matrix:Numa node         0         1    
log.8160.R.bandwidth_matrix-       0    111596.0    34335.7    
log.8160.R.bandwidth_matrix-       1    34361.5    111694.6    
--
log.8160.R.bandwidth_matrix.X:        Numa node
log.8160.R.bandwidth_matrix.X:Numa node         0         1    
log.8160.R.bandwidth_matrix.X-       0    114084.7    34384.3    
log.8160.R.bandwidth_matrix.X-       1    34393.6    113458.3    
--
log.8160.R.bandwidth_matrix.XY:        Numa node
log.8160.R.bandwidth_matrix.XY:Numa node         0         1    
log.8160.R.bandwidth_matrix.XY-       0    112350.5    34404.1    
log.8160.R.bandwidth_matrix.XY-       1    34429.2    112294.0    
--
log.8160.R.bandwidth_matrix.XZ:        Numa node
log.8160.R.bandwidth_matrix.XZ:Numa node         0         1    
log.8160.R.bandwidth_matrix.XZ-       0    111992.2    34423.9    
log.8160.R.bandwidth_matrix.XZ-       1    34439.2    111760.0    
--
log.8160.R.bandwidth_matrix.Y:        Numa node
log.8160.R.bandwidth_matrix.Y:Numa node         0         1    
log.8160.R.bandwidth_matrix.Y-       0    111669.5    34370.5    
log.8160.R.bandwidth_matrix.Y-       1    34396.8    110972.0    
--
log.8160.R.bandwidth_matrix.Z:        Numa node
log.8160.R.bandwidth_matrix.Z:Numa node         0         1    
log.8160.R.bandwidth_matrix.Z-       0    110938.5    34372.2    
log.8160.R.bandwidth_matrix.Z-       1    34394.0    110518.8    

In this (Read-only traffic type) case there is little difference in performance across this set of XYZ options -- the lowest value (110.5 GB/s) is 86.3% of peak and the highest value (114.1 GB/s) is 89.1% of peak.

The differences are slightly larger with mixes of reads and writes, and much larger (~10% range) with non-temporal stores (traffic types W7, W8, W10).

Jonas_T_1
Beginner

Hi Sir:

We ran the test on other nodes, and the test results are the same.

So it is caused by the CPU memory controller's efficiency, right?

Thanks.

 

McCalpinJohn
Honored Contributor III
I just noticed that your second set of results was much slower than the first set, but the only thing that appears different is the longer run time ("-t1000") in the second case. It is possible that this test was getting throttled by power or thermal constraints (which is, of course, one of the reasons to run stress tests....). Two things to consider:

(1) I would double-check the actual DRAM bus frequency before making any final conclusions. The system should be able to run two DIMMs per channel at full speed (DDR4/2666), but it is not impossible that the frequency could be reduced. Recent Linux versions should support the uncore DRAM cycle counter. I used the following command:

perf stat -a -A -e uncore_imc_0/clockticks/ ./mlc --bandwidth_matrix

The results included:

CPU0     102,693,084,167   uncore_imc_0/clockticks/
CPU24    102,659,835,631   uncore_imc_0/clockticks/
      77.901356954 seconds time elapsed

Dividing the counts by the elapsed time gives 1,318,245,126.7, which is about 1% below the expected value of 1,333,333,333, so I conclude that my DRAMs are actually running at full speed. (This value can't change on a "live" system, so it only needs to be checked once....)

(2) The "perf stat" command can also report energy usage for the package and for the DRAMs, but I don't think that it has an interface to read the amount of time that the processor frequency was throttled due to package or DRAM power limitations. This information can be read from the RAPL MSRs using the /dev/cpu/*/msr devices. Information is in Section 14.9.5 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-068, November 2018), and a very useful example code is available at http://web.eece.maine.edu/~vweaver/projects/rapl/
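(A minimal sketch of reading those RAPL throttle registers directly with the rdmsr utility from msr-tools; the MSR addresses and bit field below are taken from SDM Section 14.9 and should be verified against your processor's documentation.)

modprobe msr
# RAPL time unit: bits 19:16 of MSR_RAPL_POWER_UNIT (0x606); one unit = 1/2^value seconds
rdmsr -p 0 -f 19:16 -d 0x606
# accumulated power-limit throttled time, in RAPL time units:
rdmsr -p 0 -d 0x613    # MSR_PKG_PERF_STATUS  (package throttling)
rdmsr -p 0 -d 0x61b    # MSR_DRAM_PERF_STATUS (DRAM throttling)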
Jonas_T_1
Beginner

Hi Sir:

Now we have another question.

We ran the MLC test with 128GB DDR4 DIMMs.

The test result is lower than with the 64GB DDR4 DIMMs.

Why can the test result not be as good as with the 64GB DDR4 DIMMs?

Thanks.

Note :

  1. We ran the MLC test on 5 pcs of MB, and the test results are the same.
  2. The MB & BIOS firmware & CPU are the same.
  3. The only difference is the DDR4 memory.

 

----------- SYSTEM CONFIG ----------------------------------

CPU : Intel 8171M ( 2 pcs in the two socket MB )

DDR4 : Samsung M393A2K40BB2-CTD ( 24 pcs )

-------MLC Test result ------------------------------------------------------

Intel(R) Memory Latency Checker - v3.6

Command line parameters: --bandwidth_matrix 

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes

Measuring Memory Bandwidths between nodes within system 

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type        

        Numa node
Numa node         0         1    
       0    88863.4    34387.7    
       1    34405.3    89002.8    

McCalpinJohn
Honored Contributor III

I apologize for not looking more closely at the model numbers of the DIMMs you installed in your system.

The first note in this sequence gives a system configuration of

--------------------SYS config------------------------------------------------
CPU : Intel Skylake 26C 2.6GHz 205W QN5E LGA x 2 pcs
Memory : Samsung   M393A8K40B22-CWD    Speed: 2666 MHz    Size: 64 GB  x 24 pcs .  

The M393A8K40B22-CWD is a quad-rank DIMM using 3D stacking (3DS).  This is not a "load-reduced DIMM" (LRDIMM), so I don't think that the processor can maintain full DDR4 bus speed with two of these installed in each channel.  It is difficult to find published numbers for what the frequency should be -- probably because this is a function of both the processor and the motherboard, so it is up to the system manufacturer to validate how fast their DDR4 interfaces can run for each DIMM configuration.   Your Read-only bandwidth results of up to 103 GB/s correspond to ~89% of peak at DDR4/2400, which is a plausible frequency for that configuration.  Again, the easiest way to check the frequency is using the "perf stat" approach I showed in my note of 2018-12-03.

The second configuration mentioned is:

----------- SYSTEM CONFIG ----------------------------------
CPU : Intel 8171M ( 2 pcs in the two socket MB )
DDR4 : Samsung M393A2K40BB2-CTD ( 24 pcs )

The Samsung M393A2K40BB2-CTD is a *single-rank* 16 GiB DIMM.   With two of these DIMMs installed in each channel, the processor should be able to maintain full frequency -- but it should still be checked with the "perf stat" command.   There are two potential performance issues here: (1) this configuration has the same number of ranks as my system, but the ranks are in separate DIMM slots, so there will be rank-to-rank stalls when the DIMM driving the bus switches.  When I tested a similar configuration (two single-rank DIMMs per channel) on a Xeon E5-2690 v3 (Haswell) system, I found negligible performance degradation for the "Read-only" traffic type compared to having one single-rank DIMM per channel, but the memory controller in SKX may not handle this case as well.  (2) Each "rank" of DDR4 memory has 16 "banks" divided into 4 "bank groups".  For contiguous memory accesses, performance is maximized if the number of memory access streams does not exceed the total number of "banks" available.  For the first system configuration (two quad-rank DIMMs per channel), there are 8 ranks providing 128 banks.  For the second system, there are two ranks providing 32 banks.  The Intel Memory Latency Checker uses all logical processors by default, so 52 threads per socket for your 26-core processor.  Each thread generates at least one memory access stream, and 52 memory access streams can't fit into 32 banks without conflicts.  Once you start getting conflicts, the memory controller must open and close each bank multiple times to get data from the two conflicting address streams, and this introduces several different stall conditions.   This becomes much worse if there are any writes in the mix, but it reduces performance for reads as well. 

If your system is booted with HyperThreading enabled, the Intel Memory Latency Checker will use all logical processors by default.  You can override this with the "-X" command-line option.   In addition, you can change the instruction set using the "-Y" and "-Z" options.   I typically run all six combinations available with these options.  You may also be able to run the Intel Memory Latency Checker using fewer cores, but I don't know if that is supported in the "--bandwidth_matrix" test.  For my Xeon Platinum 8160 processors with one dual-rank DIMM per channel, I typically get the best memory read performance (by a few percent) using 14-18 of the 24 cores on each chip.

After all of that, I have to say that the performance numbers for your last test are much lower than I would have expected.  I would definitely check the DRAM frequency using "perf stat", and would then add the memory controller read and write performance counter events to make sure that the accesses are properly interleaved across the channels.
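(A sketch of that combined check. The cas_count_read/cas_count_write event aliases are an assumption -- verify the names under /sys/bus/event_source/devices/uncore_imc_0/events/ on your kernel.)

# DRAM clock plus read/write CAS counts for one channel, alongside the MLC run
perf stat -a -A \
    -e uncore_imc_0/clockticks/,uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ \
    ./mlc --bandwidth_matrix
# repeat for uncore_imc_1 ... uncore_imc_5 to confirm the traffic is spread
# evenly across all six channels of each socket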

Jonas_T_1
Beginner

Hi Sir:

I am sorry that I wrote the wrong part number yesterday:

The DDR4 memory is not Samsung M393A2K40BB2-CTD.

It is Samsung M393AAK40B42-CWD and Micron 144ASQ16G72PSZ-2S6E1.

The test results are below.

Neither is an LRDIMM; are the test result values too low, or reasonable?

Thanks.

 

------System config 1  -----------------

CPU : Intel 8171M ( 2 pcs in the two socket MB )
DDR4 : Samsung M393AAK40B42-CWD ( 24 pcs )

-----MLC test result --------------------

Intel(R) Memory Latency Checker - v3.6
Command line parameters: --bandwidth_matrix 

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Numa node
Numa node         0         1    
       0    88939.0    34396.8    
       1    34408.5    89054.9    

 

------System config 2  -----------------

CPU : Intel 8171M ( 2 pcs in the two socket MB )
DDR4 : Micron 144ASQ16G72PSZ-2S6E1 ( 24 pcs )

-----MLC test result --------------------

Intel(R) Memory Latency Checker - v3.6
Command line parameters: --bandwidth_matrix 

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Numa node
Numa node         0         1    
       0    88835.7    34396.0    
       1    34409.8    88963.6    

 

 

McCalpinJohn
Honored Contributor III

Both of those DIMMs are 8-rank parts, with two physical ranks, each using 4-high stacked die (3DS) to provide 4 "logical ranks" per physical rank.

I have never worked with these devices before, so I don't know what to expect.   They are reported to place only a single load on the bus for each rank, so having two per channel should not be a serious frequency problem, but DRAM frequency should always be measured and not guessed.
As a check on my Xeon Platinum 8160 system, I just ran a simple test:

$ export OMP_NUM_THREADS=1
$ export KMP_AFFINITY=compact
$ perf stat -a -A -e uncore_imc_0/clockticks/ ./stream.avx2.nta.80M.50x
-------------------------------------------------------------
STREAM version $Revision: 5.11 $
-------------------------------------------------------------
[...]

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:       10399.6581       0.1239       0.1231       0.1252
Scale:      10625.8468       0.1222       0.1205       0.1232
Add:        13191.8823       0.1463       0.1455       0.1474
Triad:      13300.8240       0.1448       0.1444       0.1461
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------


 Performance counter stats for 'system wide':

CPU0        37,129,923,471      uncore_imc_0/clockticks/                                    
CPU24       37,129,984,239      uncore_imc_0/clockticks/                                    

      27.914096333 seconds time elapsed

$ echo 37129923471/27.91406933/1000/1000/1000  | bc -l
1.33015086521603906899

The count of ~37B DRAM clockticks in 27.9 seconds corresponds to 1.330 GHz -- extremely close to the expected value of 1.333 GHz for DDR4/2666 DRAM.

Running the same revision of the Intel Memory Latency Checker on my Xeon Platinum 8160 processor with one 16GiB dual-rank DDR4/2666 DIMM per channel, I get almost identical "remote" bandwidth values, but my local bandwidth values are much higher (111.5 GB/s to 112.6 GB/s).   Your performance of ~88.9 GB/s is almost exactly 20% lower -- a tiny bit below what I would expect for DDR4/2133.   There are, of course, many possible reasons for performance loss, but if you don't check the actual frequency on your system there is very little that I can do to help understand what is happening....  

The command "perf stat -a -A -e uncore_imc_0/clockticks/ ./mlc --bandwidth_matrix" will put all the information in one place.

Jonas_T_1
Beginner

Hi Sir:

I ran the command: perf stat -a -A -e uncore_imc_0/clockticks/ ./mlc --bandwidth_matrix

The test results are below.

Both systems measure 1.330 GHz.

We need your support to understand what is happening in this MLC test.

Thanks.

 

------System config 1  -----------------

CPU : Intel 8171M ( 2 pcs in the two socket MB )

DDR4 : Samsung M393AAK40B42-CWD ( 24 pcs )

-----MLC test result --------------------

Intel(R) Memory Latency Checker - v3.6

Command line parameters: --bandwidth_matrix

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes

Measuring Memory Bandwidths between nodes within system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type

        Numa node
Numa node         0         1    
       0    88884.8    34395.7    
       1    34407.8    90026.6    

 

 Performance counter stats for 'system wide':

CPU0        79,780,054,317      uncore_imc_0/clockticks/                                   

CPU26       79,780,052,726      uncore_imc_0/clockticks/                                   

      59.976072517 seconds time elapsed

 

------System config 2  -----------------

CPU : Intel 8171M ( 2 pcs in the two socket MB )

DDR4 : Micron 144ASQ16G72PSZ-2S6E1 ( 24 pcs )

-----MLC test result --------------------

Intel(R) Memory Latency Checker - v3.6

Command line parameters: --bandwidth_matrix

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes

Measuring Memory Bandwidths between nodes within system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type

        Numa node
Numa node         0         1    
       0    88991.9    34393.8    
       1    34406.8    89053.4    

 

 Performance counter stats for 'system wide':

CPU0        81,331,792,208      uncore_imc_0/clockticks/                                   

CPU26       81,331,792,805      uncore_imc_0/clockticks/                                   

      61.142598532 seconds time elapsed

 

 

McCalpinJohn
Honored Contributor III

So you have confirmed that you have full frequency on the interfaces; what other possibilities are there?

I can think of a few, unfortunately....

  1. These are really big DIMMs and might require more power than your motherboard can deliver.
  2. These DIMMs have a lot of ranks, and might require operation in closed page mode.
  3. These 3DS DIMMs are a relatively new technology, and your BIOS might not be configuring the interleave optimally.

None of my systems have DRAM power throttling configured, so I have never seen it happen before.  In my CentOS 7.6 systems, the /sys/class/powercap/intel-rapl/ interfaces can be used to read the DRAM energy usage, but the RAPL interface that records accumulated throttle time does not appear to have been implemented.   You can read the DRAM energy use using "perf stat -a -A -e power/energy-ram/" and then divide by time to get the average DRAM power.   You can compare this to the maximum DRAM power supported by the processor, which should be in /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_max_power_uw.   On my systems that is 102.75W, which agrees with my own code that interprets the RAPL registers.   The "enabled" file in the same directory will tell you if DRAM power-capping is enabled on your system -- on mine this returns 0.
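(Putting those checks together as a sketch -- the intel-rapl:0:0 path below is the DRAM domain on my system; the numbering is system-dependent, so confirm which intel-rapl:* node reports "dram" first.)

# average DRAM power: Joules reported by perf divided by elapsed seconds
perf stat -a -A -e power/energy-ram/ ./mlc --bandwidth_matrix
# DRAM power-cap limit and whether capping is enabled (0 = disabled)
cat /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/name
cat /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_max_power_uw
cat /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/enabled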

If the system is operating in closed page mode, it may be more sensitive to memory controller scheduling issues.  Unfortunately this requires access to configuration information that Intel typically hides, or performance counters that don't appear to be supported by "perf stat".

Jonas_T_1
Beginner

Hi Sir:

I ran the command: perf stat -a -A -e power/energy-ram/ ./mlc --bandwidth_matrix

The test result is below.

The average DRAM power = 54.978 W.

Then I checked /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_max_power_uw

and the "enabled" file in the same directory.

It looks like there is no memory throttling.

Is there anything else that I need to check?

Thanks.

[root@localhost intel-rapl:0:0]# cat /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_max_power_uw

205500000

[root@localhost intel-rapl:0:0]# cat enabled

0

-----------MLC Test result -----------------

Intel(R) Memory Latency Checker - v3.6

Command line parameters: --bandwidth_matrix

 

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes

Measuring Memory Bandwidths between nodes within system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type

        Numa node
Numa node         0         1    
       0    76588.8    34371.7    
       1    34380.8    76553.1    

 Performance counter stats for 'system wide':

CPU0              3,360.49 Joules power/energy-ram/                                          

CPU26             3,360.43 Joules power/energy-ram/                                          

      61.114398546 seconds time elapsed

Jonas_T_1
Beginner

Hi Sir:

Regarding your points 2 and 3, we will check the memory setting items in the BIOS setup menu.

We will then try it.

Thanks.

McCalpinJohn
Honored Contributor III

It is probably a good idea to check to see if the RAPL powercap interfaces are numbered the same on your system....

cd /sys/class/powercap
for DEV in intel-rapl:*; do echo -n "$DEV "; cat $DEV/name; done
for DEV in intel-rapl:*; do echo -n "$DEV "; cat $DEV/constraint_0_max_power_uw; done

I get these outputs:

intel-rapl:0 package-1
intel-rapl:0:0 dram
intel-rapl:1 package-0
intel-rapl:1:0 dram

intel-rapl:0 150000000
intel-rapl:0:0 102750000
intel-rapl:1 150000000
intel-rapl:1:0 102750000

Your value of 205.5W is 2x my DRAM value, so it is probably correct, but it is good to be sure.

There are lots of performance counters in the memory controller that can be used to investigate why the performance is so much lower than expected, but they are not easy to use and there is little in the way of documentation or validation (or understanding how the "logical ranks" in a 3DS DIMM map to "chip select" ranks tracked by the IMC performance counters).

Although it is certainly overkill for this purpose, I would probably start with my https://github.com/jdmccalpin/periodic-performance-counters and set up a series of tests to run a sweep over a large number of IMC performance counter events.   The events are described in Section 2.3 of the "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Guide" (document 336274-001, July 2017).   In the VTune 2018 installation directory there are "experimental" configuration files that can be used with the "emon" utility (also part of VTune) to look at many of these events.  On my system the default configuration file supports only 16 IMC events, but the "experimental" version supports 393 events.

$ wc -l /opt/intel/vtune_amplifier_*/config/sampling/skylake_server_imc*
  396 /opt/intel/vtune_amplifier_2018.4.0.574913/config/sampling/skylake_server_imc_db_experimental.txt
   20 /opt/intel/vtune_amplifier_2018.4.0.574913/config/sampling/skylake_server_imc_db.txt

# Copying the "experimental" file to the default file name allows emon to access the "experimental" events
$ cd /opt/intel/vtune_amplifier_2018.4.0.574913/config/sampling/
$ mv skylake_server_imc_db.txt skylake_server_imc_db_original.txt
$ cp -a skylake_server_imc_db_experimental.txt skylake_server_imc_db.txt

I would focus on the following areas (a sketch of a perf-based starting point for the normalization counters follows the list below).  The number in parentheses is the number of events in each group (for each memory controller channel).   Generally I prefer to run a workload like "STREAM", for which I can estimate the read and write counts in advance -- I am not sure if the Intel Memory Latency Checker does a fixed amount of work, but the UNC_M_CAS_COUNT.RD event should be stable from run to run if the code is doing a fixed amount of work (rather than running for a fixed amount of time).  The power and throttling events should be divided by UNC_M_CLOCKTICKS_F, and should be very close to zero.

  • Reference data for normalization of counts
    • UNC_M_CLOCKTICKS_F (fixed function counter -- can be used with any set of 4 programmable counter events)
    • UNC_M_CAS_COUNT.RD (1)
    • UNC_M_CAS_COUNT.WR (1)
  • Power and throttling events
    • UNC_M_POWER_CRITICAL_THROTTLE_CYCLES (1)
    • UNC_M_POWER_THROTTLE_CYCLES.RANK* (8)
    • UNC_M_POWER_PCU_THROTTLING (1)
  • Distribution of accesses across ranks and banks
    • UNC_M_RD_CAS_RANK*.ALLBANKS (8)
    • UNC_M_RD_CAS_RANK*.BANKG* (32)
    • UNC_M_RD_CAS_RANK*.BANK* (128)
  • Page open/close/conflict activity
    • UNC_M_ACT_COUNT.* (3)
    • UNC_M_PRE_COUNT.* (3)
  • Miscellaneous
    • UNC_M_ECC_CORRECTABLE_ERRORS (1)
    • UNC_M_DRAM_REFRESH.* (2)
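A possible starting point before moving to a full emon event sweep (the cas_count_read/cas_count_write aliases and the six-channel numbering are assumptions -- check /sys/bus/event_source/devices/ for the uncore_imc_* PMUs actually exposed on your system):

# per-channel DRAM clocks and CAS counts for all six IMC channels of each socket
EVENTS=""
for i in 0 1 2 3 4 5; do
    EVENTS="${EVENTS}uncore_imc_${i}/clockticks/,uncore_imc_${i}/cas_count_read/,uncore_imc_${i}/cas_count_write/,"
done
perf stat -a -A -e "${EVENTS%,}" ./mlc --bandwidth_matrix
# divide each count by the corresponding clockticks value to normalize per DRAM cycle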
Jonas_T_1
Beginner

Hi Sir:

An update on the test status:

1.

[root@localhost powercap]# for DEV in intel-rapl:*; do echo -n "$DEV "; cat $DEV/name; done

intel-rapl:0 package-1

intel-rapl:0:0 dram

intel-rapl:1 package-0

intel-rapl:1:0 dram

[root@localhost powercap]# for DEV in intel-rapl:*; do echo -n "$DEV "; cat $DEV/constraint_0_max_power_uw; done

intel-rapl:0 205000000

intel-rapl:0:0 205500000

intel-rapl:1 205000000

intel-rapl:1:0 205500000

 

2.  We set Page Policy: [Closed] and Rank Interleaving: [Auto] in the BIOS setup menu (see the attached file).

-------The MLC Test result-------------------------

Intel(R) Memory Latency Checker - v3.6
Command line parameters: --bandwidth_matrix 

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Numa node
Numa node         0         1    
       0    81703.8    34320.0    
       1    34335.8    81777.9    

 

 

Jonas_T_1
Beginner

Hi Sir:

Another update on the test status:

Today I tried different settings in the BIOS setup menu.

So far, the best test result is below, with the BIOS set to Rank Interleaving [ 2-way Interleaving ].

Thanks.

------------MLC Test result ------------------------------------

Intel(R) Memory Latency Checker - v3.6

Command line parameters: --bandwidth_matrix

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes

Measuring Memory Bandwidths between nodes within system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type

        Numa node
Numa node         0         1    
       0    95104.6    34335.2    
       1    34353.5    95289.2    

McCalpinJohn
Honored Contributor III

I am glad that overriding the interleaving provided a good boost in performance.  Operating in closed page mode is probably required for this configuration, but closed page mode is not efficient for contiguous accesses.    It is hard to tell what other factors might be at play -- you would probably need one of the people from the Intel performance team to find out how well these 3DS DIMMs are supposed to work on this platform.

Jonas_T_1
Beginner

Hi Sir:

An update with today's test result.

We set the BIOS to Rank Interleaving [ 1-way Interleaving ],

and the test result is much better (I forgot to try it yesterday).

We will do more tests and reply to our customer.

I think the test result can PASS our customer's test criteria.

If we have any other questions, we will contact you.

Thanks a lot for your great support!!

------------MLC Test result ------------------------------------

Intel(R) Memory Latency Checker - v3.6

Command line parameters: --bandwidth_matrix

Using buffer size of 112.680MiB/thread for reads and an additional 112.680MiB/thread for writes

Measuring Memory Bandwidths between nodes within system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type

        Numa node
Numa node         0         1    
       0    104908.5    34363.1    
       1    34370.6    105075.1    

 

McCalpinJohn
Honored Contributor III

Good news!  Still some mystery as to why it does not do better, but 82% of peak should not be a problem for most applications -- it is within 10% of the bandwidth obtained with the best configuration.
