I need to run some benchmarks on a Dual-Xeon and on a KNL. When benchmarking, I would like to get a time and an estimate of the error. To that end, I run the same program 10 times in a row, take the average as the time, and compute the standard deviation as the error. The goal is obviously to reduce the standard deviation.
I do the following on a Linux CentOS 7.3 box:
- Disable turbo boost on the BIOS
- Make sure that the Dual-Xeon is in NUMA mode in the BIOS (That's what I want)
- Make sure threads are pinned to the right hardware threads with OMP_PLACES and OMP_PROC_BIND. For scaling graphs I also use the following tip: I put the number of cores on the x-axis and plot 2 curves for the Dual-Xeon, one for 1 thread per core and one for 2 threads per core. I do the same for the KNL with 4 curves.
- I'll also try booting without the graphical user interface, to see if it makes a difference
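For reference, the pinning setup from the list above can be sketched as follows (the thread counts are hypothetical, and ./bench is a placeholder for the benchmark binary):

```shell
# One thread per physical core (the "1 thread per core" curve):
export OMP_PLACES=cores
export OMP_PROC_BIND=close
OMP_NUM_THREADS=16 ./bench      # e.g. 16 cores in total

# One thread per hardware thread (the "2 threads per core" curve):
export OMP_PLACES=threads
OMP_NUM_THREADS=32 ./bench
```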
Do you have any other tip?
I assume that each benchmark has an outer loop that runs the compute section many times. You should discard the first iteration, as it may include the OpenMP thread-pool creation overhead as well as any array "first touch" cost. You may wish to include this time when modeling an actual application, but for a benchmark that reports a per-iteration time, averaging in the first pass is not correct.
doWork(); // warm-up pass: thread pool creation + first touch (discarded)
t0 = omp_get_wtime();
for (int i = 0; i < nReps; ++i)
    doWork(); // timed iterations
t1 = omp_get_wtime();
repTime = (t1 - t0) / nReps; // average time per iteration
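Once each run prints its repTime, the 10-run mean and standard deviation from the question can be reduced with a short awk helper; a sketch, where mean_stddev is my own helper name and ./bench a placeholder binary:

```shell
# mean_stddev: read one timing per line on stdin, print mean and
# population standard deviation (divide by n-1 instead of n for the
# sample stddev if you prefer).
mean_stddev() {
    awk '{ s += $1; ss += $1*$1; n++ }
         END { m = s/n; printf "mean %.3f stddev %.3f\n", m, sqrt(ss/n - m*m) }'
}

# Intended use (./bench prints one time per run):
#   for i in $(seq 10); do ./bench; done | mean_stddev
# Demo with fixed numbers so the arithmetic is checkable:
printf '2\n4\n4\n4\n5\n5\n7\n9\n' | mean_stddev   # mean 5.000 stddev 2.000
```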
Linux tries to do "the right thing" with NUMA page allocation, but it does not always succeed, and does not have all of the policies that one might want to apply in a multi-socket job.
- As a start, the code should be set up so that the data is first touched by the same threads that will later work on it. If that is the case, running "numastat" before and after the dual-socket job will show whether the OS was actually able to allocate the pages on the intended NUMA nodes.
- If the threads of an application process are limited to a single package, the "numactl" command with the "--membind" option can be used to guarantee that all of the pages are allocated on the desired NUMA node (or else the job will be killed).
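A sketch of both checks, assuming the application threads are bound to socket 0 and ./bench is a placeholder for the benchmark binary:

```shell
numastat > before.txt                   # per-node allocation counters

# Run with both CPUs and memory restricted to NUMA node 0; the job is
# killed rather than silently allocating on the remote node.
numactl --cpunodebind=0 --membind=0 ./bench

numastat > after.txt
diff before.txt after.txt               # shows on which node the new pages landed
```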
For KNL the test methodology will depend on the operating mode of the system.
- In "Flat" mode, you will want to use the "numactl" command with the "--membind" option to place the data in either the MCDRAM or DDR4 memory, depending on which you want to test.
- In "Cache" mode, performance can be quite variable depending on the size of the job's data footprint (relative to the MCDRAM cache size) and the randomness of the page tables. Intel's OS provides a tool to sort the page free lists which can reduce variability and improve performance.
- In the "Sub-NUMA-Cluster" modes, the KNL needs to be treated as a 2-node or 4-node NUMA system, with the added complexity of dealing with "Flat" vs "Cache" use of the MCDRAM.
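For the "Flat" and "Sub-NUMA-Cluster" cases above, the binding can be sketched with numactl. The node numbering here is an assumption (on many Flat/quadrant systems DDR4 is node 0 and the CPU-less MCDRAM is node 1; in SNC-4 Flat, DDR is nodes 0-3 and MCDRAM nodes 4-7), and ./bench is a placeholder; verify with "numactl --hardware" on the actual machine first:

```shell
numactl --hardware                      # list nodes; MCDRAM nodes have no CPUs

# Flat mode: pick the memory under test explicitly.
numactl --membind=0 ./bench             # DDR4 only
numactl --membind=1 ./bench             # MCDRAM only (job killed if it overflows)
numactl --preferred=1 ./bench           # MCDRAM first, spill to DDR4 when full

# SNC-4 + Flat: bind threads to cluster 0 and data to its local MCDRAM node.
numactl --cpunodebind=0 --membind=4 ./bench
```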