I need to run some benchmarks on a dual-Xeon and on a KNL. For each benchmark I would like a time and an estimate of the error. To get that, I run the same program 10 times in a row, take the average as the time, and take the standard deviation as the error. The goal is obviously to reduce the standard deviation.
I do the following, on a Linux CentOS 7.3 box:
- Disable turbo boost in the BIOS
- Make sure the dual-Xeon is in NUMA mode in the BIOS (that's what I want)
- Pin threads to the correct hardware threads with OMP_PLACES and OMP_PROC_BIND. I also use the following approach for scaling graphs: I plot the number of cores on the x-axis and, for the dual-Xeon, two curves, one with 1 thread per core and one with 2 threads per core. I do the same for the KNL with 4 curves (1 to 4 threads per core).
- I'll also try booting without the graphical user interface, to see if it makes a difference
Do you have any other tips?
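To make the pinning setup concrete, a run script along these lines is what I have in mind (the core counts and the binary name are hypothetical; adjust to the actual machine):

```shell
# 1 thread per physical core on a hypothetical dual 16-core Xeon
export OMP_PLACES=cores        # one place per physical core
export OMP_PROC_BIND=close     # pack threads onto consecutive places
export OMP_NUM_THREADS=32
./benchmark

# 2 threads per core: one place per hardware thread instead
export OMP_PLACES=threads
export OMP_NUM_THREADS=64
./benchmark
```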
I assume that each benchmark has an outer loop that runs the compute section many times. You should discard the first iteration, since it may include the OpenMP thread-pool creation overhead as well as any array "first touch" cost. You may wish to include this time when measuring a real application, but for a benchmark you presumably want the per-iteration time, and an average that includes the first pass is not correct.
doWork();                        // warm-up pass: thread-pool creation, first touch
t0 = omp_get_wtime();
for (int i = 0; i < nReps; ++i)
    doWork();                    // timed repetitions
t1 = omp_get_wtime();
repTime = (t1 - t0) / nReps;
Linux tries to do "the right thing" with NUMA page allocation, but it does not always succeed, and does not have all of the policies that one might want to apply in a multi-socket job.
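When the default NUMA policy is not what you want, you can force an explicit one with numactl. For instance (a sketch; the binary name is hypothetical):

```shell
# Bind both CPUs and memory to socket 0, eliminating cross-socket traffic
numactl --cpunodebind=0 --membind=0 ./benchmark

# Or interleave pages round-robin across both sockets,
# which can help bandwidth-bound kernels that span sockets
numactl --interleave=all ./benchmark
```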
For the KNL, the test methodology will depend on the operating mode of the system: the memory mode (flat, cache, or hybrid) and the cluster mode (all-to-all, quadrant, SNC-2, or SNC-4).
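For example, if the KNL is booted in flat memory mode (with quadrant or all-to-all clustering), the MCDRAM shows up as a separate CPU-less NUMA node, typically node 1, so you can compare MCDRAM and DDR runs with numactl (a sketch; the binary name is hypothetical):

```shell
numactl --membind=1 ./benchmark   # allocate everything in MCDRAM
numactl --membind=0 ./benchmark   # DDR only, for comparison
```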