Solved: Benchmarking algorithms on Intel Xeon Gold (DevCloud)

YAkha · 04-01-2018

This post is regarding benchmarking algorithms on the Intel Xeon processors.

Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System | Intel® Software

I have been trying to reproduce the benchmarks using the code from the article above, specifically mmatest1.c from the zip file attached to the article. One thing I have observed is a considerable warm-up time, which adds a large overhead to whichever algorithm is benchmarked first (in this case, the cblas_sgemm function).

A loop count of 16 is often not enough to offset the thread 'warm-up' time. I am not sure what the correct terminology for this would be.

Can anyone confirm this? When benchmarking, is it better to give the threads a 'warm-up' kernel first?
Where can I read up more on this?
Can anyone also suggest the best way/algorithm/function to access submatrices of size M×M from a larger matrix?

To review my code, kindly refer to: GitHub - akhauriyash/XNOR-Nets: An OpenMP parallelized implementation of XNOR kernels.

TimP · 04-02-2018

I don't know whether you are looking for something applicable to all Intel CPUs, but yes, we can confirm that the Intel Xeon Phi KNC was particularly slow in setting up data structures on the first call (easily an extra half second or so). Rather than running a large number of repetitions and averaging the fast and slow iterations, you may consider the typical tactic of running 1 or 2 warm-up iterations before the timed code. Any CPU is likely to incur more last-level cache misses the first time a data region is entered, so you must decide whether to include these in your benchmark timing (if possible) or exclude them.
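
A minimal sketch of the warm-up tactic is below. It is not the code from mmatest1.c: naive_sgemm is a hypothetical stand-in for whatever kernel is being measured (cblas_sgemm in the original benchmark), and now_seconds and time_sgemm are illustrative helper names. The warm-up count of 1 or 2 follows the suggestion above; everything else is an assumption.

```c
#include <time.h>

/* Hypothetical stand-in kernel: C = A * B for n x n row-major matrices.
   In a real benchmark this would be cblas_sgemm or the variant under test. */
void naive_sgemm(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Wall-clock time in seconds via C11 timespec_get. A monotonic clock
   (or omp_get_wtime in an OpenMP build) may be preferable in practice. */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs `warmup` untimed calls, then returns the mean seconds per call
   over `reps` timed calls. The untimed calls absorb one-time costs such
   as thread start-up and first-touch cache misses. */
double time_sgemm(int n, const float *a, const float *b, float *c,
                  int warmup, int reps)
{
    for (int i = 0; i < warmup; i++)
        naive_sgemm(n, a, b, c);          /* warm-up: not timed */

    double start = now_seconds();
    for (int i = 0; i < reps; i++)
        naive_sgemm(n, a, b, c);          /* timed repetitions */
    return (now_seconds() - start) / reps;
}
```

Calling time_sgemm(n, a, b, c, 2, 16) would run 2 warm-up iterations followed by the 16 timed iterations mentioned in the question. Passing warmup = 0 instead gives the timing with first-call costs included, if that is what you want to measure.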
