This post is regarding benchmarking algorithms on the Intel Xeon processors.
I have been trying to reproduce the benchmarks from the code provided with the article above, specifically mmatest1.c from the zip file attached to the article. One observation is that there is a considerable warm-up time, which adds a large overhead to the first algorithm being benchmarked (in this case, the cblas_sgemm function).
A loop count of 16 is often not enough to offset the thread 'warm-up' time. I am not sure what the correct terminology for this is.
- Can anyone confirm this? When benchmarking, is it better to give a 'warm-up' kernel to the threads?
- Where can I read more about this?
- Can anyone also suggest the best way/algorithm/function to access submatrices of size (MxM) from a larger matrix?
To review my code, kindly refer to: GitHub - akhauriyash/XNOR-Nets: An OpenMP parallelized implementation of XNOR kernels.
- Tags:
- Parallel Computing
I don't know whether you are looking for something applicable to all Intel CPUs, but yes, we can confirm that the Intel Xeon Phi KNC was particularly slow in setting up data structures the first time (easily an extra half second or so). Rather than running a large number of repetitions and averaging the fast and slow iterations, you may consider the typical tactic of running 1 or 2 iterations for warm-up before running the timed code. Any CPU is likely to incur more last-level cache misses the first time a data region is entered, so you must decide whether you want to include these in your benchmark timing (if possible) or exclude them.
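To make the warm-up tactic concrete, here is a minimal sketch in C. It is not taken from mmatest1.c: the names `bench_mean_seconds`, `kernel_fn`, `mm_ctx`, and `naive_sgemm` are hypothetical, and the naive matrix multiply only stands in for whatever routine you are actually timing (for example cblas_sgemm). It assumes a POSIX system that provides `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L /* needed for clock_gettime on some systems */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel type: one full run of the algorithm under test. */
typedef void (*kernel_fn)(void *ctx);

/* Wall-clock time in seconds, read from a monotonic clock. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run `warmup` untimed iterations, then return the mean time in seconds
   over `reps` timed iterations. The warm-up runs absorb one-time costs
   such as thread start-up and first-touch cache misses. */
double bench_mean_seconds(kernel_fn kernel, void *ctx, int warmup, int reps)
{
    assert(kernel != NULL && warmup >= 0 && reps > 0);

    for (int i = 0; i < warmup; ++i)
        kernel(ctx);                 /* not timed */

    double t0 = now_seconds();
    for (int i = 0; i < reps; ++i)
        kernel(ctx);                 /* timed */
    double t1 = now_seconds();

    return (t1 - t0) / reps;
}

/* Stand-in workload (an assumption, not the original benchmark):
   naive MM_N x MM_N single-precision matrix multiply, row-major. */
#define MM_N 64
struct mm_ctx {
    float a[MM_N * MM_N];
    float b[MM_N * MM_N];
    float c[MM_N * MM_N];
};

void naive_sgemm(void *p)
{
    struct mm_ctx *m = p;
    for (int i = 0; i < MM_N; ++i) {
        for (int j = 0; j < MM_N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < MM_N; ++k)
                sum += m->a[i * MM_N + k] * m->b[k * MM_N + j];
            m->c[i * MM_N + j] = sum;
        }
    }
}
```

A call such as `bench_mean_seconds(naive_sgemm, &ctx, 2, 16)` would run two untimed warm-up iterations and then average 16 timed ones. If you do want the first-run cost in your numbers, pass `warmup = 0` and report that measurement separately rather than folding it into the average.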