I've noticed that when profiling an OpenMP loop with DYNAMIC scheduling in VTune, the reported time varies widely between runs (the longest is up to 3x the shortest). The code is being run with OMP_NUM_THREADS=4 in batch on a node comprising 2x 6-core Westmere L5640 chips.
I'll do some work to explore the causes, but I was wondering whether others have any insight. Previous experience of running OpenMP codes on such nodes has indicated a variance of 10% or so, but this is significantly higher (and it hides the effect of some optimisations when the reported time falls in the upper part of the range).
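Large variations in performance are expected with dynamic scheduling, since affinity placement isn't effective: which thread picks up which chunk of iterations can change from run to run, so pinning threads to cores doesn't keep the data close to the core that uses it. The largest variations may be associated with lack of CPU last level cache locality.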
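To see the effect for yourself, something like the following C sketch may help. It is only an illustration, not your loop: N, the chunk size of 4 and all the function names are made up for the example. It records which thread runs each iteration, so two passes over the same loop can be compared. With schedule(static) and an unchanged team size the mapping should come out identical every pass, whereas with schedule(dynamic) it generally will not.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 1024   /* hypothetical iteration count */

/* Thread number inside a parallel region; 0 when built without OpenMP. */
static int thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Record which thread executes each iteration under static scheduling. */
static void record_static(int *owner)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i)
        owner[i] = thread_id();
}

/* Same loop under dynamic scheduling (chunk size 4 is an arbitrary choice). */
static void record_dynamic(int *owner)
{
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; ++i)
        owner[i] = thread_id();
}

/* Count iterations whose owning thread differs between two passes. */
static int count_moved(const int *a, const int *b)
{
    int moved = 0;
    for (int i = 0; i < N; ++i)
        moved += (a[i] != b[i]);
    return moved;
}
```

Calling record_static twice and passing both arrays to count_moved should give 0. Doing the same with record_dynamic will typically give a large, run-dependent count when more than one thread is used, which is why data touched by a given iteration does not stay in one core's cache.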
Within a single CPU, typical variations are larger on 6-core WSM (Westmere) than on other Xeon CPUs, due to the sharing of 2 paths to last level cache among 4 of the 6 cores. As you are running only 4 threads across a total of 12 cores, it may run fastest when the threads happen to land on cores that don't share data paths and get a favorable placement across the two CPUs.
In applications such as MKL sparse matrix multiplication, dynamic scheduling is avoided by first scanning the data set and dividing the work as evenly as possible among threads, then using static scheduling to improve cache locality. On Westmere, the full performance of equivalent 4-core CPUs may be achieved by distributing work to cores which don't share data paths.
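The scan-and-divide idea can be sketched roughly as below for a matrix in CSR format. This is not MKL's actual implementation, just a minimal illustration in C. The names balance_rows and spmv_static are hypothetical, and the sketch assumes that the number of nonzeros per row is a good enough estimate of the work per row.

```c
#include <assert.h>

/* Split rows [0, nrows) into nparts contiguous chunks holding roughly equal
   numbers of nonzeros. row_ptr is the usual CSR row-pointer array of length
   nrows + 1; bounds must have room for nparts + 1 entries. */
static void balance_rows(const int *row_ptr, int nrows, int nparts, int *bounds)
{
    assert(nparts > 0 && nrows >= 0);
    long total = row_ptr[nrows];
    int row = 0;
    bounds[0] = 0;
    for (int p = 1; p < nparts; ++p) {
        /* Nonzeros that should ideally precede chunk p. */
        long target = total * p / nparts;
        /* Advance while the whole of the current row still fits below target. */
        while (row < nrows && row_ptr[row + 1] <= target)
            ++row;
        bounds[p] = row;
    }
    bounds[nparts] = nrows;
}

/* y = A * x, one precomputed chunk of rows per loop iteration.
   With schedule(static, 1) and a team of nparts threads, chunk p is
   handled by thread p on every call. */
static void spmv_static(const int *row_ptr, const int *col, const double *val,
                        const double *x, double *y,
                        const int *bounds, int nparts)
{
    #pragma omp parallel for schedule(static, 1)
    for (int p = 0; p < nparts; ++p) {
        for (int i = bounds[p]; i < bounds[p + 1]; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }
}
```

The scan in balance_rows is done once per matrix and its cost is spread over repeated multiplications. Because each thread then works on the same rows every time, combining this with thread pinning gives the caches a chance to stay warm between calls.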