If you export MKL_VERBOSE=1, do you see the LAPACK/BLAS execution times change between runs, with the same routines and the same input problem sizes? Are you sure that no third-party process is running at the same time?
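One way to make that comparison concrete is to save the MKL_VERBOSE output of a slow run and a fast run and compare the time reported for each call. The following is only a rough sketch: it assumes each call is logged as a line of the form "MKL_VERBOSE ROUTINE(args) <time><unit> ...", which may differ between MKL versions, and the helper names (total_time_per_call, compare_runs) are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line shape (may vary by MKL version):
#   MKL_VERBOSE DGEMM(N,N,1000,...,0x7ffd...,...) 12.34ms CNR:OFF Dyn:1 ...
_LINE = re.compile(r"^MKL_VERBOSE\s+(\w+)\((.*)\)\s+([0-9.]+)(ns|us|ms|s)\b")
_POINTER = re.compile(r"0x[0-9a-fA-F]+")
_TO_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def total_time_per_call(log_text):
    """Sum the reported time in seconds per (routine, arguments) pair.

    Pointer arguments are replaced by '*' so that identical calls from
    two separate runs map to the same key.
    """
    totals = defaultdict(float)
    for line in log_text.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue  # version banner or ordinary program output
        routine, args, value, unit = match.groups()
        key = (routine, _POINTER.sub("*", args))
        totals[key] += float(value) * _TO_SECONDS[unit]
    return dict(totals)


def compare_runs(slow_log, fast_log):
    """Return slow/fast time ratios for calls that appear in both logs."""
    slow = total_time_per_call(slow_log)
    fast = total_time_per_call(fast_log)
    return {
        key: slow[key] / fast[key]
        for key in slow
        if key in fast and fast[key] > 0.0
    }
```

If the ratios sit close to 1 for every routine while the overall run time still differs, the slowdown is probably outside the BLAS/LAPACK calls themselves.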
I am 100% certain this had nothing to do with other processes (there were none). Very reproducibly, running "sync ; echo 3 > /proc/sys/vm/drop_caches" improved the speed by about a factor of 1.5.
N.B., the code already has a number of timers in it.