That kind of occasional slower timing on a single-threaded benchmark often means that the job was moving between cores, and so between caches. In principle, the scheduler should resume a task on the same core from which it was suspended whenever possible, but the task will move if something else has grabbed the preferred core. When you have linked with libiomp5 (openmp or parallel), KMP_AFFINITY works about the same as taskset or numactl. I have observed cases where pinning this way avoids those longer timings on short single-thread runs.
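If you would rather pin from inside the benchmark than from the command line, here is a minimal sketch, assuming Linux with glibc. It uses sched_setaffinity(), the same mechanism taskset relies on. The helper names first_allowed_core() and pin_to_core() are mine, made up for illustration.

```cpp
#include <cassert>
#include <sched.h>   // Linux/glibc; g++ defines _GNU_SOURCE by default, which exposes the CPU_* macros

// Hypothetical helper: lowest-numbered CPU the calling thread may currently run on, or -1.
int first_allowed_core() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

// Hypothetical helper: restrict the calling thread to a single CPU.
// Returns true on success.
bool pin_to_core(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one) == 0;  // pid 0 means the caller
}
```

Calling pin_to_core(first_allowed_core()) before the timed region should have roughly the effect of launching the job under taskset -c with that one core. Whether it removes the slow outliers depends on what else is competing for the core.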
-fp-model fast enables vectorization of sum reductions, and of some other less common reductions, which is disabled by standard-compliance options such as -fp-model source. #pragma simd reduction, introduced in the current icc, optimizes sum and dot product reductions, and sometimes max/min reductions, overriding the fp-model setting, and it even optimizes a few that -fp-model fast misses. Unfortunately, #pragma simd doesn't apply to inner_product() or accumulate(), so if those don't optimize, you may need to drop back to C code where you need the pragma.
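To make the difference concrete, here is a small sketch of the same dot product written both ways. The function names dot_loop() and dot_stl() are just for illustration, and I guard the pragma with __INTEL_COMPILER so other compilers don't complain about it. As far as I recall, the pragma wants a signed loop index, hence the int.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// C-style loop: the only form to which #pragma simd can be attached.
float dot_loop(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const float* pa = a.data();
    const float* pb = b.data();
    const int n = static_cast<int>(a.size());
    float sum = 0.0f;
#if defined(__INTEL_COMPILER)
#pragma simd reduction(+:sum)   // asks for a vectorized sum regardless of -fp-model
#endif
    for (int i = 0; i < n; ++i)
        sum += pa[i] * pb[i];
    return sum;
}

// STL form: no place to hang the pragma, so vectorization depends on
// the fp-model setting and on what the compiler manages by itself.
float dot_stl(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
}
```

Keep in mind that a vectorized reduction adds in a different order, so on real data the two results can differ in the last bits. That reordering is exactly what -fp-model source forbids.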