The papers posted earlier that discuss -qopt-assume-safe-padding are relevant. The compiler falls back to gather and scatter for partial cache lines only to avoid a possibly unsafe access beyond the end of the array. Unaligned access may be the slowest case.
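To make that concrete, here is a minimal sketch of how you might allocate an array so that the padding assumption actually holds. As I understand the compiler documentation, with this option the compiler assumes it may touch up to 64 bytes past the end of an object, and it does not add that padding itself, so the program has to provide it. The helper names (`padded_bytes`, `alloc_padded_floats`, `scale`) and the `CACHE_LINE` constant are made up for illustration and are not part of any Intel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* bytes; assumed cache-line size and padding amount */

/* Hypothetical helper: bytes to request for n floats so that the 64 bytes
   after the last element still lie inside the allocation, rounded up to a
   multiple of CACHE_LINE as C11 aligned_alloc expects. */
static size_t padded_bytes(size_t n)
{
    size_t bytes = n * sizeof(float) + CACHE_LINE;
    return (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Hypothetical helper: 64-byte-aligned, padded array of n floats.
   Returns NULL on failure; release with free(). */
static float *alloc_padded_floats(size_t n)
{
    /* Guard against overflow in the size computation. */
    assert(n <= (SIZE_MAX - 2 * CACHE_LINE) / sizeof(float));
    return aligned_alloc(CACHE_LINE, padded_bytes(n));
}

/* A loop whose trip count need not be a multiple of the vector width,
   so the last vector iteration covers only part of a cache line. */
static void scale(float *a, size_t n, float s)
{
    for (size_t i = 0; i < n; ++i)
        a[i] *= s;
}
```

With storage allocated like this, a full-width load or store on the last partial vector stays inside memory you own, which is the situation the option asks you to guarantee. Whether the compiler then really drops the gather/scatter for the remainder is something to check in the optimization report.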
You might get attention from experts if you ask on the Intel Xeon Phi forum.