I am curious about the L1 cache behavior of KNL when SIMD intrinsics are used. I have made the following observations.
I wrote a matrix-matrix multiplication (GEMM) program in two versions. The first does GEMM in the conventional way, without intrinsics. The second uses intrinsics. The matrices are small, for example 16 * 16. I profiled both versions with VTune. The first version has very few L1 cache misses, whereas the second has several times more.
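The actual code is not shown here, so the following is only a minimal sketch of what two such kernels might look like, assuming row-major float matrices and the update C += A * B. The names `DIM`, `gemm_scalar`, and `gemm_avx512` are illustrative and not taken from the real program. The AVX-512 path is compiled only when the compiler defines `__AVX512F__`.

```c
#include <stddef.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

#define DIM 16  /* matrix dimension from the question (assumed square) */

/* Scalar version: C += A * B, row-major, no intrinsics. */
void gemm_scalar(const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < DIM; ++i) {
        for (size_t k = 0; k < DIM; ++k) {
            float a = A[i * DIM + k];
            for (size_t j = 0; j < DIM; ++j)
                C[i * DIM + j] += a * B[k * DIM + j];
        }
    }
}

#ifdef __AVX512F__
/* AVX-512 version: one 512-bit register holds a full row of 16 floats. */
void gemm_avx512(const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < DIM; ++i) {
        __m512 acc = _mm512_loadu_ps(C + i * DIM);        /* row i of C */
        for (size_t k = 0; k < DIM; ++k) {
            __m512 a = _mm512_set1_ps(A[i * DIM + k]);    /* broadcast A[i][k] */
            __m512 b = _mm512_loadu_ps(B + k * DIM);      /* row k of B */
            acc = _mm512_fmadd_ps(a, b, acc);             /* acc += a * b */
        }
        _mm512_storeu_ps(C + i * DIM, acc);
    }
}
#endif
```

In this sketch both kernels read each element of A and each row of B in the same order, so they touch the same cache lines. The vector version does so with far fewer, wider loads and stores.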
The first version is compiled with -O1, so it is not vectorized. The second version is fully vectorized because it uses AVX-512 intrinsics. As expected, the first version takes much longer to run.
My question is: why do the cache miss counts differ so much? The two versions should have the same memory access pattern, and all the data (three 16*16 float matrices) should fit in the L1 cache, so there should be only compulsory cache misses.
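To put a number on the "only compulsory misses" expectation, here is a small sketch of the working-set arithmetic. It assumes 4-byte floats, 64-byte cache lines, and a 32 KB L1 data cache per core. These values are assumptions to check against the actual hardware, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Total bytes occupied by n_matrices square float matrices of size dim x dim. */
size_t footprint_bytes(size_t n_matrices, size_t dim)
{
    return n_matrices * dim * dim * sizeof(float);
}

/* Number of cache lines needed to hold `bytes`, assuming line-aligned data. */
size_t cache_lines(size_t bytes, size_t line_bytes)
{
    return (bytes + line_bytes - 1) / line_bytes;  /* round up */
}
```

With three 16*16 matrices, `footprint_bytes(3, 16)` is 3072 bytes and `cache_lines(3072, 64)` is 48, far below a 32 KB L1. Under these assumptions, about 48 compulsory line fills are expected per cold run if the matrices are 64-byte aligned, and a few more if they are not.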