We are currently evaluating Intel MKL to improve the performance of our application. However, we found that on computers with an Intel i7 12th gen CPU, performance decreased significantly when using Intel MKL. Profiling the application showed that two MKL BLAS functions were taking up most of the CPU time, namely
- [MKL BLAS]@avx2_xdaxpy
- [MKL BLAS]@avx2_dger
We are able to reproduce the issue with the attached modified MKL sample program.
With this program we can see that the MKL function cblas_dger runs considerably slower on an i7 12th gen CPU when using the AVX2 instruction set with a single thread than when using the AVX instruction set with a single thread.
Running the same code on an i7 10th gen showed increased performance when using the AVX2 instruction set.
See the attached screenshot for timings of 1,000 calls to this function on an i7-12700K.
oneMKL version used: oneMKL 2023.0 Product build 20221128
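For readers who do not have the attachment: cblas_dger computes the rank-1 update A := alpha * x * y^T + A. The sketch below is a plain Python reference of that operation with a timing loop over 1,000 calls. It is not the attached sample and not MKL's implementation; the helper names `dger` and `time_calls` and the 64 x 64 problem size are made up for illustration. It only shows the shape of the work: every element of A is read and written once per call for about two floating-point operations, so the routine is limited by memory traffic rather than arithmetic, and each row update is an axpy.

```python
import time


def dger(m, n, alpha, x, y, a):
    """Reference rank-1 update A := alpha * x * y^T + A.

    `a` is a row-major list of lists with m rows and n columns.
    Hypothetical helper for illustration, not an MKL call.
    """
    for i in range(m):
        scale = alpha * x[i]
        row = a[i]
        # Each row update is an axpy: row := scale * y + row
        for j in range(n):
            row[j] += scale * y[j]
    return a


def time_calls(m, n, calls=1000):
    """Time `calls` consecutive rank-1 updates on an m x n matrix."""
    x = [1.0] * m
    y = [2.0] * n
    a = [[0.0] * n for _ in range(m)]
    start = time.perf_counter()
    for _ in range(calls):
        dger(m, n, 0.5, x, y, a)
    elapsed = time.perf_counter() - start
    return elapsed, a


if __name__ == "__main__":
    # Problem size is an assumption; the original post does not state it.
    elapsed, a = time_calls(64, 64)
    print(f"1000 calls took {elapsed:.4f} s")
```

The absolute timings of this sketch say nothing about MKL; it is only meant to make the measured operation concrete.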
Thanks for posting on Intel Communities.
Thanks for sharing the feedback. We have informed the development team about this issue and will get back to you soon with an update.
We would like to inform you that the performance difference comes from the core architecture. Recent desktop parts use Cove cores, which have a larger cache and more memory channels than older AVX2 desktop cores. As a result the behavior differs, and simultaneous memory accesses perform better on recent desktop parts. This was measured on ICX. "test" is Fortran-based code, and its behavior is similar to AVX. Please find the performance charts attached.
Multiple memory accesses cause performance degradation on AVX2-based Xeon parts. MKL has no mechanism to distinguish old and new AVX2-based architectures, so a performance improvement could not be made.
We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.