I'm measuring three approaches to matrix multiplication performance: a naive blocked OpenMP implementation, Eigen, and SGEMM from MKL 2021.4.0. For simplicity, all matrices are square, of type `float`, size `n x n`, and aligned to 64 bytes. The compiler is `GCC 8.3.1` with flags `-msse4.2 -O3 -fopenmp`; the OS is `CentOS 7`.
I don't understand why MKL SGEMM is the slowest. Why is a naive OpenMP implementation faster than a heavily optimized library?
The three kernels under test are **Blocked OpenMP (`BS = n / 64`)**, **Eigen**, and **MKL SGEMM**. (The inline code snippets did not render in this post; the complete source is posted in a reply below.)
Google Benchmark results on an Intel Xeon Silver 4114 (2 sockets, 2 NUMA nodes):
```
Benchmark                              Time           CPU    Iterations
-----------------------------------------------------------------------
MatMul/OmpBlk/4096/64/real_time        1132 ms     1038 ms            1
MatMul/OmpBlk/16384/64/real_time      83668 ms    80612 ms            1
MatMul/OmpBlk/32768/64/real_time    1562980 ms  1492184 ms            1
MatMul/Eigen/4096/real_time             878 ms      867 ms            1
MatMul/Eigen/16384/real_time          36140 ms    31629 ms            1
MatMul/Eigen/32768/real_time         259762 ms   246788 ms            1
MatMul/Blas/4096/real_time             4091 ms     3719 ms            1
MatMul/Blas/16384/real_time          219940 ms   219581 ms            1
MatMul/Blas/32768/real_time         1773874 ms  1750015 ms            1
```
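For scale: a dense `n x n` multiply costs about 2n^3 floating-point operations, so n = 4096 is roughly 2 * 4096^3 ≈ 1.37e11 flops. Eigen's 878 ms then works out to about 157 GFLOP/s and the blocked OpenMP kernel's 1132 ms to about 121 GFLOP/s, while MKL's 4091 ms is only about 34 GFLOP/s.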
**ldd snippet:**
```
libmkl_intel_ilp64.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.so.1
libmkl_core.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.so.1
libmkl_intel_thread.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so.1
libiomp5.so => /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64/libiomp5.so
```
---
Hi Anton,
Thanks for reaching out to us.
>>SGEMM from MKL 2021.4.0......I don't understand why MKL SGEMM is the slowest
Could you please try the latest MKL version, which is 2022.1.0, and see if there is any improvement?
It would be a great help if you could provide the complete sample reproducer code along with steps to reproduce the issue, plus the output with the MKL_VERBOSE variable enabled (export MKL_VERBOSE=1 before running the executable), so that we can check this issue from our end as well.
Regards,
Vidya.
---
Hi VidyalathaB,
Thank you for your kind answer. Unfortunately, I cannot change the MKL version. The code snippets contain all the information and logic you may be interested in. If it's not obvious how to figure out the call signatures, I'm always glad to help. Here they are:
```
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 137.42ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 100.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 99.37ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 93.22ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 4.34s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 4.19s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 3.85s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 3.73s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
```
(The first four calls are the n = 1024 runs, the last four the n = 4096 runs; NThr:20 shows MKL is using 20 threads, one per physical core of the dual Xeon Silver 4114.)
---
Hi Anton,
Thanks for providing the verbose output.
Could you please attach the complete code here in the forum so that we can do a quick check from our end?
If you would rather not post it here, please let us know so that we can contact you privately.
Regards,
Vidya.
---
```cpp
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
#include <omp.h>
#include <mkl.h>
#include <immintrin.h>
#include <Eigen/Dense>
#include <benchmark/benchmark.h>
#define USECPSEC 1000000ULL
// returns microseconds elapsed since `start` (pass 0 to get the current time)
unsigned long long dtime_usec(unsigned long long start=0) {
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
int perfcheck();
// fills A with i/n down the rows and B with j/n across the columns,
// so (A x B)[i][j] ≈ i*j/n analytically; `seed` is unused
void init_data(int n, float* A_mat, float* B_mat, float seed) {
    float* A = A_mat;
    float* B = B_mat;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            A[i*n+j] = (float)i/(float)n;
            B[i*n+j] = (float)j/(float)n;
        }
    }
}
void verify_res(int n, const float* C1, const float* C2, int ncase)
{
    float norm = 0.0;
    const float* C = C2;
    float rtol = 1e-04, atol = 1e-05;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            // squared distance of C2 from the analytic result i*j/n (see init_data)
            norm += (C[i*n+j]-(float)(i*j)/(float)n)*(C[i*n+j]-(float)(i*j)/(float)n);
            // element-wise mixed absolute/relative tolerance check between the
            // two results; fabsf, because the integer abs() would truncate the
            // float difference to 0
            if (fabsf(C1[i*n+j]-C2[i*n+j]) > atol + rtol * fabsf(C1[i*n+j])) {
                printf("Error in (%d, %d)\n", i, j);
                printf("%d - C1: %f C2: %f\n", ncase, C1[i*n+j], C2[i*n+j]);
                throw 1;
            }
        }
    if (norm > 1e-8)
    {
        printf("Error: %f\n", norm);
        throw 1;
    }
}
void matmat_mul(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = &A_mat[0];
    const float* B = &B_mat[0];
    float* C = &C_out[0];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            C[i*n+j] *= beta;
            for (int k = 0; k < n; k++) {
                C[i*n+j] += alpha*A[i*n+k]*B[k*n+j];
            }
        }
    }
}
void matmat_mul_simd(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);
    __m128 alpha4 = _mm_set1_ps(alpha);
    __m128 beta4 = _mm_set1_ps(beta);
    // scale C by beta first ...
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 4) {
            __m128 c4 = _mm_load_ps(&C[i*n+j]);
            c4 = _mm_mul_ps(beta4, c4);
            _mm_store_ps(&C[i*n+j], c4);
        }
    }
    // ... then accumulate alpha * A x B, four floats of a C row at a time
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            __m128 a4 = _mm_set1_ps(A[i*n+k]);
            a4 = _mm_mul_ps(alpha4, a4);
            for (int j = 0; j < n; j += 4) {
                __m128 c4 = _mm_load_ps(&C[i*n+j]);
                __m128 b4 = _mm_load_ps(&B[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4, b4), c4);
                _mm_store_ps(&C[i*n+j], c4);
            }
        }
    }
}
void matmat_mul_omp(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);
    // nb: "#pragma vector aligned" (here and below) is an Intel-compiler hint;
    // GCC does not recognize it, hence -Wno-unknown-pragmas in the flags
    #pragma omp parallel for schedule(dynamic)
    #pragma vector aligned
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float tmpSum = 0.0;
            // each (i, j) element accumulates into its own tmpSum, so no
            // reduction clause is needed (and a standalone "#pragma omp
            // reduction" is not a valid directive anyway)
            #pragma GCC unroll 8
            for (int k = 0; k < n; k++) {
                tmpSum += A[i*n+k]*B[k*n+j];
            }
            C[i*n+j] = beta * C[i*n+j] + alpha * tmpSum;
        }
    }
}
void matmat_mul_simd_omp(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);
    __m128 alpha4 = _mm_set1_ps(alpha);
    __m128 beta4 = _mm_set1_ps(beta);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 4) {
            __m128 c4 = _mm_load_ps(&C[i*n+j]);
            c4 = _mm_mul_ps(beta4, c4);
            _mm_store_ps(&C[i*n+j], c4);
        }
    }
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            __m128 a4 = _mm_set1_ps(A[i*n+k]);
            a4 = _mm_mul_ps(alpha4, a4);
            for (int j = 0; j < n; j += 4) {
                __m128 c4 = _mm_load_ps(&C[i*n+j]);
                __m128 b4 = _mm_load_ps(&B[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4, b4), c4);
                _mm_store_ps(&C[i*n+j], c4);
            }
        }
    }
}
void matmat_mul_omp_blk(int n, int BS, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C, tiled in BS x BS blocks
    // (the loop bounds assume n is a multiple of BS)
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);
    #pragma omp parallel for collapse(2)
    #pragma vector aligned
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i*n+j] *= beta;
    #pragma omp parallel for schedule(dynamic)
    #pragma vector aligned
    for (int i = 0; i < n; i += BS)
        for (int k = 0; k < n; k += BS)
            for (int j = 0; j < n; j += BS)
                for (int ii = i; ii < i+BS; ii++)
                    for (int kk = k; kk < k+BS; kk++)
                        for (int jj = j; jj < j+BS; jj++)
                            C[ii*n+jj] += alpha*A[ii*n+kk]*B[kk*n+jj];
}
void matmat_mul_eigen(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);
    // "The best code is the code I don't have to write"
    Eigen::Map<const Eigen::MatrixXf> AM(A, n, n);
    Eigen::Map<const Eigen::MatrixXf> BM(B, n, n);
    Eigen::Map<Eigen::MatrixXf> CM(C, n, n);
    // MatrixXf is column-major, so the row-major buffers map to transposes:
    // C^T = B^T * A^T, hence BM * AM ("fortran order"); note that beta*CM
    // reads CM inside a noalias() assignment, which relies on beta == 0
    CM.noalias() = beta*CM + alpha*(BM * AM); // fortran order!
}
void matmat_mul_sgemm(const int n, float* A_mat, float* B_mat, float* C_out) {
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);
    // "The best code is the code I don't have to write"
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, A, n, B, n, beta, C, n);
}
class MatMul : public benchmark::Fixture {
protected:
    int i = 0;   // nb: i and b are unused
    int n;
    float* b;
    float* A, * B, * C;
public:
    void SetUp(const ::benchmark::State& state) {
        n = state.range(0);
        A = (float*)_mm_malloc(n*n*sizeof(float), 64);
        B = (float*)_mm_malloc(n*n*sizeof(float), 64);
        C = (float*)_mm_malloc(n*n*sizeof(float), 64);
        init_data(n, A, B, (float)time(NULL));
    }
    void TearDown(const ::benchmark::State& state) {
        _mm_free(A);
        _mm_free(B);
        _mm_free(C);
    }
};
BENCHMARK_DEFINE_F(MatMul, Verify)(benchmark::State& st) {
    // cross-checks all implementations against each other on a small case;
    // overrides the fixture's n (registered below with Arg(8)) and allocates
    // its own buffers
    n = 16;
    float* A2 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* B2 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C1 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C2 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C3 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C4 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C5 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C6 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C7 = (float*)_mm_malloc(n*n*sizeof(float), 64);
    init_data(n, A2, B2, (float)time(NULL));
    for (auto _ : st) {
        matmat_mul(n, A2, B2, C1);
        matmat_mul_omp(n, A2, B2, C2);
        matmat_mul_simd(n, A2, B2, C3);
        matmat_mul_omp_blk(n, 4, A2, B2, C4);
        matmat_mul_simd_omp(n, A2, B2, C5);
        matmat_mul_eigen(n, A2, B2, C6);
        matmat_mul_sgemm(n, A2, B2, C7);
    }
    verify_res(n, C1, C2, 1);
    verify_res(n, C2, C3, 2);
    verify_res(n, C3, C4, 3);
    verify_res(n, C4, C5, 4);
    verify_res(n, C5, C6, 5);
    verify_res(n, C6, C7, 6);
    _mm_free(A2);
    _mm_free(B2);
    _mm_free(C1);
    _mm_free(C2);
    _mm_free(C3);
    _mm_free(C4);
    _mm_free(C5);
    _mm_free(C6);
    _mm_free(C7);
}
BENCHMARK_REGISTER_F(MatMul, Verify)
    ->Unit(benchmark::kMillisecond)
    ->Arg(8)
    ->UseRealTime();
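// (hypothetical, not part of this listing) The benchmark names in the results
// table, e.g. MatMul/OmpBlk/4096/64/real_time, imply registrations along these
// lines, which appear to have been omitted from the post:
//
//   BENCHMARK_REGISTER_F(MatMul, OmpBlk)
//       ->Unit(benchmark::kMillisecond)
//       ->Args({4096, 64})->Args({16384, 64})->Args({32768, 64})
//       ->UseRealTime();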
BENCHMARK_DEFINE_F(MatMul, SingleThread)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, Simd)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_simd(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, Omp)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_omp(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, SimdOmp)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_simd_omp(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, OmpBlk)(benchmark::State& st) {
    // range(1) is a divisor, so "OmpBlk/4096/64" runs with bs = 4096/64 = 64
    int bs = n / st.range(1);
    for (auto _ : st) {
        matmat_mul_omp_blk(n, bs, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, Eign)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_eigen(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, MklBlas)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_sgemm(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
int perfbench(int n) {
    float* A = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* B = (float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C = (float*)_mm_malloc(n*n*sizeof(float), 64);
    // nb: A, B and C are never passed through init_data here, so the kernels
    // run on whatever _mm_malloc returned
    unsigned long long dt = 0;
    int nrepeats = 3;
    std::vector<unsigned long long> times(nrepeats);
    int bs = 128;
    matmat_mul_omp_blk(n, bs, A, B, C); // warm up
    times.clear(); times.resize(nrepeats);
    for (int i = 0; i < nrepeats; i++)
    {
        dt = dtime_usec(0);
        matmat_mul_omp_blk(n, bs, A, B, C);
        times[i] = dtime_usec(dt);
    }
    // mean of the repeats, converted to ms (with only 3 repeats this is
    // quite sensitive to outliers)
    dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;
    std::cout << "omp_blk time: " << dt << "ms" << std::endl;
    matmat_mul_eigen(n, A, B, C); // warm up
    times.clear(); times.resize(nrepeats);
    for (int i = 0; i < nrepeats; i++)
    {
        dt = dtime_usec(0);
        matmat_mul_eigen(n, A, B, C);
        times[i] = dtime_usec(dt);
    }
    dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;
    std::cout << "Eigen time: " << dt << "ms" << std::endl;
    matmat_mul_sgemm(n, A, B, C); // warm up
    times.clear(); times.resize(nrepeats);
    for (int i = 0; i < nrepeats; i++)
    {
        dt = dtime_usec(0);
        matmat_mul_sgemm(n, A, B, C);
        times[i] = dtime_usec(dt);
    }
    dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;
    std::cout << "MKL sgemm time: " << dt << "ms" << std::endl;
    _mm_free(A);
    _mm_free(B);
    _mm_free(C);
    return 0;
}
#if 1
int main()
{
    const int n = 4*1024;
    perfbench(n);
    // perfcheck();
}
#else
// size-sweep parameters (unused in this listing)
int from = 512; // 1 MB
// int to = 2048; // 2k * 2k * 8 = 32M
int to = 32*1024;
int mult = 8;
int step = 512;
BENCHMARK_MAIN();
#endif
```
---
Hi Anton,
Thanks for sharing the reproducer here.
I tried running the code; here is the output I'm getting, which shows MKL taking the least time:
```
omp_blk time: 1179ms
Eigen time: 535ms
MKL sgemm time: 172ms
```
Command used:
```
g++ main.cpp -O3 -fopenmp -DMKL_ILP64 -m64 -I"/usr/local/include/eigen3/" -I"${MKLROOT}/include" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -lbenchmark -static-libstdc++ -msse4.2
```
Regards,
Vidya.
---
I'm so excited for your success, Vidya! The code works on your machine! At this point, it may be time to research the situation. What if we both start reading this thread from the beginning to find the CPU and OS specifications?
Just in case, my compilation flags:
**CC**
```
-Wall -Wno-unknown-pragmas -mavx2 -O3 -DNDEBUG -fopenmp -std=gnu++17
```
**LINK**
```
-lgomp -lpthread -Wl,-rpath=.../oneapi/mkl/2021.4.0/lib/intel64 .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.so .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.so .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so .../oneapi/compiler/latest/linux/compiler/lib/intel64/libiomp5.so -lm -ldl -lpthread -pthread -lrt
```
The flags `-DMKL_ILP64 -m64` and `-Wl,--no-as-needed` didn't change anything.
---
Any new ideas?
---
Hi Anton,
>>What if we both start reading this thread from the beginning to find the CPU and OS specifications?
I apologize for the delay and I appreciate your patience.
It took me a while to find a CentOS machine, set up the environment, and install the dependencies to test the code. Here are the results.
Output:
```
omp_blk time: 1025ms
Eigen time: 453ms
MKL sgemm time: 50ms
```
Even here, MKL performs better than the others.
CPU model: Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz
OS: CentOS 8. (I see you are on CentOS 7, but support for CentOS* 7 is deprecated as of this release, Intel oneAPI 2022.1, and will be removed in a future release; see https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html)
Compilation command used:
```
g++ test.cpp -Wall -Wno-unknown-pragmas -mavx2 -DNDEBUG -std=gnu++17 -O3 -fopenmp -DMKL_ILP64 -m64 -I"/usr/local/include/eigen3/" -I"/home/administrator/vidya/benchmark/include/" -I"${MKLROOT}/include" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -L"./benchmark/build/src" -lbenchmark
```
g++ --version reports (GCC) 8.5.0 20210514.
The results in my previous post were measured on Ubuntu 18.04.6 with an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz.
Regards,
Vidya.
---
Great. Things started moving. At least I have some hope now. OS is not quite relevant (I appreciate you found CentOS anyway). Your CPU is much better though.
To match things up, I reran the task on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz.
I also tuned the block size parameter; it is now 32.
My output:
```
omp_blk(bs32) 41 ms
Eigen 15 ms
MKL sgemm 42 ms
```
The MKL time is quite similar to yours, but my Eigen is still way better and MKL is still the slowest.
I tried adjusting the MKL_ENABLE_INSTRUCTIONS variable; it didn't help. I also increased n to 2048 for the runs below.
CPU model: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
**AVX2**
```
$ MKL_ENABLE_INSTRUCTIONS=AVX2 ./build/Release/perfdemo
perfbench for n = 2048
OpenMP Tile32 139 ms
Eigen 91 ms
MKL sgemm 456 ms
```
**AVX512**
```
$ MKL_ENABLE_INSTRUCTIONS=AVX512 ./build/Release/perfdemo
perfbench for n = 2048
OpenMP Tile32 137 ms
Eigen 108 ms
MKL sgemm 273 ms
```
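(Aside: a minimal sketch, not from the original posts, of asking MKL itself which build and thread count are actually in play, using the documented service functions `mkl_get_version_string`, `mkl_get_max_threads`, and `mkl_enable_instructions`; the last of these only takes effect if it is the first MKL call in the process.)

```cpp
// Dispatch/version probe -- a sketch, separate from the benchmark code above.
#include <cstdio>
#include <mkl.h>

int main() {
    // Must be called before any other MKL function; returns 1 if honored.
    int ok = mkl_enable_instructions(MKL_ENABLE_AVX2);

    char buf[256];
    mkl_get_version_string(buf, sizeof(buf));        // library build string
    std::printf("%s\n", buf);
    std::printf("MKL max threads: %d\n", mkl_get_max_threads());
    std::printf("AVX2 request honored: %d\n", ok);
    return 0;
}
```

Comparing the printed version string with the `ldd` output from the first post would confirm which library the rpath actually resolves to at run time.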
Eigen's time seems pretty noisy, varying by up to 5-10x. What if you run the binary a few times? Does the Eigen time change?
I'm totally puzzled. What's going on?
---
Hi Anton,
Thanks for getting back to us.
This time I changed the n value in perfbench to 2048 and set bs to 32 (please let me know if there is any mistake here).
>>What if you run the binary a few times? Will the eigen time change?
Sure, this is how I executed it:
```
for i in {1..20}; do ./a.out $i; done > out.txt
```
Please find the attached file out.txt to see the output.
>>I'm totally puzzled. What's going on?
I tried it using MKL 2022.1.0 and could not reproduce the timings you are getting (I guess the only difference between our environments is the MKL version being used, as everything else is almost identical). You can give the latest version, 2022.2.0, now available for download, a try and let us know if the issue still persists.
Regards,
Vidya.
---
Hi Anton,
As we haven't heard back from you, could you please provide us with an update on the issue? Please let us know whether you still observe the same timings with the latest oneMKL version, 2022.2.0.
Regards,
Vidya.
---
Hi Vidya,
On my workstation I cannot update MKL, and installing everything locally would take time I need elsewhere. You cannot use an older MKL version either. So I'm postponing the investigation until the deployment team updates my libs.
My results are quite consistent though.
Cheers,
Anton
---
Hi Anton,
>>Therefore, I'm postponing the investigation and waiting for the deployment team to update my libs later.
Could you please let us know if you (or your company or institution) have priority support? If yes, we would recommend you post the issue at https://supporttickets.intel.com/servicecenter?lang=en-US
If not, as per your request we can postpone it and close this thread for now.
Please do let us know.
Regards,
Vidya.
---
OK. Let's close it.
---
Hi Anton,
>>OK. Let's close it.
Thanks for the confirmation!
We are closing this thread for now. Please post a new question if you need any additional assistance from Intel, as this thread will no longer be monitored.
Regards,
Vidya.
