topic Re:Why BLAS SGEMM is slow? in Intel® oneAPI Math Kernel Library

Why BLAS SGEMM is slow?

AntK — Thu, 15 Sep 2022 13:28:47 GMT

I'm measuring matrix multiplication performance for three approaches: a naive blocked OpenMP implementation, Eigen, and SGEMM from MKL 2021.4.0. For simplicity, all matrices are square, of type `float`, size `n x n`, and aligned to 64 bytes. The compiler is `GCC 8.3.1` with the flags `-msse4.2 -O3 -fopenmp`. The OS is `CentOS 7`.

I don't understand why MKL SGEMM is the slowest. Why is a naive OpenMP implementation faster than a fancy-optimized library?

**Blocked OpenMP (`BS = n / 64`):**

#pragma omp parallel for
#pragma vector aligned
for(int i=0; i<n; i++)
    for(int j=0; j<n; j++)
        C[i*n+j] *= beta;

#pragma omp parallel for
#pragma vector aligned
for (int i = 0; i < n; i+=BS)
    for (int k = 0; k < n; k+=BS)
        for (int j = 0; j < n; j+=BS)
            for (int ii = i; ii < i+BS; ii++)
                for (int kk = k; kk < k+BS; kk++)
                    for (int jj = j; jj < j+BS; jj++)
                        C[ii*n+jj] += alpha*A[ii*n+kk]*B[kk*n+jj];
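The loop nest above relies on `n` being a multiple of `BS`; otherwise the inner loops would run past the matrix edge. It also scales `C` by `beta` even when `beta` is zero, so a NaN already stored in `C` would survive, whereas BLAS treats `beta == 0` as "overwrite C". A minimal serial sketch of the same tiling with both points handled is given below. The name `matmul_blocked`, the `std::vector` interface and the tile clamping are illustrative assumptions, not part of the benchmarked code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the benchmark): C = alpha*A*B + beta*C for
// row-major n x n matrices, tiled with block size BS.
void matmul_blocked(int n, int BS, float alpha, float beta,
                    const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C) {
    assert(n > 0 && BS > 0);
    const std::size_t N = static_cast<std::size_t>(n);
    const std::size_t bs = static_cast<std::size_t>(BS);
    assert(A.size() == N * N && B.size() == N * N && C.size() == N * N);

    // BLAS convention: when beta is zero, C is overwritten rather than scaled,
    // so NaN or Inf values already in C cannot leak into the result.
    for (float& c : C) c = (beta == 0.0f) ? 0.0f : beta * c;

    for (std::size_t i = 0; i < N; i += bs)
        for (std::size_t k = 0; k < N; k += bs)
            for (std::size_t j = 0; j < N; j += bs) {
                // Clamp each tile so N need not be a multiple of BS.
                const std::size_t iEnd = std::min(i + bs, N);
                const std::size_t kEnd = std::min(k + bs, N);
                const std::size_t jEnd = std::min(j + bs, N);
                for (std::size_t ii = i; ii < iEnd; ++ii)
                    for (std::size_t kk = k; kk < kEnd; ++kk) {
                        const float a = alpha * A[ii * N + kk];
                        for (std::size_t jj = j; jj < jEnd; ++jj)
                            C[ii * N + jj] += a * B[kk * N + jj];
                    }
            }
}
```

To parallelise this sketch the same way as the original, `#pragma omp parallel for` can be placed on the outermost `i` loop: each thread then owns a disjoint band of rows of `C`, so no two threads write the same element.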

**Eigen**

Eigen::Map<const Eigen::MatrixXf> AM(A, n, n);
Eigen::Map<const Eigen::MatrixXf> BM(B, n, n);
Eigen::Map<Eigen::MatrixXf> CM(C, n, n);
CM.noalias() = beta*CM + alpha*(BM * AM); // fortran order!
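The `// fortran order!` comment refers to `Eigen::MatrixXf` being column-major by default. A row-major buffer read as column-major is the transpose of the intended matrix, so `BM * AM` computes B^T·A^T = (A·B)^T, and writing that into `C` in column-major order leaves the row-major product A·B in memory. The sketch below checks this identity with plain loops and no Eigen dependency; the function names `matmul_row_major` and `matmul_col_major` are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major product: element (i, j) is stored at index i*n + j.
std::vector<float> matmul_row_major(std::size_t n,
                                    const std::vector<float>& A,
                                    const std::vector<float>& B) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// Column-major product Z = X * Y: element (r, c) is stored at index c*n + r.
std::vector<float> matmul_col_major(std::size_t n,
                                    const std::vector<float>& X,
                                    const std::vector<float>& Y) {
    assert(X.size() == n * n && Y.size() == n * n);
    std::vector<float> Z(n * n, 0.0f);
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            for (std::size_t k = 0; k < n; ++k)
                Z[c * n + r] += X[k * n + r] * Y[c * n + k];
    return Z;
}
```

Calling `matmul_col_major(n, B, A)` on the same buffers yields the same memory contents as `matmul_row_major(n, A, B)`, which is why the operands are swapped in the Eigen expression.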

**MKL SGEMM**

cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, n, n, alpha, A, n, B, n, beta, C, n);

The google benchmark results for Intel Xeon Silver 4114, 2 sockets, 2 NUMA nodes:

Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
MatMul/OmpBlk/4096/64/real_time       1132 ms         1038 ms            1
MatMul/OmpBlk/16384/64/real_time     83668 ms        80612 ms            1
MatMul/OmpBlk/32768/64/real_time   1562980 ms      1492184 ms            1
MatMul/Eigen/4096/real_time            878 ms          867 ms            1
MatMul/Eigen/16384/real_time         36140 ms        31629 ms            1
MatMul/Eigen/32768/real_time        259762 ms       246788 ms            1
MatMul/Blas/4096/real_time            4091 ms         3719 ms            1
MatMul/Blas/16384/real_time         219940 ms       219581 ms            1
MatMul/Blas/32768/real_time        1773874 ms      1750015 ms            1

**ldd snippet:**

libmkl_intel_ilp64.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.so.1
libmkl_core.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.so.1
libmkl_intel_thread.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so.1
libiomp5.so => /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64/libiomp5.so

(code formatting apparently doesn't work)

Re:Why BLAS SGEMM is slow?

VidyalathaB_Intel — Fri, 16 Sep 2022 09:41:35 GMT

Hi Anton,

Thanks for reaching out to us.

>>SGEMM from MKL 2021.4.0......I don't understand why MKL SGEMM is the slowest

Could you please try the latest MKL version which is 2022.1.0 and see if there is any improvement?

It would be a great help if you provide us with the complete sample reproducer code along with steps to reproduce the issue and the output of the MKL_VERBOSE variable (usage: export MKL_VERBOSE=1 before running the executable) so that we can check this issue from our end as well.

Regards,

Vidya.

Re: Why BLAS SGEMM is slow?

AntK — Sun, 18 Sep 2022 08:13:52 GMT

Hi VidyalathaB,

Thank you for your kind answer. Unfortunately, I cannot change the MKL version. The code snippets contain all the information and logic you may be interested in. If it's not obvious how to work out the call signatures, I'm always glad to help. Here they are:

void matmat_mul_*(int n, const float* A_mat, const float* B_mat, float* C_out);

This code will help to allocate the chunk of memory for the call:

float *A=(float*)_mm_malloc(n*n*sizeof(float), 64);

And this is to free the chunk:

_mm_free(A);

Please don't forget to restart the computer before running the experiments.

This is the MKL_VERBOSE console output for n = 1024:

MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.20GHz ilp64 intel_thread
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 137.42ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 100.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 99.37ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 93.22ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20

And this is for n = 4096:

MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.20GHz ilp64 intel_thread
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 4.34s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 4.19s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 3.85s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 3.73s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
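To put these timings in perspective, a dense n x n SGEMM performs about 2·n³ floating-point operations (one multiply and one add per inner-product term), so a run time converts directly into a throughput figure. The helper below does that arithmetic; its name `gemm_gflops` is an illustrative assumption and it is not part of the reproducer.

```cpp
#include <cassert>

// Hypothetical helper: throughput in GFLOP/s of an n x n x n GEMM that took
// `seconds`, counting 2*n^3 floating-point operations.
double gemm_gflops(long long n, double seconds) {
    assert(n > 0 && seconds > 0.0);
    const double flops = 2.0 * static_cast<double>(n) *
                         static_cast<double>(n) * static_cast<double>(n);
    return flops / seconds / 1e9;
}
```

Applied to the last lines of each log above, `gemm_gflops(1024, 0.09322)` gives roughly 23 GFLOP/s and `gemm_gflops(4096, 3.73)` gives roughly 37 GFLOP/s, both across the 20 threads reported as `NThr:20`.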

Re:Why BLAS SGEMM is slow?

VidyalathaB_Intel — Tue, 20 Sep 2022 06:24:33 GMT

Hi Anton,

Thanks for providing the verbose output.

Could you please attach the complete code here in the forum so that it would help us to do a quick check from our end?

If you do not want to post it here please let us know so that we can contact you privately.

Regards,

Vidya.

Re: Why BLAS SGEMM is slow?

AntK — Tue, 20 Sep 2022 16:57:34 GMT

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

#include <omp.h>
#include <mkl.h>
#include <immintrin.h>

#include <Eigen/Dense>

#include <benchmark/benchmark.h>

#define USECPSEC 1000000ULL

// Wall-clock time in microseconds, minus `start`.
unsigned long long dtime_usec(unsigned long long start=0) {
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int perfcheck();

void init_data(int n, float* A_mat, float* B_mat, float seed) {
    float* A = A_mat;
    float* B = B_mat;

    // A[i][j] = i/n and B[i][j] = j/n, so (A*B)[i][j] = i*j/n; `seed` is unused.
    for ( int i = 0 ; i < n ; i++) {
        for ( int j = 0 ; j < n ; j++) {
            A[i*n+j]=(float)i/(float)n;
            B[i*n+j]=(float)j/(float)n;
        }
    }
}

void verify_res(int n, const float* C1, const float* C2, int ncase)
{
    float norm = 0.0;
    const float* C = C2;
    float rtol=1e-04, atol=1e-05;
    for (int i = 0 ; i < n ; i++)
        for (int j = 0 ; j < n ; j++)
        {
            // Squared error of C2 against the analytic result (A*B)[i][j] = i*j/n
            norm += (C[i*n+j]-(float)(i*j)/(float)n)*(C[i*n+j]-(float)(i*j)/(float)n);

            // Element-wise comparison of the two implementations
            if (fabsf(C1[i*n+j]-C2[i*n+j]) > atol + rtol * fabsf(C1[i*n+j])) {
                printf("Error in (%d, %d)\n", i, j);
                printf("%d - C1: %f C2: %f\n", ncase, C1[i*n+j], C2[i*n+j]);
                throw 1;
            }
        }

    if (norm > 1e-8)
    {
        printf("Error: %f\n", norm);
        throw 1;
    }
}

void matmat_mul(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = &A_mat[0];
    const float* B = &B_mat[0];
    float* C = &C_out[0];

    for ( int i = 0 ; i < n ; i++) {
        for ( int j = 0 ; j < n ; j++) {
            C[i*n+j] *= beta;
            for (int k = 0 ; k < n ; k++) {
                C[i*n+j] += alpha*A[i*n+k]*B[k*n+j];
            }
        }
    }
}

void matmat_mul_simd(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);

    __m128 alpha4 = _mm_set1_ps(alpha);
    __m128 beta4 = _mm_set1_ps(beta);

    // Scale C by beta, four floats at a time (n must be a multiple of 4)
    for(int i=0; i<n; i++) {
        for(int j=0; j<n; j+=4) {
            __m128 c4 = _mm_load_ps(&C[i*n+j]);
            c4 = _mm_mul_ps(beta4,c4);
            _mm_store_ps(&C[i*n+j], c4);
        }
    }

    // i-k-j order: broadcast alpha*A[i][k] and stream over row k of B
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            __m128 a4 = _mm_set1_ps(A[i*n+k]);
            a4 = _mm_mul_ps(alpha4,a4);
            for(int j=0; j<n; j+=4) {
                __m128 c4 = _mm_load_ps(&C[i*n+j]);
                __m128 b4 = _mm_load_ps(&B[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
                _mm_store_ps(&C[i*n+j], c4);
            }
        }
    }
}

void matmat_mul_omp(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);

    #pragma omp parallel for schedule(dynamic)
    #pragma vector aligned
    for (int i = 0 ; i < n ; i++) {
        for (int j = 0 ; j < n ; j++) {
            // tmpSum is declared inside the parallel loop, so each thread has its own copy
            float tmpSum = 0.0;
            #pragma GCC unroll 8
            for (int k = 0 ; k < n ; k++) {
                tmpSum += A[i*n+k]*B[k*n+j];
            }
            C[i*n+j] = beta * C[i*n+j] + alpha * tmpSum;
        }
    }
}

void matmat_mul_simd_omp(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);

    __m128 alpha4 = _mm_set1_ps(alpha);
    __m128 beta4 = _mm_set1_ps(beta);

    #pragma omp parallel for collapse(2)
    for(int i=0; i<n; i++) {
        for(int j=0; j<n; j+=4) {
            __m128 c4 = _mm_load_ps(&C[i*n+j]);
            c4 = _mm_mul_ps(beta4,c4);
            _mm_store_ps(&C[i*n+j], c4);
        }
    }

    #pragma omp parallel for schedule(dynamic)
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            __m128 a4 = _mm_set1_ps(A[i*n+k]);
            a4 = _mm_mul_ps(alpha4,a4);
            for(int j=0; j<n; j+=4) {
                __m128 c4 = _mm_load_ps(&C[i*n+j]);
                __m128 b4 = _mm_load_ps(&B[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
                _mm_store_ps(&C[i*n+j], c4);
            }
        }
    }
}

void matmat_mul_omp_blk(int n, int BS, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);

    #pragma omp parallel for collapse(2)
    #pragma vector aligned
    for(int i=0; i<n; i++)
        for(int j=0; j<n; j++)
            C[i*n+j] *= beta;

    // Tiled i-k-j loop nest; n is assumed to be a multiple of BS
    #pragma omp parallel for schedule(dynamic)
    #pragma vector aligned
    for (int i = 0; i < n; i+=BS)
        for (int k = 0; k < n; k+=BS)
            for (int j = 0; j < n; j+=BS)
                for (int ii = i; ii < i+BS; ii++)
                    for (int kk = k; kk < k+BS; kk++)
                        for (int jj = j; jj < j+BS; jj++)
                            C[ii*n+jj] += alpha*A[ii*n+kk]*B[kk*n+jj];
}

void matmat_mul_eigen(int n, const float* A_mat, const float* B_mat, float* C_out) {
    // C = alpha * A x B + beta * C
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);

    // "The best code is the code I don't have to write"
    Eigen::Map<const Eigen::MatrixXf> AM(A, n, n);
    Eigen::Map<const Eigen::MatrixXf> BM(B, n, n);
    Eigen::Map<Eigen::MatrixXf> CM(C, n, n);
    CM.noalias() = beta*CM + alpha*(BM * AM); // fortran order!
}

void matmat_mul_sgemm(const int n, float* A_mat, float* B_mat, float* C_out) {
    float alpha = 1.0, beta = 0.0;
    const float* A = (const float*)__builtin_assume_aligned(&A_mat[0], 64);
    const float* B = (const float*)__builtin_assume_aligned(&B_mat[0], 64);
    float* C = (float*) __builtin_assume_aligned(&C_out[0], 64);

    // "The best code is the code I don't have to write"
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, A, n, B, n, beta, C, n);
}

class MatMul : public benchmark::Fixture {
protected:
    int i=0;
    int n;
    float* b;
    float* A, * B, * C;
public:
    void SetUp(const ::benchmark::State& state) {
        n = state.range(0);

        A=(float*)_mm_malloc(n*n*sizeof(float), 64);
        B=(float*)_mm_malloc(n*n*sizeof(float), 64);
        C=(float*)_mm_malloc(n*n*sizeof(float), 64);

        init_data(n, A, B, (float)time(NULL));
    }

    void TearDown(const ::benchmark::State& state) {
        _mm_free(A);
        _mm_free(B);
        _mm_free(C);
    }
};

BENCHMARK_DEFINE_F(MatMul, Verify)(benchmark::State& st) {
    n = 16;
    float* A2=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* B2=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C1=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C2=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C3=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C4=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C5=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C6=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C7=(float*)_mm_malloc(n*n*sizeof(float), 64);

    init_data(n, A2, B2, (float)time(NULL));

    for (auto _ : st) {
        matmat_mul(n, A2, B2, C1);
        matmat_mul_omp(n, A2, B2, C2);
        matmat_mul_simd(n, A2, B2, C3);
        matmat_mul_omp_blk(n, 4, A2, B2, C4);
        matmat_mul_simd_omp(n, A2, B2, C5);
        matmat_mul_eigen(n, A2, B2, C6);
        matmat_mul_sgemm(n, A2, B2, C7);
    }

    // Compare each implementation with the next one in the chain
    verify_res(n, C1, C2, 1);
    verify_res(n, C2, C3, 2);
    verify_res(n, C3, C4, 3);
    verify_res(n, C4, C5, 4);
    verify_res(n, C5, C6, 5);
    verify_res(n, C6, C7, 6);

    _mm_free(A2);
    _mm_free(B2);
    _mm_free(C1);
    _mm_free(C2);
    _mm_free(C3);
    _mm_free(C4);
    _mm_free(C5);
    _mm_free(C6);
    _mm_free(C7);
}

BENCHMARK_REGISTER_F(MatMul, Verify)
    ->Unit(benchmark::kMillisecond)
    ->Arg(8)
    ->UseRealTime();

BENCHMARK_DEFINE_F(MatMul, SingleThread)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, Simd)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_simd(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, Omp)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_omp(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, SimdOmp)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_simd_omp(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, OmpBlk)(benchmark::State& st) {
    // The second benchmark argument is the number of blocks per dimension
    int bs = n / st.range(1);
    for (auto _ : st) {
        matmat_mul_omp_blk(n, bs, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, Eign)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_eigen(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}
BENCHMARK_DEFINE_F(MatMul, MklBlas)(benchmark::State& st) {
    for (auto _ : st) {
        matmat_mul_sgemm(n, A, B, C);
        benchmark::DoNotOptimize(C);
        benchmark::ClobberMemory();
    }
}

int perfbench(int n) {
    float* A=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* B=(float*)_mm_malloc(n*n*sizeof(float), 64);
    float* C=(float*)_mm_malloc(n*n*sizeof(float), 64);

    // Fill A and B so the kernels do not read uninitialized memory
    init_data(n, A, B, 0.0f);

    unsigned long long dt = 0;
    int nrepeats = 3;
    std::vector<unsigned long long> times(nrepeats);

    int bs = 128;
    matmat_mul_omp_blk(n, bs, A, B, C); // warm up
    times.clear(); times.resize(nrepeats);
    for (int i = 0; i < nrepeats; i++)
    {
        dt = dtime_usec(0);
        matmat_mul_omp_blk(n, bs, A, B, C);
        times[i] = dtime_usec(dt);
    }
    // Mean over the repeats, converted from microseconds to milliseconds
    dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;
    std::cout << "omp_blk time: " << dt << "ms" << std::endl;

    matmat_mul_eigen(n, A, B, C); // warm up
    times.clear(); times.resize(nrepeats);
    for (int i = 0; i < nrepeats; i++)
    {
        dt = dtime_usec(0);
        matmat_mul_eigen(n, A, B, C);
        times[i] = dtime_usec(dt);
    }
    dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;
    std::cout << "Eigen time: " << dt << "ms" << std::endl;

    matmat_mul_sgemm(n, A, B, C); // warm up
    times.clear(); times.resize(nrepeats);
    for (int i = 0; i < nrepeats; i++)
    {
        dt = dtime_usec(0);
        matmat_mul_sgemm(n, A, B, C);
        times[i] = dtime_usec(dt);
    }
    dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;
    std::cout << "MKL sgemm time: " << dt << "ms" << std::endl;

    _mm_free(A);
    _mm_free(B);
    _mm_free(C);

    return 0;
}

#if 1

int main()
{
    const int n = 4*1024;
    perfbench(n);
    // perfcheck();
}

#else

int from = 512; // 1 MB
// int to = 2048; // 2k * 2k * 8 = 32M
int to = 32*1024;
int mult = 8;
int step = 512;

BENCHMARK_REGISTER_F(MatMul, SingleThread)
    ->Unit(benchmark::kMillisecond)
    ->DenseRange(from, to, step)
    ->UseRealTime();

BENCHMARK_REGISTER_F(MatMul, Simd)
    ->Unit(benchmark::kMillisecond)
    ->RangeMultiplier(mult)->Range(from, to)
    ->UseRealTime();

BENCHMARK_REGISTER_F(MatMul, Omp)
    ->Unit(benchmark::kMillisecond)
    ->RangeMultiplier(mult)->Range(from, to)
    ->UseRealTime();

BENCHMARK_REGISTER_F(MatMul, SimdOmp)
    ->Unit(benchmark::kMillisecond)
    ->RangeMultiplier(mult)->Range(from, to)
    ->UseRealTime();

BENCHMARK_REGISTER_F(MatMul, OmpBlk)
    ->Unit(benchmark::kMillisecond)
    ->ArgsProduct({
        benchmark::CreateRange(from, to, mult),
        benchmark::CreateRange(64, 256, /*multi=*/4)
    })
    ->UseRealTime();

BENCHMARK_REGISTER_F(MatMul, Eign)
    ->Unit(benchmark::kMillisecond)
    ->RangeMultiplier(mult)->Range(from, to)
    ->UseRealTime();

BENCHMARK_REGISTER_F(MatMul, MklBlas)
    ->Unit(benchmark::kMillisecond)
    ->RangeMultiplier(mult)->Range(from, to)
    ->UseRealTime();

BENCHMARK_MAIN();

#endif

Re:Why BLAS SGEMM is slow?

VidyalathaB_Intel — Wed, 21 Sep 2022 08:28:27 GMT

Hi Anton,

Thanks for sharing the reproducer here.

I tried running the code, and here is the output I'm getting, which shows that MKL takes the least time:

omp_blk time: 1179ms

Eigen time: 535ms

MKL sgemm time: 172ms

Command used:

g++ main.cpp -O3 -fopenmp -DMKL_ILP64 -m64 -I"/usr/local/include/eigen3/" -I"${MKLROOT}/include" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -lbenchmark -static-libstdc++ -msse4.2

Regards,

Vidya.

Re: Re:Why BLAS SGEMM is slow?

AntK — Thu, 22 Sep 2022 22:28:22 GMT

I'm so excited for your success, Vidya! The code works on your machine! At this point, it may be time to research the situation. What if we both start reading this thread from the beginning to find the CPU and OS specifications?

Just in case, here are my compilation flags:

CC
-Wall -Wno-unknown-pragmas -mavx2 -O3 -DNDEBUG -fopenmp -std=gnu++17

LINK

-lgomp -lpthread -Wl,-rpath=.../oneapi/mkl/2021.4.0/lib/intel64 .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.so .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.so .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so .../oneapi/compiler/latest/linux/compiler/lib/intel64/libiomp5.so -lm -ldl -lpthread -pthread -lrt

Flags -DMKL_ILP64 -m64 and -Wl,--no-as-needed didn't change anything.

Re: Why BLAS SGEMM is slow?

AntK — Wed, 28 Sep 2022 19:13:35 GMT

Any new ideas?

Re:Why BLAS SGEMM is slow?

VidyalathaB_Intel — Thu, 29 Sep 2022 11:38:18 GMT

Hi Anton,

>>What if we both start reading this thread from the beginning to find the CPU and OS specifications?

I apologize for the delay and I appreciate your patience.

It took me a while to find a CentOS machine, set up the environment, and install the dependencies to test the code. Here are the results:

Output:

omp_blk time: 1025ms

Eigen time: 453ms

MKL sgemm time: 50ms

Even here, MKL performs better than the others.

CPU Model:

Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz

CentOS 8 (I can see that you are running CentOS 7, but support for CentOS* 7 is deprecated in this release, Intel oneAPI 2022.1, and will be removed in a future release. Refer to: https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html)

Compilation command used:

g++ test.cpp -Wall -Wno-unknown-pragmas -mavx2 -DNDEBUG -std=gnu++17 -O3 -fopenmp -DMKL_ILP64 -m64 -I"/usr/local/include/eigen3/" -I"/home/administrator/vidya/benchmark/include/" -I"${MKLROOT}/include" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -L"./benchmark/build/src" -lbenchmark

g++ --version > (GCC) 8.5.0 20210514

The results in my previous post were obtained on Ubuntu 18.04.6 with an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz.

Regards,

Vidya.

Re: Why BLAS SGEMM is slow?

AntK — Thu, 29 Sep 2022 22:20:12 GMT

Great. Things started moving. At least I have some hope now. OS is not quite relevant (I appreciate you found CentOS anyway). Your CPU is much better though.

To match things up, I reran the task on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz.

I also tuned the block size parameter; it's now 32.

My output:

omp_blk(bs32) 41 ms
Eigen 15 ms
MKL sgemm 42 ms

The MKL time is quite similar to yours, but my Eigen is still way better and MKL is still the slowest.

I tried adjusting the `MKL_ENABLE_INSTRUCTIONS` variable, but it didn't help. I also increased n to 2048.

CPU Model:

Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz

AVX2

$ MKL_ENABLE_INSTRUCTIONS=AVX2 ./build/Release/perfdemo
perfbench for n = 2048
OpenMP Tile32 139 ms
Eigen 91 ms
MKL sgemm 456 ms

AVX512

$ MKL_ENABLE_INSTRUCTIONS=AVX512 ./build/Release/perfdemo
perfbench for n = 2048
OpenMP Tile32 137 ms
Eigen 108 ms
MKL sgemm 273 ms

The Eigen time seems pretty noisy, varying by up to 5-10x. What if you run the binary a few times? Will the Eigen time change?

I'm totally puzzled. What's going on?
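When individual runs are this noisy, a mean over three repeats can be dominated by a single outlier. One common alternative is to repeat each measurement more often and report the minimum and median. The sketch below does this with `std::chrono::steady_clock`; `TimingSummary` and `time_runs` are hypothetical names, and the callable stands in for any of the `matmat_mul_*` functions from the reproducer.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical summary of repeated timings, all in milliseconds.
struct TimingSummary {
    double min_ms;
    double median_ms;
    double max_ms;
};

// Runs `kernel` `warmups` times untimed, then `repeats` times timed.
TimingSummary time_runs(const std::function<void()>& kernel,
                        int repeats, int warmups = 1) {
    assert(repeats > 0 && warmups >= 0);
    for (int i = 0; i < warmups; ++i) kernel();

    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        kernel();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    std::sort(ms.begin(), ms.end());
    const std::size_t mid = ms.size() / 2;
    const double median = (ms.size() % 2 == 1)
                              ? ms[mid]
                              : 0.5 * (ms[mid - 1] + ms[mid]);
    return TimingSummary{ms.front(), median, ms.back()};
}
```

A call such as `time_runs([&] { matmat_mul_eigen(n, A, B, C); }, 11)` would then give a spread for each kernel, which makes it easier to tell a consistently slow library call from an occasional slow run.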

Re: Why BLAS SGEMM is slow?

VidyalathaB_Intel — Fri, 30 Sep 2022 10:47:49 GMT

Hi Anton,

Thanks for getting back to us.

This time I changed the n value in perfbench to 2048 and set bs to 32 (please let me know if there is any mistake here).

>>What if you run the binary a few times? Will the eigen time change?

Sure, this is how I ran it:

for i in {1..20}; do ./a.out $i; done > out.txt

Please find the attached file out.txt to see the output.

>>I'm totally puzzled. What's going on?

I tried it using MKL 2022.1.0 and could not reproduce the timings that you are getting (I guess the only difference between our environments is the MKL version, as everything else is almost the same). You can give it a try with the latest version, 2022.2.0, which is now available for download, and let us know if the issue still persists.

Regards,

Vidya.

Re:Why BLAS SGEMM is slow?

VidyalathaB_Intel — Fri, 07 Oct 2022 05:14:25 GMT

Hi Anton,

As we haven't heard back from you, could you please provide us with an update regarding the issue? Please let us know if you still observe the same timings with the latest oneMKL version which is 2022.2.0.

Regards,

Vidya.

Re: Why BLAS SGEMM is slow?

AntK — Fri, 07 Oct 2022 05:20:16 GMT

Hi Vidya,

On my workstation I cannot update MKL and installing everything locally will take too much time, which I need elsewhere. You cannot use an older MKL version either. Therefore, I'm postponing the investigation and waiting for the deployment team to update my libs later.

My results are quite consistent though.

Cheers,

Anton

Re:Why BLAS SGEMM is slow?

VidyalathaB_Intel — Mon, 10 Oct 2022 08:55:38 GMT

Hi Anton,

>>Therefore, I'm postponing the investigation and waiting for the deployment team to update my libs later.

Could you please let us know if you (or your company or institution) have priority support? If yes, we would recommend you post the issue at https://supporttickets.intel.com/servicecenter?lang=en-US

If not, as per your request we can postpone it and close this thread for now.

Please do let us know.

Regards,

Vidya.

Re: Why BLAS SGEMM is slow?

AntK — Mon, 10 Oct 2022 22:31:25 GMT

OK. Let's close it.

Re:Why BLAS SGEMM is slow?

VidyalathaB_Intel — Tue, 11 Oct 2022 04:29:49 GMT

Hi Anton,

>>OK. Let's close it.

Thanks for the confirmation!

We are closing this thread for now. Please post a new question if you need any additional assistance from Intel, as this thread will no longer be monitored.

Regards,

Vidya.