Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.
6956 Discussions

Profiling MKL on Intel CPU shows sudden boost in performance

Richard_S_7
Beginner
2,021 Views

I am benchmarking the Intel MKL sgemm routine for input size M=10, N=500, K=64 on a dual-socket system with 2 Intel Xeon Gold 6140 CPUs and have noticed that sometimes the runtimes of a few consecutive sgemm calls is significantly lower than the calls before. I made sure that no other programs are executed while benchmarking. I have plotted the runtimes of 200 evaluations and you can see the drop around the 90th evaluation and again around the 180th:

line_mkl_200.png

Can you help me identify, what causes the sudden drop in runtime in the plot? This behavior makes it difficult for me to compare MKL's sgemm with other implementations, because I can never be sure if such a drop has happened or not. 

Many thanks in advance!

Labels (1)
0 Kudos
9 Replies
GouthamK_Intel
Moderator
1,999 Views

Hi,

Could you please provide the source code which you are using for benchmarking? Are you using FORTRAN / C?

Also, could you please provide the following details which will help us to investigate better:

System configuration(OS Version),

Memory configuration(RAM, Cache),

BLAS Version,

Compiler Version.


Thanks are Regards

Goutham


0 Kudos
Richard_S_7
Beginner
1,990 Views

Hi Goutham,

many thanks for your reply. I am using the following C++ code for benchmarking:

#include <mkl.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <chrono>

int main(int argc, const char **argv) {
    // input size
    const int M = 10;
    const int N = 500;
    const int K = 64;

    // initialize data
    auto a = (float*) mkl_malloc(M * K * sizeof(float), 64); for (int i = 0; i < M * K; ++i) a[i] = (i + 1) % 10;
    auto b = (float*) mkl_malloc(K * N * sizeof(float), 64); for (int i = 0; i < K * N; ++i) b[i] = (i + 1) % 10;
    auto c = (float*) mkl_malloc(M * N * sizeof(float), 64); for (int i = 0; i < M * N; ++i) c[i] = 0;

    // warm ups
    for (int i = 0; i < 10; ++i) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,
                    a, K,
                    b, N,
                    0.0f,
                    c, N);
    }

    // evaluations
    std::vector<long long> evaluations;
    for (int i = 0; i < 2000; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,
                    a, K,
                    b, N,
                    0.0f,
                    c, N);
        auto end = std::chrono::high_resolution_clock::now();
        evaluations.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    }
    std::cout << "min runtime: " << *std::min_element(evaluations.begin(), evaluations.end()) << "ns" << std::endl;

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
}

 

This is my system:

  • OS: CentOS Linux release 7.8.2003
  • RAM: 192GB
  • Compiler: icpc (ICC) 19.0.1.144 20181018
  • BLAS: MKL 2020 Initial Release

Best,
Richard

0 Kudos
GouthamK_Intel
Moderator
1,943 Views

Hi Richard,

Thanks for providing the source code and system environment details.

We tried to reproduce the same on our end with slightly different hardware on Ubuntu 18.04 OS but with same compiler version and MKL version. We didn't see any such sudden drops and spikes on our end.

However, we are escalating this thread to Subject Matter Experts who will guide you further.

Have a Good day!


Thanks & Regards

Goutham


0 Kudos
Gennady_F_Intel
Moderator
1,926 Views

if you will run this code once again, will you see the same picture?


0 Kudos
Richard_S_7
Beginner
1,872 Views

Hi Gennady,

sorry for the delay. Yes, the result is still the same.

Best,
Richard

0 Kudos
Gennady_F_Intel
Moderator
1,754 Views

Could you try to set the KMP Affinity masks as follows:

export KMP_AFFINITY=compact,1,0,granularity=fine


usually using this affinity help to get the best performance on systems with multi-core processors by requiring that threads do not migrate from core to core. To do this, bind threads to the CPU cores by setting an affinity mask to threads.




0 Kudos
Richard_S_7
Beginner
1,741 Views

Many thanks for still trying to solve this problem!

I tried setting the affinity mask and it does indeed seem to improve the measurement:

runtimes_10x500x64_kmp_affinity.png

It is pretty stable now for the first 200 evaluations as measured above, but I noticed it still drops off a bit at the end. Do you maybe have any insights what could be the cause of this?

0 Kudos
Gennady_F_Intel
Moderator
1,730 Views

I think it's OS's fluctuations.


0 Kudos
Gennady_F_Intel
Moderator
1,716 Views

The issue is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


0 Kudos
Reply