Intel® oneAPI Math Kernel Library

Profiling MKL on Intel CPU shows sudden boost in performance

Richard_S_7
Beginner

I am benchmarking the Intel MKL sgemm routine for input size M=10, N=500, K=64 on a dual-socket system with two Intel Xeon Gold 6140 CPUs. I have noticed that the runtimes of a few consecutive sgemm calls are sometimes significantly lower than those of the calls before them. I made sure that no other programs were running while benchmarking. I have plotted the runtimes of 200 evaluations; the drop is visible around the 90th evaluation and again around the 180th:

line_mkl_200.png

Can you help me identify what causes the sudden drop in runtime in the plot? This behavior makes it difficult for me to compare MKL's sgemm with other implementations, because I can never be sure whether such a drop has occurred during a measurement.

Many thanks in advance!

GouthamK_Intel
Moderator

Hi,

Could you please provide the source code you are using for benchmarking? Are you using Fortran or C?

Also, could you please provide the following details, which will help us investigate:

  • System configuration (OS version)
  • Memory configuration (RAM, cache)
  • BLAS version
  • Compiler version


Thanks & Regards

Goutham


Richard_S_7
Beginner

Hi Goutham,

Many thanks for your reply. I am using the following C++ code for benchmarking:

#include <mkl.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <chrono>

int main(int argc, const char **argv) {
    // input size
    const int M = 10;
    const int N = 500;
    const int K = 64;

    // initialize data
    auto a = (float*) mkl_malloc(M * K * sizeof(float), 64); for (int i = 0; i < M * K; ++i) a[i] = (i + 1) % 10;
    auto b = (float*) mkl_malloc(K * N * sizeof(float), 64); for (int i = 0; i < K * N; ++i) b[i] = (i + 1) % 10;
    auto c = (float*) mkl_malloc(M * N * sizeof(float), 64); for (int i = 0; i < M * N; ++i) c[i] = 0;

    // warm ups
    for (int i = 0; i < 10; ++i) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,
                    a, K,
                    b, N,
                    0.0f,
                    c, N);
    }

    // evaluations
    std::vector<long long> evaluations;
    for (int i = 0; i < 2000; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,
                    a, K,
                    b, N,
                    0.0f,
                    c, N);
        auto end = std::chrono::high_resolution_clock::now();
        evaluations.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    }
    std::cout << "min runtime: " << *std::min_element(evaluations.begin(), evaluations.end()) << "ns" << std::endl;

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
}

 

This is my system:

  • OS: CentOS Linux release 7.8.2003
  • RAM: 192GB
  • Compiler: icpc (ICC) 19.0.1.144 20181018
  • BLAS: MKL 2020 Initial Release

Best,
Richard

GouthamK_Intel
Moderator

Hi Richard,

Thanks for providing the source code and system environment details.

We tried to reproduce this on our end, using slightly different hardware running Ubuntu 18.04 but the same compiler and MKL versions. We did not see any such sudden drops or spikes.

However, we are escalating this thread to Subject Matter Experts who will guide you further.

Have a Good day!


Thanks & Regards

Goutham


Gennady_F_Intel
Moderator

If you run this code again, do you see the same picture?


Richard_S_7
Beginner

Hi Gennady,

Sorry for the delay. Yes, the result is still the same.

Best,
Richard

Gennady_F_Intel
Moderator

Could you try setting the KMP_AFFINITY environment variable as follows:

export KMP_AFFINITY=compact,1,0,granularity=fine


This affinity setting usually helps to get the best performance on systems with multi-core processors, because it binds each thread to a CPU core through an affinity mask and thereby prevents threads from migrating from core to core.




Richard_S_7
Beginner

Many thanks for still trying to solve this problem!

I tried setting the affinity mask and it does indeed seem to improve the measurement:

runtimes_10x500x64_kmp_affinity.png

It is now fairly stable for the first 200 evaluations measured as above, but I noticed that the runtime still drops off a bit at the end. Do you have any insight into what could be causing this?

Gennady_F_Intel
Moderator

I think it is caused by OS fluctuations.


Gennady_F_Intel
Moderator

This issue is being closed, and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

