Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.
7230 Discussions

Profiling MKL on Intel CPU shows sudden boost in performance

Richard_S_7
Beginner
3,864 views

I am benchmarking the Intel MKL sgemm routine for input size M=10, N=500, K=64 on a dual-socket system with two Intel Xeon Gold 6140 CPUs. I have noticed that the runtimes of a few consecutive sgemm calls are sometimes significantly lower than those of the calls before them. I made sure that no other programs were running while benchmarking. I plotted the runtimes of 200 evaluations; the drop is visible around the 90th evaluation and again around the 180th:

line_mkl_200.png

Can you help me identify what causes the sudden drop in runtime in the plot? This behavior makes it difficult for me to compare MKL's sgemm with other implementations, because I can never be sure whether such a drop occurred during a measurement.

Many thanks in advance!

9 Replies
GouthamK_Intel
Moderator
3,842 views

Hi,

Could you please provide the source code which you are using for benchmarking? Are you using FORTRAN / C?

Also, could you please provide the following details, which will help us investigate:

  • System configuration (OS version)
  • Memory configuration (RAM, cache)
  • BLAS version
  • Compiler version


Thanks and Regards

Goutham


Richard_S_7
Beginner
3,833 views

Hi Goutham,

Many thanks for your reply. I am using the following C++ code for benchmarking:

#include <mkl.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <chrono>

int main(int argc, const char **argv) {
    // input size
    const int M = 10;
    const int N = 500;
    const int K = 64;

    // initialize data (64-byte aligned buffers, values cycle through 0..9)
    auto a = (float*) mkl_malloc(M * K * sizeof(float), 64);
    for (int i = 0; i < M * K; ++i) a[i] = (i + 1) % 10;
    auto b = (float*) mkl_malloc(K * N * sizeof(float), 64);
    for (int i = 0; i < K * N; ++i) b[i] = (i + 1) % 10;
    auto c = (float*) mkl_malloc(M * N * sizeof(float), 64);
    for (int i = 0; i < M * N; ++i) c[i] = 0;

    // warm ups
    for (int i = 0; i < 10; ++i) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,
                    a, K,
                    b, N,
                    0.0f,
                    c, N);
    }

    // evaluations
    std::vector<long long> evaluations;
    for (int i = 0; i < 2000; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,
                    a, K,
                    b, N,
                    0.0f,
                    c, N);
        auto end = std::chrono::high_resolution_clock::now();
        evaluations.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    }
    std::cout << "min runtime: " << *std::min_element(evaluations.begin(), evaluations.end()) << "ns" << std::endl;

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
}
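
The benchmark above prints only the minimum of the recorded runtimes, which hides whether the measurement moved between a slower and a faster level. A minimal sketch of a slightly richer summary follows, assuming the runtimes are collected in a `std::vector<long long>` as in the `evaluations` vector above. `RuntimeSummary`, `percentile`, and `summarize` are hypothetical helper names introduced here for illustration; they are not part of MKL.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of MKL): summary of runtimes in ns.
struct RuntimeSummary {
    long long min;
    long long median;
    long long p95;
};

// Nearest-rank percentile of an already sorted, non-empty vector; p in [0, 100].
inline long long percentile(const std::vector<long long>& sorted, double p) {
    assert(!sorted.empty());
    const std::size_t n = sorted.size();
    std::size_t rank =
        static_cast<std::size_t>(std::ceil(p / 100.0 * static_cast<double>(n)));
    if (rank < 1) rank = 1;   // clamp so that p = 0 maps to the smallest sample
    if (rank > n) rank = n;   // clamp so that p = 100 maps to the largest sample
    return sorted[rank - 1];
}

// Takes the samples by value so the caller's vector keeps its original order.
inline RuntimeSummary summarize(std::vector<long long> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return {samples.front(), percentile(samples, 50.0), percentile(samples, 95.0)};
}
```

For example, `auto s = summarize(evaluations);` could be called after the evaluation loop and `s.min`, `s.median`, and `s.p95` printed side by side. A large gap between the minimum and the median would suggest that only some of the calls ran at the faster level.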

 

This is my system:

  • OS: CentOS Linux release 7.8.2003
  • RAM: 192GB
  • Compiler: icpc (ICC) 19.0.1.144 20181018
  • BLAS: MKL 2020 Initial Release

Best,
Richard

GouthamK_Intel
Moderator
3,786 views

Hi Richard,

Thanks for providing the source code and system environment details.

We tried to reproduce this on our end on slightly different hardware running Ubuntu 18.04, with the same compiler and MKL versions. We did not see any such sudden drops or spikes.

However, we are escalating this thread to our Subject Matter Experts, who will guide you further.

Have a good day!


Thanks & Regards

Goutham


Gennady_F_Intel
Moderator
3,769 views

If you run this code again, do you see the same pattern?


Richard_S_7
Beginner
3,715 views

Hi Gennady,

Sorry for the delay. Yes, the result is still the same.

Best,
Richard

Gennady_F_Intel
Moderator
3,597 views

Could you try setting the KMP_AFFINITY environment variable as follows:

export KMP_AFFINITY=compact,1,0,granularity=fine


Using this affinity setting usually helps to get the best performance on systems with multi-core processors, because it prevents threads from migrating from core to core. It does so by binding each thread to a CPU core through an affinity mask.
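
`KMP_AFFINITY` is interpreted by the Intel OpenMP runtime, which binds its worker threads for you. To illustrate what binding a thread to a core means at the OS level, here is a minimal Linux-only sketch that pins the calling thread with `sched_setaffinity`. It is not how MKL or the OpenMP runtime implements affinity internally; the function name `pin_to_first_allowed_cpu` is hypothetical, and the code assumes glibc with GNU extensions enabled (the default for g++).

```cpp
#include <cassert>
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)

// Hypothetical helper: pins the calling thread to the first CPU it is
// currently allowed to run on. Returns that CPU index, or -1 on failure.
inline int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    assert(CPU_COUNT(&allowed) > 0);  // a running thread has at least one CPU

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);  // affinity mask containing exactly one CPU
            if (sched_setaffinity(0, sizeof(one), &one) != 0) return -1;
            return cpu;
        }
    }
    return -1;
}
```

Calling this once at the start of a single-threaded benchmark keeps the OS scheduler from moving the thread to another core between measurements. For a threaded MKL run, setting `KMP_AFFINITY` as shown above remains the simpler route.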




Richard_S_7
Beginner
3,584 views

Many thanks for still trying to solve this problem!

I tried setting the affinity mask and it does indeed seem to improve the measurement:

runtimes_10x500x64_kmp_affinity.png

It is now fairly stable for the first 200 evaluations, as measured above, but I noticed that the runtime still drops off a bit at the end. Do you have any insight into what could be causing this?
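
One way to locate such drops in the recorded runtimes, rather than reading them off a plot, is to compare the medians of consecutive windows. The following is a minimal sketch under that assumption; `window_median` and `find_level_shifts` are hypothetical helper names, and the window size and threshold are arbitrary illustrative choices, not recommended values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of samples[begin, end); for even counts this is the upper median.
inline long long window_median(const std::vector<long long>& samples,
                               std::size_t begin, std::size_t end) {
    assert(begin < end && end <= samples.size());
    std::vector<long long> w(samples.begin() + static_cast<std::ptrdiff_t>(begin),
                             samples.begin() + static_cast<std::ptrdiff_t>(end));
    auto mid = w.begin() + static_cast<std::ptrdiff_t>(w.size() / 2);
    std::nth_element(w.begin(), mid, w.end());
    return *mid;
}

// Splits the samples into consecutive non-overlapping windows and returns the
// start index of every window whose median differs from the previous window's
// median by more than rel_threshold (e.g. 0.2 means 20%).
inline std::vector<std::size_t> find_level_shifts(const std::vector<long long>& samples,
                                                  std::size_t window,
                                                  double rel_threshold) {
    assert(window > 0);
    std::vector<std::size_t> shifts;
    for (std::size_t start = window; start + window <= samples.size(); start += window) {
        const double prev = static_cast<double>(window_median(samples, start - window, start));
        const double curr = static_cast<double>(window_median(samples, start, start + window));
        if (prev > 0.0 && std::abs(curr - prev) / prev > rel_threshold) {
            shifts.push_back(start);
        }
    }
    return shifts;
}
```

With the runtimes stored in a vector, `find_level_shifts(evaluations, 10, 0.2)` would list the window boundaries at which the median runtime changed by more than 20%. Using medians rather than means keeps a single outlier from being reported as a shift.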

Gennady_F_Intel
Moderator
3,573 views

I think these are OS-level fluctuations.


Gennady_F_Intel
Moderator
3,559 views

This issue is being closed, and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

