Intel MKL performance drop OpenMP vs TBB

Jannik · ‎07-04-2017

Hi everyone,

I tried the example program below on KNL and I am puzzled by the huge performance difference. It computes a small matrix-matrix product using MKL. In this (naive) example there is a 1000x performance difference when switching from OpenMP to TBB. The file was compiled with

 icc -std=c++11 -O3 -xmic-avx512 -mkl -qopenmp tbb_vs_omp.cpp -o omp
 icc -std=c++11 -O3 -xmic-avx512 -mkl -tbb tbb_vs_omp.cpp -o tbb

I tried a few things, e.g. using tbb::task_scheduler_init or OpenMP environment variables, but nothing seems to make the TBB version nearly as fast as the OpenMP version, or the OpenMP version as slow. Does anyone know what the problem might be and how to fix it, that is, how to configure TBB? The gap gets smaller when increasing the problem size (only 10x for N=1024).

#include <iostream>

#include <mkl.h>

constexpr size_t N    = 64;
constexpr size_t RUNS = 20;

int main() {
  double* A = (double*)_mm_malloc(N * N * sizeof(double), 64);
  double* B = (double*)_mm_malloc(N * N * sizeof(double), 64);
  double* C = (double*)_mm_malloc(N * N * sizeof(double), 64);

  VSLStreamStatePtr stream;
  vslNewStream(&stream, VSL_BRNG_SFMT19937, 1337);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
  vslDeleteStream(&stream);

  std::cout << "Created matrices, N = " << N << ".\n";

  {
    double total = 0.0;
    cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, N /* lda */, B,
                N /* ldb */, 0.0, C, N /* ldc */);
    for (size_t i = 0; i < RUNS; ++i) {
      // A[0] = i;
      double start = dsecnd();
      cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                  CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, N /* lda */,
                  B, N /* ldb */, 0.0, C, N /* ldc */);
      total += dsecnd() - start;
    }
    std::cout << "Time needed " << total << ", ";
  }

  std::cout << C[0] << '\n';

  _mm_free(A);
  _mm_free(B);
  _mm_free(C);
  return 0;
}

jimdempseyatthecove · ‎07-05-2017

Change line 21 of your program (the opening "{" of the timed block) to:

for(int iRep=0; iRep<3; ++iRep) {

and see what happens.

Jim Dempsey

Jannik · ‎07-06-2017

Thank you, I tried this and the subsequent calls are faster, but there is still a huge difference. Some numbers (all examples were run on KNL):

TBB: first loop 0.2s, subsequent loops around 0.015s
OMP: around 0.00055s each time

Could you give me a hint why the subsequent runs are faster? I expected the first call to be slow because of thread creation, but why does it take so many calls? On an i5 the two versions take about the same time.

edit: same results using gcc.

jimdempseyatthecove · ‎07-06-2017

See: https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/281761

TBB may incorporate functionality similar to KMP_BLOCKTIME. The TBB threads may be consuming processing time as a result. As a verification, add (to the iRep version) a timed wait of 2 seconds before your timed section. This should ensure that all non-master threads have suspended. It will not, however, ensure that the thread pool of the (threaded version of) MKL has been initialized. The first MKL call will incur the overhead of initializing the MKL thread pool.
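The verification described above could be sketched roughly as follows. This is only an illustration in standard C++: `time_after_idle` is a hypothetical helper, not an MKL or TBB function, and the untimed first call is an assumption added here so that pool creation stays out of the measurement.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of MKL or TBB).
// 1. Runs `work` once, untimed, so that any thread pool gets created.
// 2. Idles for `idle`, giving spinning worker threads time to suspend.
// 3. Returns the wall-clock seconds taken by a second call to `work`.
template <class Work>
double time_after_idle(Work&& work, std::chrono::milliseconds idle) {
  assert(idle.count() >= 0);
  work();                             // untimed warm-up call
  std::this_thread::sleep_for(idle);  // the "timed wait" before the timed section
  const auto start = std::chrono::steady_clock::now();
  work();                             // timed call
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

In the test program, `work` would be a lambda wrapping the cblas_dgemm call, and passing `std::chrono::seconds(2)` as `idle` would correspond to the 2-second wait.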

Lastly:

Your timed section is too small to be measured effectively. Thread start/stop/barrier times are significant when running with 64 to 256 threads.

64 x 64 arrays of doubles are relatively small, and may even be too small to use the parallel version of MKL effectively. Ensure that the sequential version of MKL is used for this test program (-mkl=sequential).
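One way to make such a small kernel measurable is to time a whole batch of calls and divide by the batch size, rather than timing each call on its own. The sketch below assumes this approach; `seconds_per_call` is a hypothetical helper and `naive_gemm_tn` is only a plain C++ stand-in for the cblas_dgemm call in the test program (same transposed-A, column-major convention), so that the example does not depend on MKL.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Stand-in for cblas_dgemm(ColMajor, Trans, NoTrans, ...): C = A^T * B
// for n x n column-major matrices. Not optimized; illustration only.
void naive_gemm_tn(std::size_t n, const std::vector<double>& A,
                   const std::vector<double>& B, std::vector<double>& C) {
  assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < n; ++i) {
      double sum = 0.0;
      for (std::size_t k = 0; k < n; ++k) {
        sum += A[k + i * n] * B[k + j * n];  // A^T(i,k) = A(k,i)
      }
      C[i + j * n] = sum;
    }
  }
}

// Hypothetical helper: average seconds per call over `calls` timed
// invocations, preceded by `warmup` untimed invocations.
template <class Work>
double seconds_per_call(Work&& work, std::size_t warmup, std::size_t calls) {
  assert(calls > 0);
  for (std::size_t i = 0; i < warmup; ++i) work();
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < calls; ++i) work();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count() /
         static_cast<double>(calls);
}
```

With the clock read only twice per batch, timer and loop overhead is spread over all calls in the batch. The batch size is a judgment call: it should be large enough that the total time is well above the timer resolution.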

Also note:

If you predominantly call MKL from multiple threads within TBB (e.g. parallel_for and/or other concurrent task)...
... then link with the serial version of MKL

In other words, ensure that MKL does not spawn a new thread pool for each of its host's threads.

If you predominantly call MKL from a single thread within TBB (e.g. main thread or other dedicated thread)...
... then link with the parallel version of MKL

While this may seem backwards, it is not. Both versions of MKL are thread-safe. The difference is whether or not MKL spawns a thread pool in the context of the calling thread.
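To see why the first combination matters, here is a toy model in standard C++ (no MKL or TBB involved). `threads_involved` is a hypothetical function: each of `outer` caller threads makes one "library call" that uses `inner` threads, namely the caller itself plus `inner - 1` helpers spawned just for that call. It is meant only to show how the thread count multiplies, not how MKL manages its pools internally.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Toy model: counts every thread that takes part when `outer` callers each
// run a call that is itself parallelized over `inner` threads.
std::size_t threads_involved(std::size_t outer, std::size_t inner) {
  assert(inner >= 1);
  std::atomic<std::size_t> count{0};
  std::vector<std::thread> callers;
  for (std::size_t c = 0; c < outer; ++c) {
    callers.emplace_back([&count, inner] {
      ++count;  // the calling thread itself
      std::vector<std::thread> helpers;
      for (std::size_t h = 1; h < inner; ++h) {
        helpers.emplace_back([&count] { ++count; });  // per-call helper
      }
      for (auto& t : helpers) t.join();
    });
  }
  for (auto& t : callers) t.join();
  return count.load();
}
```

With `inner == 1` (the sequential case) the total equals the number of callers. With a parallel library underneath, the total is `outer * inner`, which with 64 to 256 callers can greatly exceed the number of hardware threads.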

Jim Dempsey