Hi everyone,
I tried the below example program on KNL and I am puzzled about the huge performance difference. It computes a small matrix-matrix product using the MKL. In this (naive) example there is a 1000x performance difference when switching from OpenMP to TBB. The file was compiled with
```shell
icc -std=c++11 -O3 -xmic-avx512 -mkl -qopenmp tbb_vs_omp.cpp -o omp
icc -std=c++11 -O3 -xmic-avx512 -mkl -tbb tbb_vs_omp.cpp -o tbb
```
I tried a few things, e.g. using tbb::task_scheduler_init or OpenMP environment variables, but nothing makes the TBB version nearly as fast as the OpenMP version, or the OpenMP version as slow. Does anyone know what the problem might be and how to fix it, i.e. how to configure TBB? The gap gets smaller when increasing the problem size (only 10x for N = 1024).
```cpp
#include <iostream>
#include <mkl.h>

constexpr size_t N = 64;
constexpr size_t RUNS = 20;

int main() {
    double* A = (double*)_mm_malloc(N * N * sizeof(double), 64);
    double* B = (double*)_mm_malloc(N * N * sizeof(double), 64);
    double* C = (double*)_mm_malloc(N * N * sizeof(double), 64);
    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_SFMT19937, 1337);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
    vslDeleteStream(&stream);
    std::cout << "Created matrices, N = " << N << ".\n";
    {
        double total = 0.0;
        // Warm-up call.
        cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans, CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, N /* lda */, B, N /* ldb */, 0.0, C, N /* ldc */);
        for (size_t i = 0; i < RUNS; ++i) {
            // A[0] = i;
            double start = dsecnd();
            cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans, CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, N /* lda */, B, N /* ldb */, 0.0, C, N /* ldc */);
            total += dsecnd() - start;
        }
        std::cout << "Time needed " << total << ", ";
    }
    std::cout << C[0] << '\n';
    _mm_free(A);
    _mm_free(B);
    _mm_free(C);
    return 0;
}
```
change line 21 to:
```cpp
for(int iRep=0; iRep<3; ++iRep) {
```
and see what happens.
Jim Dempsey
Thank you, I tried this and the subsequent calls are faster, but there is still a huge difference. Some numbers (all examples ran on KNL):
- TBB: first loop 0.2s, subsequent loops around 0.015s
- OpenMP: around 0.00055s each time
Could you give me a hint why the later runs are faster? I expected the first call to be slow because of thread creation, but why does it take so many calls? On an i5 the two versions take about the same time.
Edit: same results using gcc.
See: https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/281761
TBB may incorporate functionality similar to KMP_BLOCKTIME, and the TBB worker threads may be consuming processing time as a result. As a verification, add (to the iRep version) a timed wait of 2 seconds before your timed section. This should assure that all non-master threads have suspended. It will not, however, assure that the (threaded version of the) MKL thread pool has been initialized: the first MKL call will still incur the overhead of initiating that pool.
Lastly:
Your timed section is too small to be measured effectively; thread start/stop/barrier overhead when running with 64 to 256 threads is significant.
A 64 x 64 matrix of doubles is relatively small, and may even be too small to make effective use of the parallel version of MKL. Make sure the sequential version of MKL is used for this test program (-mkl:sequential).
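Applied to the compile lines from the original post, that would look something like the following (assuming Linux icc, where the spelling is -mkl=sequential; -mkl:sequential is the Windows-style form):

```shell
icc -std=c++11 -O3 -xmic-avx512 -mkl=sequential -tbb tbb_vs_omp.cpp -o tbb_seq
```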
Also note:
If you predominantly call MKL from multiple threads within TBB (e.g. parallel_for and/or other concurrent task)...
... then link with the serial version of MKL
In other words, assure that MKL does not spawn a new thread pool for each of its host's threads.
If you predominantly call MKL from a single thread within TBB (e.g. main thread or other dedicated thread)...
... then link with the parallel version of MKL
While this may seem backwards, it is not. Both versions of MKL are thread-safe. The distinction is whether MKL is to spawn a thread pool in the context of the calling thread.
Jim Dempsey