Showing results for 
Search instead for 
Did you mean: 

Nested parallelisation problem OMP + MKL

I am attempting to parallelise calls to mkl within a parallel omp region to test whether or not the code executes faster. Simply parallelising part of the code does not yield linear increase in performance, hence a mixed approach makes sense. An outline of the code is as follows:

#pragma omp parallel for
for (int i = 0; i < N; i+=2) {

where some_function will make calls to zgesvd. For starters I would like the omp region to run on 2 threads and the calls to zgesvd inside to also run on 2 threads (for a total of 4 active threads). To achieve this I make the following calls in the begining of the program 


I have also tried setting omp threads to 4 and then adding threads(2) to the pragma with no success. Currently, the program creates >>3<< (??) threads on both Windows and Linux using the latest MKL & Intel compilers. Changing the value of omp_set_max_active_levels to 3 produces 4 threads on Windows and 3 threads on Linux. However, I don't exactly know what these threads are doing, I can just see their number. 

Best regards

P.S. I noticed that by default the MKL will only try to use 4 threads on a quad-core CPU with hyperthreading enabled but according to top (which should be reliable? I don't really know.) the 4 threads are not always run 1/core (though that might be up to the OS), so why the limit? 

0 Kudos
4 Replies

Dear customer,

According to your description, your program probably should be written like:

#include "mkl.h"
#include <omp.h>
#include <stdio.h>
void report_num_threads(int level)
	#pragma omp parallel num_threads(2)
        //2 sub threads for each omp level
        printf("level: %d, number of threads in the team - %d, thread: %d\n",
                  level,omp_get_num_threads(), omp_get_thread_num());
int main()
	int N=4;
	printf("total threads: %d\n",omp_get_max_threads() );	
	 #pragma omp parallel num_threads(2)
		//omp region - 2 thread


It would be run like:

level 0, sub threads 0;
level 0, sub threads 1;
level 1, sub threads 0;
level 1, sub threads 1.

The mkl_set_num_threads(2) has same functionality with omp_set_num_threads, your program actually totally set 2 threads, not 4. And may I ask the value of N? If N equals to mkl_get_max_threads(), the N probably equals to 2 not 4. Thus the some_function() actually run 1 time for each omp level. 


Dear Fiona, 

Thank you for your reply. If omp_set_num_threads and mkl_set_num_threads share the same functionality, how would one go about using threaded MKL from within a threaded OMP region? 

As for the code; what I actually want is threaded MKL functions (BLAS and LAPACK to use 2 or more threads) but called from within an OMP parallelised for loop. 

Regarding the value of N, it is a parameter, but in general it holds that N>=80. 

Best regards


marko l. wrote:

what I actually want is threaded MKL functions (BLAS and LAPACK to use 2 or more threads) but called from within an OMP parallelised for loop. 

I'm also looking for a way to do this, is this possible? and if yes, how would you go about setting up and binding the threads?



It is, an example would be:

#include <mkl.h>
#include <omp.h>
#include <iostream>
#include <random>

int main(void) {
    int ompth = 4; //Number of OMP threads for the for loop
    int mklth = 2; //Number of MKL threads for the mkl calls
    //These two parameters need not be constant (i.e. you can read them as arguments if you wish)
#pragma omp parallel for num_threads(ompth) //Set the number of threads for this loop manually
    for (int i = 0; i < 10; i++) {
        mkl_set_num_threads_local(mklth); //Set the number of threads for MKL to use within this region
        //Now we need to run some MKL routine
        std::mt19937_64 gen(i * 12345);
        std::uniform_real_distribution<double> dist(-1, 1);
        int matsize = 10000;
        MKL_Complex16* A = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        MKL_Complex16* B = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        MKL_Complex16* C = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        for (int j = 0; j < matsize * matsize; j++) { //Give them some random numbers
            A.real = dist(gen);
            A.imag = dist(gen);
            B.real = dist(gen);
            B.imag = dist(gen);
        MKL_Complex16 scale{1, 0};
        MKL_Complex16 zero{0, 0};
        cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, matsize, matsize, matsize, &scale, A, matsize, B, matsize, &zero, C, matsize);
        std::cout << "Iteration " << i << " completed by OMP thread " << omp_get_thread_num() << ". " << std::endl;
    return 0;

On my machine (Linux) this uses 8 threads. Do make sure the calls to the MKL routines warrant additional threads, sometimes, if for instance your matrices are too small, MKL will only use 1 thread regardless of how many you assigned to it because it simply doesn't make sense to create additional threads for small jobs. 

The compilation flags are: 

icpc -static -std=c++14 -Wall -O3 -qopenmp -ip -xHOST -use-intel-optimized-headers -fma -qoverride-limits -c test.cpp -o test.o
icpc test.o -lm -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -qopenmp -lpthread -o test

Hope this helps. 

Best Regards