Nested parallelisation problem OMP + MKL

marko_l_ · ‎01-16-2017

I am attempting to parallelise calls to mkl within a parallel omp region to test whether or not the code executes faster. Simply parallelising part of the code does not yield linear increase in performance, hence a mixed approach makes sense. An outline of the code is as follows:

#pragma omp parallel for
for (int i = 0; i < N; i+=2) {
     some_function(i);
}

where some_function will make calls to zgesvd. For starters I would like the omp region to run on 2 threads and the calls to zgesvd inside to also run on 2 threads (for a total of 4 active threads). To achieve this I make the following calls in the begining of the program

omp_set_num_threads(2);
mkl_set_num_threads(2);
mkl_set_dynamic(false);
omp_set_nested(true);
omp_set_max_active_levels(2);

I have also tried setting omp threads to 4 and then adding threads(2) to the pragma with no success. Currently, the program creates >>3<< (??) threads on both Windows and Linux using the latest MKL & Intel compilers. Changing the value of omp_set_max_active_levels to 3 produces 4 threads on Windows and 3 threads on Linux. However, I don't exactly know what these threads are doing, I can just see their number.

Best regards

P.S. I noticed that by default the MKL will only try to use 4 threads on a quad-core CPU with hyperthreading enabled but according to top (which should be reliable? I don't really know.) the 4 threads are not always run 1/core (though that might be up to the OS), so why the limit?

Zhen_Z_Intel · ‎01-16-2017

Dear customer,

According to your description, your program probably should be written like:

#include "mkl.h"
#include <omp.h>
#include <stdio.h>
void report_num_threads(int level)
{
	#pragma omp parallel num_threads(2)
    {
        //2 sub threads for each omp level
        printf("level: %d, number of threads in the team - %d, thread: %d\n",
                  level,omp_get_num_threads(), omp_get_thread_num());
        some_mkl_function();
    }
 }
int main()
{
       omp_set_dynamic(0);
	omp_set_num_threads(4);
	int N=4;
	printf("total threads: %d\n",omp_get_max_threads() );	
	omp_set_nested(1);
	 #pragma omp parallel num_threads(2)
    	{
		
		//omp region - 2 thread
		report_num_threads(omp_get_thread_num());
	}


    
        return(0);
}

It would be run like:

level 0, sub threads 0;
level 0, sub threads 1;
level 1, sub threads 0;
level 1, sub threads 1.

The mkl_set_num_threads(2) has same functionality with omp_set_num_threads, your program actually totally set 2 threads, not 4. And may I ask the value of N? If N equals to mkl_get_max_threads(), the N probably equals to 2 not 4. Thus the some_function() actually run 1 time for each omp level.

marko_l_ · ‎01-16-2017

Dear Fiona,

Thank you for your reply. If omp_set_num_threads and mkl_set_num_threads share the same functionality, how would one go about using threaded MKL from within a threaded OMP region?

As for the code; what I actually want is threaded MKL functions (BLAS and LAPACK to use 2 or more threads) but called from within an OMP parallelised for loop.

Regarding the value of N, it is a parameter, but in general it holds that N>=80.

Best regards

Tue_B_ · ‎07-04-2017

marko l. wrote:

what I actually want is threaded MKL functions (BLAS and LAPACK to use 2 or more threads) but called from within an OMP parallelised for loop.

I'm also looking for a way to do this, is this possible? and if yes, how would you go about setting up and binding the threads?

marko_l_ · ‎07-04-2017

It is, an example would be:

#include <mkl.h>
#include <omp.h>
#include <iostream>
#include <random>

int main(void) {
    int ompth = 4; //Number of OMP threads for the for loop
    int mklth = 2; //Number of MKL threads for the mkl calls
    //These two parameters need not be constant (i.e. you can read them as arguments if you wish)
    mkl_set_dynamic(false);
    omp_set_nested(true);
    omp_set_max_active_levels(2);
#pragma omp parallel for num_threads(ompth) //Set the number of threads for this loop manually
    for (int i = 0; i < 10; i++) {
        mkl_set_num_threads_local(mklth); //Set the number of threads for MKL to use within this region
        //Now we need to run some MKL routine
        std::mt19937_64 gen(i * 12345);
        std::uniform_real_distribution<double> dist(-1, 1);
        int matsize = 10000;
        MKL_Complex16* A = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        MKL_Complex16* B = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        MKL_Complex16* C = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        for (int j = 0; j < matsize * matsize; j++) { //Give them some random numbers
            A.real = dist(gen);
            A.imag = dist(gen);
            B.real = dist(gen);
            B.imag = dist(gen);
        }
        MKL_Complex16 scale{1, 0};
        MKL_Complex16 zero{0, 0};
        cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, matsize, matsize, matsize, &scale, A, matsize, B, matsize, &zero, C, matsize);
        std::cout << "Iteration " << i << " completed by OMP thread " << omp_get_thread_num() << ". " << std::endl;
    }
    return 0;
}

On my machine (Linux) this uses 8 threads. Do make sure the calls to the MKL routines warrant additional threads, sometimes, if for instance your matrices are too small, MKL will only use 1 thread regardless of how many you assigned to it because it simply doesn't make sense to create additional threads for small jobs.

The compilation flags are:

icpc -static -std=c++14 -Wall -O3 -qopenmp -ip -xHOST -use-intel-optimized-headers -fma -qoverride-limits -c test.cpp -o test.o
icpc test.o -lm -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -qopenmp -lpthread -o test

Hope this helps.

Best Regards