<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic It is, an example would be: in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070963#M22271</link>
    <description>Nested parallelisation problem OMP + MKL</description>
    <pubDate>Tue, 04 Jul 2017 08:41:00 GMT</pubDate>
    <dc:creator>marko_l_</dc:creator>
    <dc:date>2017-07-04T08:41:00Z</dc:date>
    <item>
      <title>Nested parallelisation problem OMP + MKL</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070959#M22267</link>
      <description>&lt;P&gt;I am attempting to parallelise calls to MKL within a parallel OMP region to test whether the code executes faster. Simply parallelising part of the code does not yield a linear increase in performance, hence a mixed approach makes sense. An outline of the code is as follows:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#pragma omp parallel for
for (int i = 0; i &amp;lt; N; i+=2) {
     some_function(i);
}&lt;/PRE&gt;

&lt;P&gt;where some_function makes calls to zgesvd. For starters I would like the OMP region to run on 2 threads and the calls to zgesvd inside it to also run on 2 threads each (for a total of 4 active threads). To achieve this I make the following calls at the beginning of the program:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;omp_set_num_threads(2);
mkl_set_num_threads(2);
mkl_set_dynamic(false);
omp_set_nested(true);
omp_set_max_active_levels(2);&lt;/PRE&gt;

&lt;P&gt;I have also tried setting the number of OMP threads to 4 and then adding num_threads(2) to the pragma, with no success. Currently, the program creates exactly 3 threads (??) on both Windows and Linux using the latest MKL &amp;amp; Intel compilers. Changing the value passed to omp_set_max_active_levels to 3 produces 4 threads on Windows and 3 threads on Linux. However, I don't know exactly what these threads are doing; I can only see how many there are.&lt;/P&gt;
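
&lt;P&gt;One way to see what the threads at each nesting level are doing is to have every thread of the inner team report its position. The following is a minimal sketch using plain OpenMP only, not MKL; the function name count_inner_threads is made up for illustration, and it assumes the runtime is allowed to create the requested teams. If the file is compiled without OpenMP support, the pragmas are ignored and the function simply reports one thread.&lt;/P&gt;

```cpp
#include "cstdio"
#ifdef _OPENMP
#include "omp.h"
#endif

// Hypothetical helper: opens an outer team and, inside each of its threads,
// an inner team, and returns how many threads executed the innermost region.
int count_inner_threads(int outer, int inner) {
    int total = 0;
#ifdef _OPENMP
    omp_set_dynamic(0);            // ask for exactly the requested team sizes
    omp_set_max_active_levels(2);  // allow two nested active parallel regions
#endif
#pragma omp parallel num_threads(outer) reduction(+ : total)
    {
        int in_team = 0;  // private to each outer thread
#pragma omp parallel num_threads(inner) reduction(+ : in_team)
        {
#ifdef _OPENMP
            // Nesting depth, parent thread id, and id within the inner team
            std::printf("level %d: outer thread %d, inner thread %d of %d\n",
                        omp_get_level(), omp_get_ancestor_thread_num(1),
                        omp_get_thread_num(), omp_get_num_threads());
#endif
            in_team += 1;  // each inner thread counts itself
        }
        total += in_team;  // add this outer thread's inner team size
    }
    return total;
}
```

&lt;P&gt;Calling count_inner_threads(2, 2) corresponds to the 2 x 2 layout described above; a return value of 4 would mean that both inner teams really got 2 threads each.&lt;/P&gt;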

&lt;P&gt;Best regards&lt;/P&gt;

&lt;P&gt;P.S. I noticed that by default MKL will only try to use 4 threads on a quad-core CPU with hyperthreading enabled. According to top (which should be reliable? I don't really know), the 4 threads do not always run one per core, though that might be up to the OS. So why the limit?&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jan 2017 12:46:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070959#M22267</guid>
      <dc:creator>marko_l_</dc:creator>
      <dc:date>2017-01-16T12:46:06Z</dc:date>
    </item>
    <item>
      <title>Dear customer,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070960#M22268</link>
      <description>&lt;P&gt;Dear customer,&lt;/P&gt;

&lt;P&gt;According to your description, your program probably should be written like:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include "mkl.h"
#include &amp;lt;omp.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

void some_mkl_function(void); // placeholder for your own MKL call

void report_num_threads(int level)
{
    // 2 sub threads for each thread of the outer omp region
    #pragma omp parallel num_threads(2)
    {
        printf("level: %d, number of threads in the team - %d, thread: %d\n",
               level, omp_get_num_threads(), omp_get_thread_num());
        some_mkl_function();
    }
}

int main()
{
    omp_set_dynamic(0);
    omp_set_num_threads(4);
    printf("total threads: %d\n", omp_get_max_threads());
    omp_set_nested(1);
    // outer omp region - 2 threads
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(omp_get_thread_num());
    }
    return 0;
}&lt;/PRE&gt;

&lt;P&gt;It would run like this:&lt;/P&gt;

&lt;P&gt;level 0, sub threads 0;&lt;BR /&gt;
level 0, sub threads 1;&lt;BR /&gt;
level 1, sub threads 0;&lt;BR /&gt;
level 1, sub threads 1.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="color: rgb(0, 0, 0); font-family: Consolas, &amp;quot;Bitstream Vera Sans Mono&amp;quot;, &amp;quot;Courier New&amp;quot;, Courier, monospace; font-size: 13.008px; background-color: rgb(248, 248, 248);"&gt;mkl_set_num_threads(2)&lt;/SPAN&gt;&amp;nbsp;has the same functionality as omp_set_num_threads, so your program actually sets 2 threads in total, not 4. May I also ask what the value of N is? If N equals mkl_get_max_threads(), then N is probably 2, not 4, and some_function() actually runs only once for each omp level.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 07:40:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070960#M22268</guid>
      <dc:creator>Zhen_Z_Intel</dc:creator>
      <dc:date>2017-01-17T07:40:32Z</dc:date>
    </item>
    <item>
      <title>Dear Fiona, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070961#M22269</link>
      <description>&lt;P&gt;Dear Fiona,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thank you for your reply. If omp_set_num_threads and mkl_set_num_threads share the same functionality, how would one go about using threaded MKL from within a threaded OMP region?&amp;nbsp;&lt;/P&gt;

&lt;P&gt;As for the code; what I actually want is threaded MKL functions (BLAS and LAPACK to use 2 or more threads) but called from within an OMP parallelised for loop.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Regarding the value of N, it is a parameter, but in general it holds that N&amp;gt;=80.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 07:52:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070961#M22269</guid>
      <dc:creator>marko_l_</dc:creator>
      <dc:date>2017-01-17T07:52:30Z</dc:date>
    </item>
    <item>
      <title>Quote:marko l. wrote:</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070962#M22270</link>
      <description>&lt;BLOCKQUOTE&gt;marko l. wrote:&lt;BR /&gt;
&lt;P&gt;what I actually want is threaded MKL functions (BLAS and LAPACK to use 2 or more threads) but called from within an OMP parallelised for loop.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I'm also looking for a way to do this. Is it possible? If so, how would you go about setting up and binding the threads?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Jul 2017 07:50:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070962#M22270</guid>
      <dc:creator>Tue_B_</dc:creator>
      <dc:date>2017-07-04T07:50:16Z</dc:date>
    </item>
    <item>
      <title>It is, an example would be:</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070963#M22271</link>
      <description>&lt;P&gt;It is, an example would be:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;mkl.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;random&amp;gt;

int main(void) {
    int ompth = 4; //Number of OMP threads for the for loop
    int mklth = 2; //Number of MKL threads for the mkl calls
    //These two parameters need not be constant (i.e. you can read them as arguments if you wish)
    mkl_set_dynamic(false);
    omp_set_nested(true);
    omp_set_max_active_levels(2);
#pragma omp parallel for num_threads(ompth) //Set the number of threads for this loop manually
    for (int i = 0; i &amp;lt; 10; i++) {
        mkl_set_num_threads_local(mklth); //Set the number of threads for MKL to use within this region
        //Now we need to run some MKL routine
        std::mt19937_64 gen(i * 12345);
        std::uniform_real_distribution&amp;lt;double&amp;gt; dist(-1, 1);
        int matsize = 10000;
        MKL_Complex16* A = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        MKL_Complex16* B = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        MKL_Complex16* C = (MKL_Complex16*)mkl_calloc(matsize * matsize, sizeof(MKL_Complex16), 64); //Alignment for AVX512 calls (soon)
        for (int j = 0; j &amp;lt; matsize * matsize; j++) { //Give them some random numbers
            A[j].real = dist(gen);
            A[j].imag = dist(gen);
            B[j].real = dist(gen);
            B[j].imag = dist(gen);
        }
        MKL_Complex16 scale{1, 0};
        MKL_Complex16 zero{0, 0};
        cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, matsize, matsize, matsize, &amp;amp;scale, A, matsize, B, matsize, &amp;amp;zero, C, matsize);
        std::cout &amp;lt;&amp;lt; "Iteration " &amp;lt;&amp;lt; i &amp;lt;&amp;lt; " completed by OMP thread " &amp;lt;&amp;lt; omp_get_thread_num() &amp;lt;&amp;lt; ". " &amp;lt;&amp;lt; std::endl;
        mkl_free(A); //Release the buffers so the iterations do not leak memory
        mkl_free(B);
        mkl_free(C);
    }
    return 0;
}&lt;/PRE&gt;

&lt;P&gt;On my machine (Linux) this uses 8 threads. Do make sure that the calls to the MKL routines warrant additional threads: if, for instance, your matrices are too small, MKL will use only 1 thread regardless of how many you assigned to it, because it simply doesn't make sense to create additional threads for small jobs.&lt;/P&gt;
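
&lt;P&gt;Besides the thread count, the memory footprint of this example deserves a quick check before running it. The sketch below only does the arithmetic; the struct Complex16 is a stand-in assumed to have the same layout as MKL_Complex16 (two doubles), and the function names are made up for illustration.&lt;/P&gt;

```cpp
// Stand-in for MKL_Complex16: two doubles, 16 bytes on common platforms (assumption).
struct Complex16 {
    double real;
    double imag;
};

// Bytes needed by one loop iteration: three matsize x matsize matrices (A, B, C).
unsigned long long bytes_per_iteration(unsigned long long matsize) {
    return 3ULL * matsize * matsize * sizeof(Complex16);
}

// Bytes needed when `concurrent` iterations are in flight at the same time,
// which is at most the number of threads in the outer OMP team.
unsigned long long bytes_in_flight(unsigned long long matsize, unsigned long long concurrent) {
    return concurrent * bytes_per_iteration(matsize);
}
```

&lt;P&gt;With matsize = 10000 this gives 4,800,000,000 bytes (about 4.8 GB) per iteration, and about 19.2 GB when 4 iterations run concurrently, so a smaller matsize may be needed on machines with less memory.&lt;/P&gt;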

&lt;P&gt;The compilation flags are:&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;icpc -static -std=c++14 -Wall -O3 -qopenmp -ip -xHOST -use-intel-optimized-headers -fma -qoverride-limits -c test.cpp -o test.o
icpc test.o -lm -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -qopenmp -lpthread -o test&lt;/PRE&gt;

&lt;P&gt;Hope this helps.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best Regards&lt;/P&gt;</description>
      <pubDate>Tue, 04 Jul 2017 08:41:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Nested-parallelisation-problem-OMP-MKL/m-p/1070963#M22271</guid>
      <dc:creator>marko_l_</dc:creator>
      <dc:date>2017-07-04T08:41:00Z</dc:date>
    </item>
  </channel>
</rss>

