Solved: enabling MKL breaks existing OpenMP code

morley__dustin · ‎07-25-2018

On Windows, using Visual Studio 2015, I have a project that makes extensive use of OpenMP to speed up for loops. It works great: we've been using it for a long time and have had nothing but good experiences with it.

However, I've discovered that, after installing MKL, going into the project properties and setting "Use Intel MKL" to parallel completely breaks these loops: only a single core is now used (which in my case renders the program unusable), even though I have not yet added a single line of MKL code! Even worse, none of the suggested "workarounds" I have seen fix it. Functions like omp_set_num_threads, mkl_set_num_threads and omp_set_dynamic appear to do absolutely nothing. If I instead set MKL to sequential rather than parallel, the problem goes away. However, given how mysterious this behavior seems, I don't feel comfortable just leaving it set that way without getting some answers first (not to mention that I may well run into a scenario later where I want to run an MKL routine in parallel).

Can someone please explain what is going on here, and how to correctly link MKL configured as parallel into a program while still letting that program use all cores in omp parallel for loops? What is the actual logic by which Intel allocates threads to OpenMP and MKL, and how can the programmer step in and control this when Intel's default logic does not produce the desired outcome? Fundamentally, I don't understand why enabling MKL to potentially run parallel code should interfere with OpenMP parallel for loops when no parallel MKL code is actually being run.

Thanks in advance.

Ying_H_Intel · ‎07-25-2018

Hi Dustin,

Thank you for posting this "mysterious" issue and describing what you have tried. I happened to build a test case, which I have attached here for your reference; if possible, you could create a reproducer based on it.

Here are the facts as you reported them:
1. The loops break such that only a single core is now used.
2. With MKL set to sequential rather than parallel, the problem does not appear.
3. Not a single line of MKL code has been added yet.
4. omp_set_num_threads, mkl_set_num_threads and omp_set_dynamic appear to do absolutely nothing.
5. The project uses OpenMP extensively to speed up for loops.

On the MSVC side:
6. Setting "Use Intel MKL" to parallel
7. Setting MKL to sequential

And our documentation:
https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-with-threaded-applications/
https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application

Like you, I think this is basically an OpenMP-related issue, so let's check OpenMP first.
1) Which compiler were you using before switching on the MKL option: the Microsoft C++ compiler (the default) or the Intel compiler?
More specifically, which OpenMP runtime library are you using? MSVC may use vcomp*.lib, while Intel uses libiomp5md. For example, the linker output may contain:
Searching C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\lib\amd64\VCOMPD.lib:

2) Regarding "only a single core is now used": do you have nested OpenMP regions? The most common root cause is that Intel's libiomp5 uses 1 thread by default when it detects that it is inside a nested OpenMP region. Could you please add some OpenMP queries to your code, like
printf("openmp threads are %d \n", omp_get_num_threads());

and in your OpenMP loop:
printf("My ID is %d \n", omp_get_thread_num());

Regarding item 6, setting "Use Intel MKL" to parallel actually results in the following:
Searching libraries
1>      Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.3.210\windows\mkl\lib\intel64_win\mkl_intel_lp64_dll.lib:
1>      Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.3.210\windows\mkl\lib\intel64_win\mkl_intel_thread_dll.lib:
1>      Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.3.210\windows\mkl\lib\intel64_win\mkl_core_dll.lib:
1>      Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.3.210\windows\compiler\lib\intel64_win\libiomp5md.lib:

If there is no MKL function call, then only libiomp5md.lib can actually have broken the program.

3) Could you please open your project's property pages => Linker => General => Show Progress => For Libraries Searched (/VERBOSE:Lib) and copy the MKL part and the VCOMPD.lib part here, so that we can look into the problem further?

Best Regards,
Ying

Ying_H_Intel · ‎07-25-2018

Attached are the MSVC .sln and .cpp files.

morley__dustin · ‎07-26-2018

Hi Ying,

Thanks for your quick and detailed response! Here is some further information regarding the items you alluded to.

1) I am pretty sure it is the Microsoft C++ compiler rather than the Intel one. The linked library is vcomp.lib; see 3) below.

2) I'm not 100% sure precisely what you mean by nest case. If you mean an omp loop within another omp loop, then the answer is no. If you just mean applying OpenMP to nested for loops in the broad sense, then yes: for example, the particular case I am focusing my troubleshooting efforts on is a quadruple for loop with an omp statement declared above the outermost loop only.

At any rate, here are the results of the print statements you requested:

Parallel:
openmp threads are 1
My ID is 0
My ID is 0
My ID is 0
My ID is 0
My ID is 0
My ID is 0
My ID is 0
My ID is 0
My ID is 0
My ID is 0
...

Sequential:
openmp threads are 1
My ID is 8
My ID is 0
My ID is 2
My ID is 3
My ID is 7
My ID is 9
My ID is 4
My ID is 11
My ID is 10
My ID is 1
My ID is 7
My ID is 11
My ID is 5
My ID is 9
My ID is 10
My ID is 6
...

If I check the number of threads inside the omp loop, not surprisingly it is still 1 in the parallel case but 12 in the sequential case. This exercise shows that in the sequential case, OpenMP dynamically spawns multiple threads as needed to launch an omp loop and then destroys the threads afterwards, while in the parallel case it is unable to do this for some reason.

3) Here are the requested sections of linker output:

parallel case:

1> Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mkl\lib\intel64_win\mkl_intel_lp64_dll.lib:
1> Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mkl\lib\intel64_win\mkl_intel_thread_dll.lib:
1> Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mkl\lib\intel64_win\mkl_core_dll.lib:
1> Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\compiler\lib\intel64_win\libiomp5md.lib:

1> Searching C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\lib\amd64\VCOMP.lib:

sequential case:

1> Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mkl\lib\intel64_win\mkl_intel_lp64_dll.lib:
1> Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mkl\lib\intel64_win\mkl_sequential_dll.lib:
1> Searching C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mkl\lib\intel64_win\mkl_core_dll.lib:

1> Searching C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\lib\amd64\VCOMP.lib:

So I guess naively, the problem must be caused by either mkl_intel_thread_dll.lib or libiomp5md.lib. But of course that doesn't tell me what I should do about it to get things working correctly.

Does that help you identify a solution?

Thanks,

Dustin

Ying_H_Intel · ‎07-27-2018

Hi Dustin,

Great test, and thanks for adding more information. As I understand it, an OpenMP call like omp_set_num_threads(12) should take priority and keep your parallel code running as before. Could it be a bug on our side?

But you can determine whether it is an MKL issue or an OpenMP issue as follows:

1) Switch off the MKL option, then in your project properties => Linker => Input => Additional Dependencies, add libiomp5md.lib manually,
and add its path under Linker => General => Additional Library Directories => C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\compiler\lib\intel64_win,
and see whether any ID has a value other than 0.

2) What kind of CPU are you using? It seems to be a 12-core CPU, right?

3) What is the output of the following call before the loop?
printf("OpenMP number is %d \n", omp_get_max_threads());

4) Reproducer:
You mentioned a quadruple for loop with an omp statement declared above the outermost loop only.

I created the sample code below. I wonder what the size of your outermost loop is and where you put the printf("My ID is %d \n", omp_get_thread_num()); call.
Could you please run it on your machine and let us know the result?

Best Regards,
Ying

#pragma omp parallel for //num_threads(4)
for (int b = 0; b < 16; ++b) {
  printf("My ID is %d \n", omp_get_thread_num());
  // inner loops as in the full listing
}

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {

  int batch = 16;
  int channel = 64;
  int height = 40;
  int width = 100;
  float *src = (float *)malloc(sizeof(float) * batch * channel * height * width);
  if (src == NULL) return 1;

  omp_set_nested(1);
  //   omp_set_max_active_levels(2);
  printf("OpenMP number is %d \n", omp_get_max_threads());
  // Outside a parallel region, omp_get_num_threads() always returns 1.
  printf("OpenMP number is %d \n", omp_get_num_threads());

#pragma omp parallel for //num_threads(4)
  for (int b = 0; b < batch; ++b) {
    printf("My ID is %d \n", omp_get_thread_num());
    for (int c = 0; c < channel; ++c) {
      for (int h = 0; h < height; ++h) {
        for (int w = 0; w < width; ++w) {
          // Declared inside the loop so each thread has its own copy;
          // a single shared index would be a data race.
          int index = ((b*channel + c) * height + h) * width + w;
          src[index] = index * 0.002f;
        }
      }
    }
  }

  free(src);
  return 0;

}

morley__dustin · ‎07-27-2018

Hi Ying,

Following your suggestions I was able to very tightly narrow down the issue even further. For completeness let me first answer the first three questions:

1) Adding this library manually with MKL turned off causes the same single-thread issue. So libiomp5md.lib is definitely the library that breaks it.

2) Actually, 6-core (Intel Core i7-5930K). So it is running 2 threads per core.

3) omp_get_max_threads returns 12 with mkl disabled or sequential, 1 with mkl parallel (presumably just libiomp5md.lib makes the difference).

At this point, I discovered that a freshly created console application running your code snippet doesn't have the same issue, meaning that the problem is specific to our existing (very large) application. Fortunately I was able to find something much more specific than that: there appears to be a conflict between libiomp5md and SetThreadAffinityMask (which we use to occasionally fire off certain processes on specific cores). Here are two ways I found that happen to fix the problem:

1 - Avoid calling SetThreadAffinityMask

2 - Call some OpenMP function very early in InitInstance, before any calls to SetThreadAffinityMask (e.g. calling omp_get_max_threads there seems to do the trick).

I don't particularly love either of those options but one of them might end up being tolerable (I would have to discuss this further with the other developers). Do you have any additional information as to why there is a conflict between libiomp5md and usage of SetThreadAffinityMask? Are there perhaps some "best practice" principles when using OpenMP within an application that also uses SetThreadAffinityMask that we might not be doing?

Thanks,

Dustin

Ying_H_Intel · ‎07-29-2018

Hi Dustin,

Glad to know that. Regarding the root cause, could you add SetThreadAffinityMask() to the test case to reproduce the problem? (I created one below, but it doesn't seem to reproduce the problem. Do you have other Windows threads running before the #pragma omp parallel for call?)

Best Regards,
Ying

// The code below will use only 1 core, but OpenMP will create the threads anyway.
// As I understand it, vcomp and libiomp should behave the same with SetThreadAffinityMask().
DWORD_PTR mask = 1;
SetThreadAffinityMask(GetCurrentThread(), mask);
printf("tid=%d new_mask=%08X \n", omp_get_num_threads(), *(unsigned int*)(&mask));
// omp_set_nested(1);
// omp_set_max_active_levels(2);
printf("OpenMP number is %d \n", omp_get_max_threads());
//printf("OpenMP number is %s \n", kmp_versions());

#pragma omp parallel for //num_threads(12)
// for (int loop = 0; loop < 10; ++loop) {
for (int b = 0; b < 16; ++b) {
  printf("My ID is %d \n", omp_get_thread_num());
  for (int c = 0; c < 64; ++c) {
   for (int h = 0; h < height; ++h) {
    for (int w = 0; w < width; ++w) {
     index = ((b*channel + c) * height + h) * width + w;
     src[index] = index * 0.002;
    }
   }
  }
}
// }
return 0;
}

Output:
OpenMP number is 4
tid=1 new_mask=00000001
OpenMP number is 4
My ID is 0
My ID is 2
My ID is 1
My ID is 3
My ID is 2
My ID is 1
My ID is 3
My ID is 2
My ID is 1
My ID is 2
My ID is 3
My ID is 1
My ID is 0
My ID is 3
My ID is 0
My ID is 0

morley__dustin · ‎07-30-2018

Hi Ying,

I also could not get the issue to manifest in a console application. However, the issue readily reproduces in a blank MFC application. Rather than attaching the entire project (> 100MB) I will describe the steps to reproduce the issue:

1. Create a new dialog-based MFC project called TestMFCIntelOpenMP

2. Turn on OMP support (project properties --> C/C++ --> language --> Open MP Support)

3. Add the following code in CTestMFCIntelOpenMPDlg::OnInitDialog() (placement shouldn't matter but I put it at the top immediately after the base class call to CDialogEx::OnInitDialog)

CString strTest;
strTest.Format(_T("OpenMP number is %d\n"), omp_get_max_threads());
OutputDebugString(strTest);

#pragma omp parallel for
for (int i = 0; i < 100; i++)
{
    // A per-iteration string avoids concurrent writes to one shared CString.
    CString strId;
    strId.Format(_T("My ID is %d\n"), omp_get_thread_num());
    OutputDebugString(strId);
}

4. In CTestMFCIntelOpenMPApp::InitInstance(), add the line SetThreadAffinityMask(GetCurrentThread(), 1). IMPORTANT - add this line BEFORE the instance of CTestMFCIntelOpenMPDlg is created (I did it immediately before)

You should now observe that setting MKL use to parallel creates the issue (OpenMP number is 1, all IDs 0), which can then be resolved by getting rid of the call to SetThreadAffinityMask.

I hope this helps to isolate the problem! If for some reason this doesn't give you the same results I will try to attach some of the files.

Thanks,

Dustin

Ying_H_Intel · ‎08-01-2018

Hi Dustin,

We can reproduce the problem. I have posted it to the C++ forum and asked our compiler team to look into the cause.

Best Regards,
Ying

Olga_M_Intel · ‎08-02-2018

The Intel OpenMP runtime library respects the affinity mask of the initial thread. So if the affinity mask was restricted by a SetThreadAffinityMask() call, that is what caused your problem.

To work around this problem, add the following call to your code just before the parallel region:

kmp_set_defaults("KMP_AFFINITY=norespect|KMP_SETTINGS=1");

P.S. KMP_SETTINGS=1 prints the user/effective settings of the OpenMP runtime. You can set it in the Environment setting of your project instead.

PPS. MFC has nothing to do with this problem. This "issue" can be easily reproduced with a console app.

morley__dustin · ‎08-02-2018

Thanks Olga, that makes sense. I'm assuming then that OpenMP "respecting the affinity mask of the initial thread" is a relatively new feature (within the last 5 to 10 years) since we did not observe this behavior until trying to activate MKL parallel which in turn results in including libiomp5md. I am a bit curious as to why this behavior was not reproduced in a console app in the efforts of myself and Ying, but it does of course make much more sense for the issue to be independent of MFC.

At any rate, my team and I for sure have enough information now to appropriately restructure our code and/or apply a workaround to ensure OpenMP works as it should. Thanks again Olga and Ying!