Hi,
We are using MKL libraries in a C++ application running in a Spark environment on Windows. We use Spark to orchestrate the processing of a large dataset on multiple machines, and the C++ application is invoked to do the actual processing. All machines involved run Windows Server.
On some machines, the first #pragma omp parallel for fails with these messages:
OMP: Error #134: Cannot set thread affinity mask.
OMP: System error #31: A device attached to the system is not functioning.
The same code runs perfectly in all other environments (for instance Windows desktops, Azure cloud machines running Windows). I also replaced the omp parallel for with a std::thread implementation and then the code failed in the next omp parallel for invocation.
The MKL DLLs we are using are version 2019.0.4.1, and libiomp5md.dll is version 5.0.2019.312.
I realize that this amount of information is insufficient for someone on this forum to debug the issue, but I cannot get much else, and I'd appreciate it if someone could point us in the right direction on how to think about this issue. For instance:
- In the above error messages, is the "device" being referred to the CPU?
- What causes OMP to fail to set the thread affinity mask? The C++ application runs with admin privileges, so it does have permission to set the affinity mask.
- We are not setting the affinity mask ourselves. The code calls mkl_set_num_threads and omp_set_num_threads following which the parallel for is executed. So my hunch is that setting thread affinity is being done internally by OMP. If that is the case, can we disable it? Will it hurt performance tremendously?
- Is there any machine property (like CPU/chipset) that would cause the set thread affinity mask call to fail?
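For reference, the call pattern described above looks roughly like the following minimal sketch. The names `process_item` and `process_all` are hypothetical stand-ins, not our real code, and the `mkl_set_num_threads` call is only indicated in a comment so that the sketch builds without MKL. The assumption is that the OpenMP runtime does its affinity setup when the first parallel region starts, which would match where the error shows up.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the real per-element work.
double process_item(double x) { return 2.0 * x; }

// Sets the thread count, then runs one parallel loop over the input.
std::vector<double> process_all(const std::vector<double>& in, int num_threads) {
    std::vector<double> out(in.size());
#ifdef _OPENMP
    // The real application also calls mkl_set_num_threads(num_threads) here.
    omp_set_num_threads(num_threads);
#else
    (void)num_threads;  // Built without OpenMP: the loop below runs serially.
#endif
    const int n = static_cast<int>(in.size());
    // First parallel region in the process; this is where Error #134 is reported.
#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        out[k] = process_item(in[k]);
    }
    return out;
}
```

Nothing in this pattern sets an affinity mask explicitly.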
Thanks!
Hi,
Could you please share the details of the Windows Server version you are using, so that we can check whether it is supported and guide you accordingly?
Best regards,
Shanmukh.SS
Thanks Shanmukh for the quick response!
This is all the information I have:
OS: Windows_NT
PROCESSOR_ARCHITECTURE: AMD64
PROCESSOR_IDENTIFIER: Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 5507
I believe the server version is newer than Windows Server 2016, most likely 2019, Datacenter edition. For security reasons, we cannot access these machines directly, and I don't have an easy way of pulling out their configurations.
Will this help?
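When direct access to the machines is not possible, a small helper inside the application can log the same details on each run. The sketch below is only an illustration: `env_or_unset` and `describe_host` are hypothetical names, and the list of variables is an assumption (the ones quoted above plus `NUMBER_OF_PROCESSORS`, `KMP_AFFINITY`, and `OMP_NUM_THREADS`).

```cpp
#include <cstdlib>
#include <initializer_list>
#include <sstream>
#include <string>
#include <thread>

// Returns the value of an environment variable, or "(unset)" if it is absent.
std::string env_or_unset(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// Builds a multi-line report of host details that can be written to the job log.
std::string describe_host() {
    std::ostringstream os;
    for (const char* name : {"OS", "PROCESSOR_ARCHITECTURE", "PROCESSOR_IDENTIFIER",
                             "PROCESSOR_LEVEL", "PROCESSOR_REVISION",
                             "NUMBER_OF_PROCESSORS", "KMP_AFFINITY", "OMP_NUM_THREADS"}) {
        os << name << ": " << env_or_unset(name) << '\n';
    }
    // Logical processors visible to this process, as reported by the C++ runtime.
    os << "hardware_concurrency: " << std::thread::hardware_concurrency() << '\n';
    return os.str();
}
```

Writing `describe_host()` to the job log before the first parallel region makes it possible to compare failing and healthy machines afterwards.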
Hi,
>>We are not setting the affinity mask ourselves. The code calls mkl_set_num_threads and omp_set_num_threads following which the parallel for is executed. So my hunch is that setting thread affinity is being done internally by OMP. If that is the case, can we disable it? Will it hurt performance tremendously?
>>OMP: Error #134: Cannot set thread affinity mask.
This issue can be worked around by setting the environment variable KMP_AFFINITY=disabled, although this may have performance implications.
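If the variable cannot be set in the environment that launches the process, one option is to set it from inside the application before anything touches OpenMP or MKL. This is a minimal sketch, not a confirmed recipe: `set_process_env` and `disable_openmp_affinity` are hypothetical helper names, and it assumes the OpenMP runtime reads `KMP_AFFINITY` only when it initializes, so the call has to come at the very start of `main()`.

```cpp
#include <cstdlib>
#include <string>

// Sets an environment variable for the current process. Returns true on success.
bool set_process_env(const std::string& name, const std::string& value) {
#ifdef _WIN32
    return _putenv_s(name.c_str(), value.c_str()) == 0;
#else
    return setenv(name.c_str(), value.c_str(), 1) == 0;  // 1 = overwrite existing value
#endif
}

// Hypothetical helper: call before the first OpenMP or MKL call in the process.
bool disable_openmp_affinity() {
    return set_process_env("KMP_AFFINITY", "disabled");
}
```

Setting the variable in the launching environment remains the safer route, because it does not depend on when the runtime reads its settings.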
Best Regards,
Shanmukh.SS
Hi,
Reminder:
Has the information provided helped? Is your issue resolved? Please let us know if you need any other information.
Best Regards,
Shanmukh.SS
Hi Shanmukh,
Sorry for the delay. We haven't tried this switch yet, but we tried upgrading the OMP and MKL libraries to the latest versions. We are still encountering this error intermittently.
I'm a bit hesitant to use the KMP_AFFINITY solution because our application is very performance-sensitive, and as you said, it may impact perf.
I did have a question though: can you please let us know when this error occurs? Specifically, why does the library think that the device is not functioning? Is it a mismatched CPU, the OS, or a permissions issue?
Thanks!
Hi Gopal,
>>Can you please let us know when this error occurs? Specifically, why the library thinks that the device is not functioning? Is it a mismatched CPU? or OS or a permissions issue?
According to the log you posted, the issue relates to the CPU, which might not be functioning properly.
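One way to gather more evidence on an affected machine is to attempt an affinity call directly and log the Windows error code. The sketch below is an assumption-laden probe, not a description of what the OpenMP runtime does internally: `AffinityProbe` and `probe_thread_affinity` are hypothetical names, and it only re-applies the process affinity mask to the calling thread. For comparison, system error 31 in the log is the Windows code `ERROR_GEN_FAILURE`.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Result of the probe; 'supported' is false when built for a non-Windows target.
struct AffinityProbe {
    bool supported;
    bool ok;
    unsigned long error_code;    // GetLastError() value if a call failed, else 0
    std::uint64_t process_mask;  // Affinity mask of the process, if it could be read
};

AffinityProbe probe_thread_affinity() {
    AffinityProbe r{false, false, 0, 0};
#ifdef _WIN32
    r.supported = true;
    DWORD_PTR process_mask = 0;
    DWORD_PTR system_mask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask)) {
        r.error_code = GetLastError();
        return r;
    }
    r.process_mask = static_cast<std::uint64_t>(process_mask);
    // Re-apply the process mask to the calling thread; a return value of 0 means failure.
    if (SetThreadAffinityMask(GetCurrentThread(), process_mask) == 0) {
        r.error_code = GetLastError();
        return r;
    }
    r.ok = true;
#endif
    return r;
}
```

If the probe fails with the same error code on the machines that show Error #134 and succeeds elsewhere, that points at the machine or its configuration rather than at the application code.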
Best Regards,
Shanmukh.SS
Hi Shanmukh,
Just want to confirm that the KMP_AFFINITY=disabled solution that you provided did solve the problem. I am not seeing a significant performance hit either, but am running more experiments to be sure.
I'm also following up with the service team regarding the bad CPU. Will update this post once I find something.
Thanks for all the help!
Gopal.
Hi Gopal,
>>Just want to confirm that the KMP_AFFINITY=disabled solution that you provided did solve the problem. I am not seeing a significant performance hit either, but am running more experiments to be sure.
Thanks for the confirmation and for sharing the details about the workaround. If this resolves your issue, please let us know whether we can close this thread at our end.
Best Regards,
Shanmukh.SS
Yes, please go ahead.
Hi Gopal,
Thanks for the confirmation! Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Best Regards,
Shanmukh.SS