Hello support,
I am investigating a recurring problem on Windows Server 2019 with Intel MPI 2021.10. mpiexec.exe consistently fails with the message below, but only at certain process counts. On our node we can run with 13 and 15 processes, but not 14. The behavior was first observed with a commercial CFD solver, and I have been able to reproduce it with the IMB-MPI1.exe benchmark. The attached batch file and its output show the cpuinfo.exe results and the benchmark's behavior with 13, 15, and 14 processes.
[mpiexec@eu-vw-num02] check_downstream_work_complition (mpiexec.c:1307): trying to close other downstreams
[mpiexec@eu-vw-num02] HYD_sock_write (..\windows\src\hydra_sock.c:387): write error (errno = 0)
[mpiexec@eu-vw-num02] wmain (mpiexec.c:2275): assert (pg->intel.exitcodes != NULL) failed
[mpiexec@eu-vw-num02] HYD_sock_write (..\windows\src\hydra_sock.c:387): write error (errno = 0)
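I ran with I_MPI_DEBUG enabled, and at first glance the pinning table looks suspicious: the domains differ in size, several of them overlap, and rank 6 is pinned to CPUs 0 through 39. See the output for -np 13 below.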
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 78096 eu-vw-num02 0,1,2,3,4,5,64,66,67,69,70,73,74,78,79
[0] MPI startup(): 1 50132 eu-vw-num02 6,7,8,9,10,11,64,66,68,69,71,73,74,78,79
[0] MPI startup(): 2 44600 eu-vw-num02 12,13,14,15,16,17,65,67,70,73,74,78,79
[0] MPI startup(): 3 77216 eu-vw-num02 18,19,20,21,22,23,66,67,68,69,70,73,74,78,79
[0] MPI startup(): 4 61268 eu-vw-num02 24,25,26,27,28,29,65,66,67,69,70,73,74,78,79
[0] MPI startup(): 5 81108 eu-vw-num02 30,31,32,33,34,35,65,66,67,68,70,73,74,78,79
[0] MPI startup(): 6 29248 eu-vw-num02 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,64,67,69,70,73,74,78,79
[0] MPI startup(): 7 6256 eu-vw-num02 2,3,4,5,6,7,64,68,70,73,74,78,79
[0] MPI startup(): 8 31184 eu-vw-num02 8,9,10,11,12,13,66,68,69,71,73,74,78,79
[0] MPI startup(): 9 60324 eu-vw-num02 14,15,16,17,18,19,64,65,67,70,73,74,78,79
[0] MPI startup(): 10 53076 eu-vw-num02 20,21,22,23,24,25,64,65,66,67,69,70,73,74,78,79
[0] MPI startup(): 11 82048 eu-vw-num02 26,27,28,29,30,31,65,66,68,70,73,74,78,79
[0] MPI startup(): 12 34760 eu-vw-num02 32,33,34,35,36,37,64,65,66,67,68,70,73,74,78,79
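For anyone who wants to check a table like this without reading it by eye, here is a minimal Python sketch. It assumes the line layout shown above (rank, pid, node name, comma-separated CPU list); the names `parse_pinning` and `summarize` are made up for illustration and are not part of Intel MPI.

```python
import re
from collections import Counter

# Matches data rows such as:
# [0] MPI startup(): 7 6256 eu-vw-num02 2,3,4,5,6,7,64,68,70,73,74,78,79
# The header row ("Rank Pid Node name Pin cpu") does not match and is skipped.
PIN_LINE = re.compile(r"MPI startup\(\):\s+(\d+)\s+(\d+)\s+(\S+)\s+([\d,]+)\s*$")


def parse_pinning(log_text):
    """Return {rank: set of CPU ids} parsed from I_MPI_DEBUG startup output."""
    table = {}
    for line in log_text.splitlines():
        match = PIN_LINE.search(line)
        if match:
            rank = int(match.group(1))
            table[rank] = {int(c) for c in match.group(4).split(",") if c}
    return table


def summarize(table):
    """Return (domain size per rank, sorted CPUs that occur in more than one domain)."""
    sizes = {rank: len(cpus) for rank, cpus in table.items()}
    usage = Counter(cpu for cpus in table.values() for cpu in cpus)
    shared = sorted(cpu for cpu, count in usage.items() if count > 1)
    return sizes, shared


# Two rows copied from the table above, used only as sample input.
SAMPLE = """\
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 78096 eu-vw-num02 0,1,2,3,4,5,64,66,67,69,70,73,74,78,79
[0] MPI startup(): 2 44600 eu-vw-num02 12,13,14,15,16,17,65,67,70,73,74,78,79
"""

sizes, shared = summarize(parse_pinning(SAMPLE))
print("domain sizes:", sizes)
print("CPUs shared between ranks:", shared)
```

Unequal domain sizes or a long list of shared CPUs is only a hint that the layout deserves a closer look; the sketch does not decide whether a given pinning is valid.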
Following this evidence I reran the previously failing -np 14 test with I_MPI_PIN_DOMAIN=socket. This test works normally and the pinning table is reasonable. Thus, my conclusion is that there is something wrong with the default pinning on Windows.
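As a rough sketch of how the workaround can be applied from a script, the Python below sets I_MPI_PIN_DOMAIN=socket in the child environment before calling mpiexec.exe. The helper names `build_launch` and `run_with_pinning` are hypothetical, the sketch assumes mpiexec.exe and the benchmark are on PATH, and the I_MPI_DEBUG level is left as a parameter because the level used in the original run is not stated.

```python
import os
import subprocess


def build_launch(np, program, pin_domain="socket", debug_level=None):
    """Return (command, environment) for an mpiexec run with explicit pinning."""
    env = dict(os.environ)               # copy, so the parent environment is untouched
    env["I_MPI_PIN_DOMAIN"] = pin_domain
    if debug_level is not None:
        # Assumption: the caller picks a level that prints the pinning table.
        env["I_MPI_DEBUG"] = str(debug_level)
    cmd = ["mpiexec.exe", "-np", str(np), program]
    return cmd, env


def run_with_pinning(np, program, **kwargs):
    """Launch the program; raises CalledProcessError if mpiexec exits non-zero."""
    cmd, env = build_launch(np, program, **kwargs)
    return subprocess.run(cmd, env=env, check=True)


# Example call (not executed here), mirroring the failing case from this post:
# run_with_pinning(14, "IMB-MPI1.exe")
```

Setting the variable per launch rather than system-wide keeps the change limited to the jobs that need it.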
Could you please confirm if this is a known issue and if the I_MPI_PIN_DOMAIN workaround is safe? Some of our customers are encountering the same behavior on their Windows machines.
Thanks,
-David
Hi,
Thanks for posting in Intel Communities.
We have informed the concerned development team. We will get back to you soon.
Thanks & Regards,
Shaik Rabiya
Hi,
Thank you for your patience.
>>>Could you please confirm if this is a known issue and if the I_MPI_PIN_DOMAIN workaround is safe? Some of our customers are encountering the same behavior on their Windows machines.
Explicitly pinning with the environment variable I_MPI_PIN_DOMAIN=socket is a safe and valid workaround. Could you please specify why you feel that this workaround may not be safe?
We are still investigating the suspicious pinning and the error, and we will get back to you soon.
Thanks & Regards,
Shaik Rabiya
Hello Shaik,
Thanks for the feedback. I will continue with the I_MPI_PIN_DOMAIN workaround for the time being. It seems safe to me as well, but as the default pinning was suspicious I just wanted to check if Intel was aware of the problem and if there was a better workaround.
Please let me know if this issue is fixed in a future release and I will upgrade.
Thanks,
-David
Hi,
Could you please confirm whether you are using a virtual machine, since the number of sockets reported is unusually high? If so, could you please provide the topology of your virtual machine so that we can investigate the issue more closely?
Thanks & Regards,
Shaik Rabiya
Hello Shaik,
Yes, I believe this is a VM, although I am not 100% sure, as I have never accessed this machine directly, only over remote desktop. The details of the node topology are shown in the cpuinfo.exe output in my original message. If there are any other details you need, please let me know and I'll try to gather the information.
-David
Hi,
I apologize for overlooking the topology information provided in the script. Thanks for confirming that the machine is a virtual machine. Could you please share some relevant details of the virtual machine?
Thanks & Regards,
Shaik Rabiya
Hi,
Could you please respond to my earlier reply? We are awaiting your response to continue investigating.
Thanks & Regards,
Shaik Rabiya
Hello Shaik,
Could you please elaborate on the information you need besides the output of cpuinfo.exe? I am not an expert with Windows virtual machines, so I am not sure what details you are looking for.
-David
Hi,
I apologize for causing confusion. The hardware details of your virtual machine can be fetched with the management tools of your virtualization platform (VMware Workstation, VirtualBox, or Hyper-V). Since the VM runs Windows, you can instead provide the output of the systeminfo command, or press Win + R, enter msinfo32, and send a screenshot of the System Summary.
Thanks & Regards,
Shaik Rabiya
Systeminfo output:
Host Name: EU-VW-NUM02
OS Name: Microsoft Windows Server 2019 Standard
OS Version: 10.0.17763 N/A Build 17763
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Server
OS Build Type: Multiprocessor Free
Registered Owner: Windows User
Registered Organization:
Product ID: 00429-70000-00000-AA641
Original Install Date: 5/24/2022, 3:34:17 PM
System Boot Time: 9/21/2023, 8:37:23 PM
System Manufacturer: Microsoft Corporation
System Model: Virtual Machine
System Type: x64-based PC
Processor(s): 14 Processor(s) Installed.
[01]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[02]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[03]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[04]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[05]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[06]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[07]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[08]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[09]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[10]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[11]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[12]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[13]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[14]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
BIOS Version: Microsoft Corporation Hyper-V UEFI Release v4.0, 12/17/2019
Windows Directory: C:\Windows
System Directory: C:\Windows\system32
Boot Device: \Device\HarddiskVolume2
System Locale: en-us;English (United States)
Input Locale: en-us;English (United States)
Time Zone: (UTC+01:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
Total Physical Memory: 319,999 MB
Available Physical Memory: 276,371 MB
Virtual Memory: Max Size: 336,177 MB
Virtual Memory: Available: 286,419 MB
Virtual Memory: In Use: 49,758 MB
Page File Location(s): C:\pagefile.sys
Domain: global.cadence.com
Logon Server: \\DCEUGLOBAL
Hotfix(s): 15 Hotfix(s) Installed.
[01]: KB5029925
[02]: KB4486153
[03]: KB4512577
[04]: KB4535680
[05]: KB4589208
[06]: KB5005112
[07]: KB5030214
[08]: KB5012675
[09]: KB5014797
[10]: KB5015896
[11]: KB5017400
[12]: KB5020374
[13]: KB5023789
[14]: KB5028316
[15]: KB5030505
Network Card(s): 1 NIC(s) Installed.
[01]: Microsoft Hyper-V Network Adapter
Connection Name: Ethernet
DHCP Enabled: No
IP address(es)
[01]: 10.160.88.207
[02]: fe80::ea25:862b:50ff:37a3
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
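When comparing several machines, the two fields most relevant to this thread (the system model and the installed processor count) can be pulled out of saved systeminfo text with a few lines of Python. This is only a sketch that assumes the English-language field labels shown above; `summarize_systeminfo` is a made-up helper name.

```python
import re


def summarize_systeminfo(text):
    """Extract the system model and installed-processor count from systeminfo text."""
    model = re.search(r"^System Model:\s*(.+)$", text, re.MULTILINE)
    procs = re.search(
        r"^Processor\(s\):\s*(\d+)\s+Processor\(s\) Installed", text, re.MULTILINE
    )
    return {
        "system_model": model.group(1).strip() if model else None,
        "processors_installed": int(procs.group(1)) if procs else None,
    }


# Three lines copied from the output above, used only as sample input.
SAMPLE = """\
System Manufacturer: Microsoft Corporation
System Model: Virtual Machine
Processor(s): 14 Processor(s) Installed.
"""

print(summarize_systeminfo(SAMPLE))
```

Fields that are missing from the input come back as None, so the same function can be run over partial or localized output without raising.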
Hi,
Thank you for your patience. It seems to us that your hypervisor configuration may be a cause of the issue. Each core in your machine is being treated as a processor, so in the cpuinfo output Intel MPI in turn identifies each processor as a socket. The large number of sockets in the cpuinfo output may in fact be cores (i.e., 14). If this is a two-socket system, it may have 80 logical cores, which is the value given under processors (CPUs) in your cpuinfo output. Terms such as processors, CPUs, and sockets are used inconsistently across VM tools. To confirm your socket count, could you please provide a screenshot of Task Manager? Please follow these steps:
1. Press Ctrl + Shift + Esc to open Task Manager.
2. Click the Performance tab.
3. You can view your processor details like the number of sockets, cores, logical processors, etc.
According to the specification for your processor (https://ark.intel.com/content/www/us/en/ark/products/199352/intel-xeon-gold-6242r-processor-35-75m-cache-3-10-ghz.html), it has 20 cores per socket. In virtual machines, not all cores have to be exposed, but we would like to confirm this, since only 14 cores are visible to us instead of 20. Could you follow the steps in the link below to display and use all your cores?
https://www.intel.com/content/www/us/en/support/articles/000056742/processors.html
After changing your core settings, could you please run your script again and let us know your findings?
Thanks & Regards,
Shaik Rabiya
Hello Rabiya,
Apologies for the slow response. The output from the task manager -> Performance tab is shown below.
Unfortunately, I don't have admin access to this machine, which is a shared resource in use by multiple groups. I am not able to change any settings and I can't perform a reboot. I checked the setting described in the link you provided, and it looks to be configured properly (Number of processors unchecked).
Thanks,
-David
Hi,
We have not heard from you. Could you please respond to my previous reply?
Thanks & Regards,
Shaik Rabiya
Hi,
We regret to inform you that you may have to follow all of the steps in the documentation we provided to check whether this fixes the crash. Could you please try contacting your admin to have these changes made? If that's not possible, you would have to use the I_MPI_PIN_DOMAIN=socket workaround to run on your system. We apologize, but we may have to close the thread if the VM configuration changes can't be made. Could we go ahead and close the thread in that case?
Thanks & Regards,
Shaik Rabiya
Hello Shaik,
Yes, please go ahead and close the thread. I'm not all that interested in actually running anything on this particular node; it was just where I was able to reproduce the behavior reported by other users. My main goal is to make Intel aware of the problem so that it can be better detected and fixed in future Intel MPI releases.
Thanks,
-David
Hi,
As per your confirmation, we are going ahead and closing this thread. Thanks for your feedback; we have passed it on to the relevant team. If you have any additional queries, you can raise a new question, as this thread will no longer be monitored by Intel.
Thanks & Regards,
Shaik Rabiya
