Intel® MPI Library

invalid default process mapping and pinning on Windows with IntelMPI 2021.10

gdavid
Beginner
6,393 Views

Hello support,

 

I am investigating recurring problems on Windows Server 2019 with Intel MPI 2021.10. mpiexec.exe consistently fails with the message below, but only at certain process counts. On our node we are able to run with 13 and 15 processes, but not 14. The behavior was first observed with a commercial CFD solver, but I have been able to reproduce identical behavior with the IMB-MPI1.exe benchmark. The attached batch file and output show the cpuinfo.exe output and the behavior of the benchmark on 13, 15, and 14 slots.

[mpiexec@eu-vw-num02] check_downstream_work_complition (mpiexec.c:1307): trying to close other downstreams
[mpiexec@eu-vw-num02] HYD_sock_write (..\windows\src\hydra_sock.c:387): write error (errno = 0)
[mpiexec@eu-vw-num02] wmain (mpiexec.c:2275): assert (pg->intel.exitcodes != NULL) failed
[mpiexec@eu-vw-num02] HYD_sock_write (..\windows\src\hydra_sock.c:387): write error (errno = 0)
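
A minimal sketch of how the failing process counts could be isolated by sweeping -np values. It assumes that a failed launch makes mpiexec return a nonzero exit code, which this thread does not confirm; the helper names (run_mpiexec, sweep, failing) are hypothetical and only for illustration.

```python
import shutil
import subprocess


def run_mpiexec(np, exe="IMB-MPI1.exe"):
    """Launch the benchmark with `np` processes and return mpiexec's exit code."""
    return subprocess.run(["mpiexec", "-np", str(np), exe]).returncode


def sweep(counts, runner=run_mpiexec):
    """Return {process count: exit code} for every count in `counts`."""
    return {np: runner(np) for np in counts}


def failing(results):
    """Return the sorted process counts whose exit code was nonzero (assumed failure)."""
    return sorted(np for np, code in results.items() if code != 0)


if __name__ == "__main__":
    # Only attempt a real run when mpiexec is available on PATH.
    if shutil.which("mpiexec"):
        print("failing counts:", failing(sweep([13, 14, 15])))
    else:
        print("mpiexec not found on PATH; nothing to run")
```

The runner is injectable so the sweep logic can be checked without an MPI installation, for example with a stub that fails only at 14 processes.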

I ran with I_MPI_DEBUG and at first glance the  pinning table seems suspicious.  See the output for -np 13 below.  

[0] MPI startup(): Rank    Pid      Node name    Pin cpu
[0] MPI startup(): 0       78096    eu-vw-num02  0,1,2,3,4,5,64,66,67,69,70,73,74,78,79
[0] MPI startup(): 1       50132    eu-vw-num02  6,7,8,9,10,11,64,66,68,69,71,73,74,78,79
[0] MPI startup(): 2       44600    eu-vw-num02  12,13,14,15,16,17,65,67,70,73,74,78,79
[0] MPI startup(): 3       77216    eu-vw-num02  18,19,20,21,22,23,66,67,68,69,70,73,74,78,79
[0] MPI startup(): 4       61268    eu-vw-num02  24,25,26,27,28,29,65,66,67,69,70,73,74,78,79
[0] MPI startup(): 5       81108    eu-vw-num02  30,31,32,33,34,35,65,66,67,68,70,73,74,78,79
[0] MPI startup(): 6       29248    eu-vw-num02  0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,64,67,69,70,73,74,78,79
[0] MPI startup(): 7       6256     eu-vw-num02  2,3,4,5,6,7,64,68,70,73,74,78,79
[0] MPI startup(): 8       31184    eu-vw-num02  8,9,10,11,12,13,66,68,69,71,73,74,78,79
[0] MPI startup(): 9       60324    eu-vw-num02  14,15,16,17,18,19,64,65,67,70,73,74,78,79
[0] MPI startup(): 10      53076    eu-vw-num02  20,21,22,23,24,25,64,65,66,67,69,70,73,74,78,79
[0] MPI startup(): 11      82048    eu-vw-num02  26,27,28,29,30,31,65,66,68,70,73,74,78,79
[0] MPI startup(): 12      34760    eu-vw-num02  32,33,34,35,36,37,64,65,66,67,68,70,73,74,78,79
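
One way to quantify why the table looks suspicious is to check which ranks share CPU ids. The sketch below parses rows in the format shown above and reports overlapping pairs. It assumes that the default pin domains are expected to be disjoint, and the function names are hypothetical.

```python
import re
from itertools import combinations

# Matches data rows such as:
# [0] MPI startup(): 7       6256     eu-vw-num02  2,3,4,5,6,7,64,68
# The header row does not match because "Rank" is not a number.
ROW = re.compile(r"MPI startup\(\):\s+(\d+)\s+(\d+)\s+(\S+)\s+([\d,]+)\s*$")


def parse_pinning(text):
    """Return {rank: set of CPU ids} from I_MPI_DEBUG pinning rows."""
    pins = {}
    for line in text.splitlines():
        m = ROW.search(line)
        if m:
            pins[int(m.group(1))] = {int(c) for c in m.group(4).split(",")}
    return pins


def overlaps(pins):
    """Return {(rank_a, rank_b): shared CPU ids} for every overlapping pair."""
    shared = {}
    for a, b in combinations(sorted(pins), 2):
        common = pins[a] & pins[b]
        if common:
            shared[(a, b)] = common
    return shared


# Three rows copied from the table above.
SAMPLE = """\
[0] MPI startup(): Rank    Pid      Node name    Pin cpu
[0] MPI startup(): 0       78096    eu-vw-num02  0,1,2,3,4,5,64,66,67,69,70,73,74,78,79
[0] MPI startup(): 1       50132    eu-vw-num02  6,7,8,9,10,11,64,66,68,69,71,73,74,78,79
[0] MPI startup(): 7       6256     eu-vw-num02  2,3,4,5,6,7,64,68,70,73,74,78,79
"""

if __name__ == "__main__":
    for pair, cpus in overlaps(parse_pinning(SAMPLE)).items():
        print(pair, sorted(cpus))
```

For these three rows, every pair of ranks shares CPUs 64, 73, 74, 78, and 79, and ranks 0 and 7 also share CPUs 2 through 5.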

Following this evidence I reran the previously failing -np 14 test with I_MPI_PIN_DOMAIN=socket.  This test works normally and the pinning table is reasonable.   Thus, my conclusion is that there is something wrong with the default pinning on Windows. 
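
The workaround run can be sketched as a small launcher that passes the variable through mpiexec's -genv option. The debug level of 5 is an assumption (the thread only says I_MPI_DEBUG was set), and build_command is a hypothetical helper.

```python
def build_command(np, exe, pin_domain="socket", debug_level=5):
    """Build an mpiexec argument list that sets the pinning domain via -genv.

    debug_level=5 is an assumed value; the original post does not state
    which I_MPI_DEBUG level was used.
    """
    return [
        "mpiexec",
        "-genv", "I_MPI_PIN_DOMAIN", pin_domain,
        "-genv", "I_MPI_DEBUG", str(debug_level),
        "-np", str(np),
        exe,
    ]


if __name__ == "__main__":
    # Print the command line for the previously failing 14-process case.
    print(" ".join(build_command(14, "IMB-MPI1.exe")))
```

Setting I_MPI_PIN_DOMAIN in the environment before calling mpiexec should be equivalent; -genv is used here only to keep the setting visible on the command line.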

 

Could you please confirm if this is a known issue and if the I_MPI_PIN_DOMAIN workaround is safe?  Some of our customers are encountering the same behavior  on their Windows machines. 

 

Thanks,

-David

0 Kudos
16 Replies
RabiyaSK_Intel
Employee
6,328 Views

Hi,


Thanks for posting in Intel Communities.


We have informed the concerned development team. We will get back to you soon.


Thanks & Regards,

Shaik Rabiya



0 Kudos
RabiyaSK_Intel
Employee
6,249 Views

Hi,


Thank you for your patience.


>>>Could you please confirm if this is a known issue and if the I_MPI_PIN_DOMAIN workaround is safe? Some of our customers are encountering the same behavior on their Windows machines.

Explicitly pinning with environment variable I_MPI_PIN_DOMAIN=socket is a safe/valid workaround. Could you please specify why you feel that this workaround may not be safe?  


We are still investigating the suspicious pinning and the error, and we will get back to you soon.


Thanks & Regards,

Shaik Rabiya


0 Kudos
gdavid
Beginner
6,201 Views

Hello Shaik,

Thanks for the feedback.  I will continue with the I_MPI_PIN_DOMAIN workaround for the time being.  It seems safe to me as well, but as the default pinning was suspicious I just wanted to check if Intel was aware of the problem and if there was a better workaround. 

Please let me know if this issue is fixed in a future release and I will upgrade.

Thanks,

-David

0 Kudos
RabiyaSK_Intel
Employee
6,178 Views

Hi,


Could you please confirm whether you are using a virtual machine, since the reported number of sockets is unusually high? If so, could you please provide the topology of your virtual machine so that we can investigate your issue more closely?


Thanks & Regards,

Shaik Rabiya


0 Kudos
gdavid
Beginner
6,158 Views

Hello Shaik,

Yes, I believe this is a VM, although I am not 100% sure as I have never accessed this machine directly, only over remote desktop. The details of the node topology are shown in the cpuinfo.exe output in my original message. If there are any other details you need, please let me know and I'll try to gather the information.

-David

0 Kudos
RabiyaSK_Intel
Employee
6,146 Views

Hi,


I apologize for overlooking the topological information provided in the script. Thanks for confirming that the machine is a virtual machine. Could you please share some relevant details of the virtual machine?


Thanks & Regards,

Shaik Rabiya


0 Kudos
RabiyaSK_Intel
Employee
6,073 Views

Hi,


Could you please respond to my earlier reply? We are awaiting your response to continue investigating.


Thanks & Regards,

Shaik Rabiya


0 Kudos
gdavid
Beginner
6,053 Views

Hello Shaik,

 

Could you please elaborate on the information you need, besides the output of cpuinfo.exe? I am not an expert with Windows virtual machines, so I am not sure what details you are looking for.

 

-David

0 Kudos
RabiyaSK_Intel
Employee
6,043 Views

Hi,


I apologize for causing confusion. You would need to use the management tools of your virtualization platform (VMware Workstation, VirtualBox, or Hyper-V) to fetch the hardware details of your virtual machine. Since the VM runs Windows, you can provide the output of the systeminfo command, or you can press Win + R, enter msinfo32, and send a screenshot of the System Summary.


Thanks & Regards,

Shaik Rabiya


0 Kudos
gdavid
Beginner
6,035 Views

Systeminfo output:

Host Name: EU-VW-NUM02
OS Name: Microsoft Windows Server 2019 Standard
OS Version: 10.0.17763 N/A Build 17763
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Server
OS Build Type: Multiprocessor Free
Registered Owner: Windows User
Registered Organization:
Product ID: 00429-70000-00000-AA641
Original Install Date: 5/24/2022, 3:34:17 PM
System Boot Time: 9/21/2023, 8:37:23 PM
System Manufacturer: Microsoft Corporation
System Model: Virtual Machine
System Type: x64-based PC
Processor(s): 14 Processor(s) Installed.
[01]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[02]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[03]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[04]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[05]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[06]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[07]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[08]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[09]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[10]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[11]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[12]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[13]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
[14]: Intel64 Family 6 Model 85 Stepping 7 GenuineIntel ~3093 Mhz
BIOS Version: Microsoft Corporation Hyper-V UEFI Release v4.0, 12/17/2019
Windows Directory: C:\Windows
System Directory: C:\Windows\system32
Boot Device: \Device\HarddiskVolume2
System Locale: en-us;English (United States)
Input Locale: en-us;English (United States)
Time Zone: (UTC+01:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
Total Physical Memory: 319,999 MB
Available Physical Memory: 276,371 MB
Virtual Memory: Max Size: 336,177 MB
Virtual Memory: Available: 286,419 MB
Virtual Memory: In Use: 49,758 MB
Page File Location(s): C:\pagefile.sys
Domain: global.cadence.com
Logon Server: \\DCEUGLOBAL
Hotfix(s): 15 Hotfix(s) Installed.
[01]: KB5029925
[02]: KB4486153
[03]: KB4512577
[04]: KB4535680
[05]: KB4589208
[06]: KB5005112
[07]: KB5030214
[08]: KB5012675
[09]: KB5014797
[10]: KB5015896
[11]: KB5017400
[12]: KB5020374
[13]: KB5023789
[14]: KB5028316
[15]: KB5030505
Network Card(s): 1 NIC(s) Installed.
[01]: Microsoft Hyper-V Network Adapter
Connection Name: Ethernet
DHCP Enabled: No
IP address(es)
[01]: 10.160.88.207
[02]: fe80::ea25:862b:50ff:37a3
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
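
For readers who want to pull the relevant fields out of a report like this automatically, here is a minimal sketch. It assumes English-locale systeminfo output with the labels shown above; summarize_systeminfo is a hypothetical helper, and the sample lines are copied from this post.

```python
import re


def summarize_systeminfo(text):
    """Extract the installed processor count and VM hints from systeminfo text."""
    procs = re.search(r"Processor\(s\):\s+(\d+)\s+Processor\(s\) Installed", text)
    model = re.search(r"System Model:\s*(.+)", text)
    return {
        "processors": int(procs.group(1)) if procs else None,
        "system_model": model.group(1).strip() if model else None,
        "hypervisor_detected": "A hypervisor has been detected" in text,
    }


# A few lines copied from the output above.
SAMPLE = """\
System Manufacturer: Microsoft Corporation
System Model: Virtual Machine
Processor(s): 14 Processor(s) Installed.
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
"""

if __name__ == "__main__":
    print(summarize_systeminfo(SAMPLE))
```

In this report the guest sees 14 installed processors and identifies itself as a Hyper-V virtual machine.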

0 Kudos
RabiyaSK_Intel
Employee
5,790 Views

Hi,


Thank you for your patience. It seems to us that the configuration of your virtual machine software (hypervisor) may be a cause of the issue. Each core in your machine is being presented as a processor, so cpuinfo in Intel MPI in turn identifies those processors as sockets. The large number of sockets reported by cpuinfo may in fact be cores (i.e., 14). If this is a two-socket system, it may have 80 logical cores, which is the value given under processors (CPUs) in your cpuinfo output. Terms like processors, CPUs, and sockets are used inconsistently across VM tools. To confirm your socket count, could you please provide a screenshot of the Task Manager? Please follow these steps:

1. Press Ctrl + Shift + Esc to open Task Manager

2. Click on Performance Tab

3. You can view your processor details like the number of sockets, cores, logical processors, etc.


According to the specification of your processor at the following link, it has 20 cores per socket:

https://ark.intel.com/content/www/us/en/ark/products/199352/intel-xeon-gold-6242r-processor-35-75m-cache-3-10-ghz.html

In virtual machines, not all cores have to be exposed, but we want to confirm this, as only 14 cores are visible to us instead of 20. Could you follow the steps in the link below to display and use all your cores:

https://www.intel.com/content/www/us/en/support/articles/000056742/processors.html


After changing your core settings, could you please try executing your script again and let us know your findings?


Thanks & Regards,

Shaik Rabiya



0 Kudos
gdavid
Beginner
5,611 Views

Hello Rabiya,

 

Apologies for the slow response. The output from the task manager -> Performance tab is shown below.

gdavid_0-1697664123992.png

 

Unfortunately, I don't have admin access to this machine, which is a shared resource in use by multiple groups. I am not able to change any settings and I can't perform a reboot. I checked the settings in the link you provided and it looks to be configured properly (Number of processors unchecked).

gdavid_1-1697664284189.png

Thanks,

-David

 

0 Kudos
RabiyaSK_Intel
Employee
5,646 Views

Hi,


We have not heard from you. Could you please respond to my previous reply?


Thanks & Regards,

Shaik Rabiya


0 Kudos
RabiyaSK_Intel
Employee
5,559 Views

Hi,

 

We regret to inform you that you may have to follow the entire documentation that we provided to check if this fixes the crash. Could you please try contacting your admin to have these changes made? If that's not possible, you would have to use the I_MPI_PIN_DOMAIN=socket workaround to run on your system. We apologize, but we may have to close the thread if the VM configuration can't be changed. Could we go ahead and close the thread in that case?

 

Thanks & Regards,

Shaik Rabiya

 

0 Kudos
gdavid
Beginner
5,536 Views

Hello Shaik,

 

Yes, please go ahead and close the bug. I'm not all that interested in actually running anything on this particular node; it was just where I was able to reproduce the behavior reported by other users. My main goal is to make Intel aware of the problem so it can be better detected and fixed in future Intel MPI releases.

 

Thanks,

-David

0 Kudos
RabiyaSK_Intel
Employee
5,412 Views

Hi,

 

As per your confirmation we are going ahead and closing this thread. Thanks for your feedback, we have provided it to the concerned team. If you have any additional queries, you can raise a new question as this thread will no longer be monitored by Intel.

 

Thanks & Regards,

Shaik Rabiya

 

0 Kudos