Intel® MPI Library

64-core simulation on quad-socket Intel Xeon on Windows Server 2022 does not start properly

Frank_R_1
New Contributor I

Dear support,

We have a customer with the following computer and configuration:

4 x Intel Xeon Gold 6254 (quad socket, 4 x 18 = 72 cores)
24 x 8 GB RAM
Windows Server 2022
Windows Task Manager shows 4 NUMA domains
Hyperthreading is disabled

We start our product (with Intel(R) MPI Library, Version 2021.7 Build 20220909) as follows:
mpiexec.exe -delegate -genvall -print-all-exitcodes -genv I_MPI_HYDRA_DEBUG 1 -genv I_MPI_DEBUG 500 -genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1 -genv I_MPI_CBWR 2 -genv I_MPI_ADJUST_GATHERV 3 -envall -localroot -n 64 #programpath

The problem is that the 64 processes start on 64 cores distributed over all 4 NUMA domains, but are immediately redistributed onto only 2 NUMA domains, which leads to oversubscription.

Some output from MPI:
[proxy:0:0@detorsrv007] Warning - oversubscription detected: 64 processes will be placed on 54 cores

[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 15736 detorsrv007 0
[0] MPI startup(): 1 14860 detorsrv007 1
[0] MPI startup(): 2 11560 detorsrv007 2
[0] MPI startup(): 3 9696 detorsrv007 3
[0] MPI startup(): 4 5788 detorsrv007 4
[0] MPI startup(): 5 12560 detorsrv007 5
[0] MPI startup(): 6 13628 detorsrv007 6
[0] MPI startup(): 7 16100 detorsrv007 7
[0] MPI startup(): 8 15540 detorsrv007 8
[0] MPI startup(): 9 15464 detorsrv007 9
[0] MPI startup(): 10 13716 detorsrv007 10
[0] MPI startup(): 11 12504 detorsrv007 11
[0] MPI startup(): 12 8796 detorsrv007 12
[0] MPI startup(): 13 8924 detorsrv007 13
[0] MPI startup(): 14 1168 detorsrv007 14
[0] MPI startup(): 15 13316 detorsrv007 15
[0] MPI startup(): 16 16212 detorsrv007 16
[0] MPI startup(): 17 14516 detorsrv007 17
[0] MPI startup(): 18 13844 detorsrv007 18
[0] MPI startup(): 19 12268 detorsrv007 19
[0] MPI startup(): 20 9208 detorsrv007 20
[0] MPI startup(): 21 14912 detorsrv007 21
[0] MPI startup(): 22 12760 detorsrv007 22
[0] MPI startup(): 23 12312 detorsrv007 23
[0] MPI startup(): 24 3856 detorsrv007 24
[0] MPI startup(): 25 2924 detorsrv007 25
[0] MPI startup(): 26 15036 detorsrv007 26
[0] MPI startup(): 27 13348 detorsrv007 27
[0] MPI startup(): 28 12316 detorsrv007 28
[0] MPI startup(): 29 15028 detorsrv007 29
[0] MPI startup(): 30 9316 detorsrv007 30
[0] MPI startup(): 31 2000 detorsrv007 31
[0] MPI startup(): 32 11196 detorsrv007 32
[0] MPI startup(): 33 11596 detorsrv007 33
[0] MPI startup(): 34 9640 detorsrv007 34
[0] MPI startup(): 35 14072 detorsrv007 35
[0] MPI startup(): 36 14504 detorsrv007 0
[0] MPI startup(): 37 13492 detorsrv007 1
[0] MPI startup(): 38 13084 detorsrv007 2
[0] MPI startup(): 39 9140 detorsrv007 3
[0] MPI startup(): 40 9084 detorsrv007 4
[0] MPI startup(): 41 12776 detorsrv007 5
[0] MPI startup(): 42 3908 detorsrv007 6
[0] MPI startup(): 43 4180 detorsrv007 7
[0] MPI startup(): 44 12232 detorsrv007 8
[0] MPI startup(): 45 15528 detorsrv007 9
[0] MPI startup(): 46 11816 detorsrv007 10
[0] MPI startup(): 47 14224 detorsrv007 11
[0] MPI startup(): 48 15864 detorsrv007 12
[0] MPI startup(): 49 13064 detorsrv007 13
[0] MPI startup(): 50 13456 detorsrv007 14
[0] MPI startup(): 51 12496 detorsrv007 15
[0] MPI startup(): 52 11672 detorsrv007 16
[0] MPI startup(): 53 9300 detorsrv007 17
[0] MPI startup(): 54 14868 detorsrv007 18
[0] MPI startup(): 55 1044 detorsrv007 19
[0] MPI startup(): 56 16340 detorsrv007 20
[0] MPI startup(): 57 1160 detorsrv007 21
[0] MPI startup(): 58 9432 detorsrv007 22
[0] MPI startup(): 59 7764 detorsrv007 23
[0] MPI startup(): 60 13524 detorsrv007 24
[0] MPI startup(): 61 15128 detorsrv007 25
[0] MPI startup(): 62 14988 detorsrv007 26
[0] MPI startup(): 63 11896 detorsrv007 27
[0] MPI startup(): I_MPI_HYDRA_DEBUG=1
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BSTRAP_KEEP_ALIVE=1
[0] MPI startup(): I_MPI_ADJUST_GATHERV=3
[0] MPI startup(): I_MPI_CBWR=2
[0] MPI startup(): I_MPI_DEBUG=500

Please see the attached full MPI debug output.

Using
-genv I_MPI_FABRICS shm
does not help to get it running on 64 cores across 4 NUMA domains.
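
For reference, here is a minimal sketch of what an explicit-pinning attempt would look like (I_MPI_PIN, I_MPI_PIN_PROCESSOR_LIST and I_MPI_PIN_ORDER are the documented Intel MPI pinning controls; the list 0-63 is only illustrative and assumes the 64 physical cores are enumerated as logical processors 0-63):

mpiexec.exe -delegate -genvall -localroot ^
  -genv I_MPI_PIN 1 ^
  -genv I_MPI_PIN_PROCESSOR_LIST 0-63 ^
  -genv I_MPI_PIN_ORDER range ^
  -genv I_MPI_DEBUG 500 ^
  -n 64 #programpath

The mapping that Intel MPI actually applies can then be checked in the I_MPI_DEBUG pin table, as shown above.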

What can we do to run the simulation properly? (It has to run on 64 cores!)

Best regards

Frank

Frank_R_1
New Contributor I

Hi,

 

Since our customer frequently asks for a solution, I would like to ask again whether the next Intel MPI 2021.12 release will contain a fix for this issue.

I am also interested in what exactly the problem is for Intel MPI on a four-socket system under Windows.

 

Best regards

Frank

TobiasK
Moderator

@Frank_R_1 
this information is not publicly available.
The next release will have a reworked pinning infrastructure on Windows, which we hope will resolve the issue. However, we cannot share a preview with forum users.

Frank_R_1
New Contributor I

Hi,

 

First of all, many thanks to the developer team for fixing this issue!

We tried it on a dual-socket Intel Xeon Platinum (2 x 36 cores) and it worked out of the box with correct pinning. Perfect!

We tried it on a quad-socket Intel Xeon Gold (4 x 18 cores) and it worked out of the box with correct pinning. Perfect!

However, in my opinion the output of cpuinfo.exe on the quad-socket machine is not correct (the machine has 4 x 18 = 72 cores, so 54 cores in total and 13 cores per package cannot be right; see the cross-check sketch after the cpuinfo output below):

Intel(R) processor family information utility, Version 2021.12 Build 20240213
Copyright (C) 2005-2024 Intel Corporation. All rights reserved.

===== Processor composition =====
Processor name : Intel(R) Xeon(R) Gold 6254
Packages(sockets) : 4
Cores : 54                                           <---------------------?
Processors(CPUs) : 72
Cores per package : 13                    <---------------------?
Threads per core : 1

===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
4 0 4 0
5 0 8 0
6 0 9 0
7 0 10 0
8 0 11 0
9 0 16 0
10 0 17 0
11 0 18 0
12 0 19 0
13 0 20 0
14 0 24 0
15 0 25 0
16 0 26 0
17 0 27 0
18 0 0 2
19 0 1 2
20 0 2 2
21 0 3 2
22 0 4 2
23 0 8 2
24 0 9 2
25 0 10 2
26 0 11 2
27 0 16 2
28 0 17 2
29 0 18 2
30 0 19 2
31 0 20 2
32 0 24 2
33 0 25 2
34 0 26 2
35 0 27 2
36 0 0 1
37 0 1 1
38 0 2 1
39 0 3 1
40 0 4 1
41 0 8 1
42 0 9 1
43 0 10 1
44 0 11 1
45 0 16 1
46 0 17 1
47 0 18 1
48 0 19 1
49 0 20 1
50 0 24 1
51 0 25 1
52 0 26 1
53 0 27 1
54 0 0 3
55 0 1 3
56 0 2 3
57 0 3 3
58 0 4 3
59 0 8 3
60 0 9 3
61 0 10 3
62 0 11 3
63 0 16 3
64 0 17 3
65 0 18 3
66 0 19 3
67 0 20 3
68 0 24 3
69 0 25 3
70 0 26 3
71 0 27 3
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2 0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27 18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
1 0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27 36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53
3 0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27 54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71

===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 1 MB no sharing
L3 24 MB (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)(18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35)(36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53)(54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71)
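
For reference, the physical layout can be cross-checked with standard Windows tools (a sketch only; both commands query Win32_Processor and are independent of cpuinfo.exe):

wmic cpu get DeviceID,NumberOfCores,NumberOfLogicalProcessors
powershell -Command "Get-CimInstance Win32_Processor | Select-Object DeviceID,NumberOfCores,NumberOfLogicalProcessors"

On this machine each of the four packages should report 18 cores and 18 logical processors, which would indicate that the 54 cores / 13 cores-per-package values above are a reporting issue in cpuinfo.exe.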

 

One last question, which you might not want to answer because it is about a dual-socket AMD Epyc (2 x 64 cores) under Windows:

When sub-NUMA clustering is enabled in the BIOS, Windows shows 2 x 4 = 8 NUMA domains (which is correct by chip design), but the Intel MPI library does not start on more than 2 cores. Only after deactivating sub-NUMA clustering does the library work correctly.

When we boot the same machine into Linux with sub-NUMA clustering enabled, we get very good performance. On Windows this is not the case.

 

Nevertheless, thanks again and best regards

Frank

TobiasK
Moderator

Hi @Frank_R_1 
thanks for the feedback.

Can you please attach the debug output of the AMD issue with and without SNC enabled?
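
For example, something along these lines would already help (a sketch only; the process count and file names are arbitrary, my_app.exe is a placeholder for your application, and the run should be repeated once with SNC enabled and once with it disabled):

rem my_app.exe is a placeholder for the application under test
mpiexec.exe -n 8 -genv I_MPI_DEBUG 10 -genv I_MPI_HYDRA_DEBUG 1 my_app.exe > snc_on.txt 2>&1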

Best
Tobias

Frank_R_1
New Contributor I

Hi,

 

Please find attached a zip file with the following information:

1.txt: run with 1 core
2.txt: run with 2 cores
4.txt: run with 4 cores
cpuinfo.txt: cpuinfo.exe output
cpu-z.txt: CPU-Z output
Errors_8_16_32_64_128_processes.txt: errors when running with 8, 16, 32, 64, and 128 processes

 

Best regards

Frank
