Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI job not starting between two nodes

paul312
Beginner

I have a cluster of six Dell R650 servers with InfiniBand between the nodes. I am running the latest release of the Intel oneAPI Base Toolkit and HPC Toolkit, installed using DNF. I am confused about why a job will not run across two of the nodes and would love some help debugging the problem. I have a large code (VASP) compiled with the oneAPI compiler suite; it works fine and passes all of its internal tests. When I run a job on one node (neutrino), the code runs fine. The program also runs fine on the second node in question (pion). The command "mpiexec.hydra -n 64 -hosts neutrino,pion hostname" comes back with the hostnames for all 64 ranks as expected. The Intel MPI test code (test.f90) also runs across both nodes. Here is the output:

 

(pymatgen) paulfons@neutrino:~/test>mpiexec.hydra -n 64 -host localhost,pion-ib  -genv I_MPI_DEBUG 5 ./testf90
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       504495   neutrino   0
[0] MPI startup(): 1       504496   neutrino   2
[0] MPI startup(): 2       504497   neutrino   4
[0] MPI startup(): 3       504498   neutrino   6
[0] MPI startup(): 4       504499   neutrino   8
[0] MPI startup(): 5       504500   neutrino   10
[0] MPI startup(): 6       504501   neutrino   12
[0] MPI startup(): 7       504502   neutrino   14
[0] MPI startup(): 8       504503   neutrino   16
[0] MPI startup(): 9       504504   neutrino   18
[0] MPI startup(): 10      504505   neutrino   20
[0] MPI startup(): 11      504506   neutrino   22
[0] MPI startup(): 12      504507   neutrino   24
[0] MPI startup(): 13      504508   neutrino   26
[0] MPI startup(): 14      504509   neutrino   28
[0] MPI startup(): 15      504510   neutrino   30
[0] MPI startup(): 16      504511   neutrino   1
[0] MPI startup(): 17      504512   neutrino   3
[0] MPI startup(): 18      504513   neutrino   5
[0] MPI startup(): 19      504514   neutrino   7
[0] MPI startup(): 20      504515   neutrino   9
[0] MPI startup(): 21      504516   neutrino   11
[0] MPI startup(): 22      504517   neutrino   13
[0] MPI startup(): 23      504518   neutrino   15
[0] MPI startup(): 24      504519   neutrino   17
[0] MPI startup(): 25      504520   neutrino   19
[0] MPI startup(): 26      504521   neutrino   21
[0] MPI startup(): 27      504522   neutrino   23
[0] MPI startup(): 28      504523   neutrino   25
[0] MPI startup(): 29      504524   neutrino   27
[0] MPI startup(): 30      504525   neutrino   29
[0] MPI startup(): 31      504526   neutrino   31
[0] MPI startup(): 32      549528   pion       1
[0] MPI startup(): 33      549529   pion       3
[0] MPI startup(): 34      549530   pion       5
[0] MPI startup(): 35      549531   pion       7
[0] MPI startup(): 36      549532   pion       9
[0] MPI startup(): 37      549533   pion       11
[0] MPI startup(): 38      549534   pion       13
[0] MPI startup(): 39      549535   pion       15
[0] MPI startup(): 40      549536   pion       17
[0] MPI startup(): 41      549537   pion       19
[0] MPI startup(): 42      549538   pion       21
[0] MPI startup(): 43      549539   pion       23
[0] MPI startup(): 44      549540   pion       25
[0] MPI startup(): 45      549541   pion       27
[0] MPI startup(): 46      549542   pion       29
[0] MPI startup(): 47      549543   pion       31
[0] MPI startup(): 48      549544   pion       0
[0] MPI startup(): 49      549545   pion       2
[0] MPI startup(): 50      549546   pion       4
[0] MPI startup(): 51      549547   pion       6
[0] MPI startup(): 52      549548   pion       8
[0] MPI startup(): 53      549549   pion       10
[0] MPI startup(): 54      549550   pion       12
[0] MPI startup(): 55      549551   pion       14
[0] MPI startup(): 56      549552   pion       16
[0] MPI startup(): 57      549553   pion       18
[0] MPI startup(): 58      549554   pion       20
[0] MPI startup(): 59      549555   pion       22
[0] MPI startup(): 60      549556   pion       24
[0] MPI startup(): 61      549557   pion       26
[0] MPI startup(): 62      549558   pion       28
[0] MPI startup(): 63      549559   pion       30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4
 Hello world: rank            0  of           64  running on neutrino
 Hello world: rank            1  of           64  running on neutrino
 Hello world: rank            2  of           64  running on neutrino
 skipped lines ...
 Hello world: rank           62  of           64  running on pion
 Hello world: rank           63  of           64  running on pion
 

 

The VASP code runs fine on neutrino. Here is the debug output (clipped right after VASP starts its normal output):

 

(pymatgen) paulfons@neutrino:/data/Vasp/GaAs>mpiexec.hydra -n 32 -genv I_MPI_DEBUG 5  -host pion  vasp_ncl
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       552541   pion       1
[0] MPI startup(): 1       552542   pion       3
[0] MPI startup(): 2       552543   pion       5
[0] MPI startup(): 3       552544   pion       7
[0] MPI startup(): 4       552545   pion       9
[0] MPI startup(): 5       552546   pion       11
[0] MPI startup(): 6       552547   pion       13
[0] MPI startup(): 7       552548   pion       15
[0] MPI startup(): 8       552549   pion       17
[0] MPI startup(): 9       552550   pion       19
[0] MPI startup(): 10      552551   pion       21
[0] MPI startup(): 11      552552   pion       23
[0] MPI startup(): 12      552553   pion       25
[0] MPI startup(): 13      552554   pion       27
[0] MPI startup(): 14      552555   pion       29
[0] MPI startup(): 15      552556   pion       31
[0] MPI startup(): 16      552557   pion       0
[0] MPI startup(): 17      552558   pion       2
[0] MPI startup(): 18      552559   pion       4
[0] MPI startup(): 19      552560   pion       6
[0] MPI startup(): 20      552561   pion       8
[0] MPI startup(): 21      552562   pion       10
[0] MPI startup(): 22      552563   pion       12
[0] MPI startup(): 23      552564   pion       14
[0] MPI startup(): 24      552565   pion       16
[0] MPI startup(): 25      552566   pion       18
[0] MPI startup(): 26      552567   pion       20
[0] MPI startup(): 27      552568   pion       22
[0] MPI startup(): 28      552569   pion       24
[0] MPI startup(): 29      552570   pion       26
[0] MPI startup(): 30      552571   pion       28
[0] MPI startup(): 31      552572   pion       30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=1
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4
 running   32 mpi-ranks, on    1 nodes
 distrk:  each k-point on    2 cores,   16 groups
 distr:  one band on    1 cores,    2 groups
 vasp.6.4.2 20Jul23 (build Oct 13 2023 16:03:31) complex                        
  
 POSCAR found type information on POSCAR GaAs
 POSCAR found :  2 types and       2 ions

 

The code also runs fine remotely on pion when launched from neutrino:

 

(pymatgen) paulfons@neutrino:/data/Vasp/GaAs>mpiexec.hydra -n 32 -genv I_MPI_DEBUG 5  -host pion  vasp_ncl
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       552415   pion       1
[0] MPI startup(): 1       552416   pion       3
[0] MPI startup(): 2       552417   pion       5
[0] MPI startup(): 3       552418   pion       7
[0] MPI startup(): 4       552419   pion       9
[0] MPI startup(): 5       552420   pion       11
[0] MPI startup(): 6       552421   pion       13
[0] MPI startup(): 7       552422   pion       15
[0] MPI startup(): 8       552423   pion       17
[0] MPI startup(): 9       552424   pion       19
[0] MPI startup(): 10      552425   pion       21
[0] MPI startup(): 11      552426   pion       23
[0] MPI startup(): 12      552427   pion       25
[0] MPI startup(): 13      552428   pion       27
[0] MPI startup(): 14      552429   pion       29
[0] MPI startup(): 15      552430   pion       31
[0] MPI startup(): 16      552431   pion       0
[0] MPI startup(): 17      552432   pion       2
[0] MPI startup(): 18      552433   pion       4
[0] MPI startup(): 19      552434   pion       6
[0] MPI startup(): 20      552435   pion       8
[0] MPI startup(): 21      552436   pion       10
[0] MPI startup(): 22      552437   pion       12
[0] MPI startup(): 23      552438   pion       14
[0] MPI startup(): 24      552439   pion       16
[0] MPI startup(): 25      552440   pion       18
[0] MPI startup(): 26      552441   pion       20
[0] MPI startup(): 27      552442   pion       22
[0] MPI startup(): 28      552443   pion       24
[0] MPI startup(): 29      552444   pion       26
[0] MPI startup(): 30      552445   pion       28
[0] MPI startup(): 31      552446   pion       30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=1
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4
 running   32 mpi-ranks, on    1 nodes
 distrk:  each k-point on    2 cores,   16 groups
 distr:  one band on    1 cores,    2 groups
 vasp.6.4.2 20Jul23 (build Oct 13 2023 16:03:31) complex                        
  
 POSCAR found type information on POSCAR GaAs
 POSCAR found :  2 types and       2 ions

 

When I try to run the code on both nodes using "mpiexec.hydra -n 64 -hosts localhost,pion vasp_ncl", the code apparently gets stuck. I can see 32 processes running on each of neutrino and pion, but the code seems to hang. Any idea what the cause might be and how to address it? Note that the same code runs fine between other pairs of nodes. In addition, the (identical) Intel oneAPI stack was installed on each node immediately after updating the nodes last week. This is a really strange problem and has me baffled.
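In case it helps, this is the kind of extra diagnostic run I can try next on the failing pair. It is only a sketch: I_MPI_DEBUG, I_MPI_HYDRA_DEBUG, FI_LOG_LEVEL, and FI_PROVIDER are standard Intel MPI / libfabric settings, and forcing the tcp provider should show whether the hang is specific to the mlx/UCX path between these two nodes.

# Re-run the hanging case with more verbose MPI, Hydra, and libfabric logging
mpiexec.hydra -n 64 -ppn 32 -hosts neutrino,pion \
    -genv I_MPI_DEBUG 30 -genv I_MPI_HYDRA_DEBUG 1 \
    -genv FI_LOG_LEVEL debug vasp_ncl

# Force the plain TCP provider as a slow, diagnostic-only comparison run
mpiexec.hydra -n 64 -ppn 32 -hosts neutrino,pion \
    -genv FI_PROVIDER tcp -genv I_MPI_DEBUG 5 vasp_ncl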

 

(pymatgen) paulfons@neutrino:/data/Vasp/GaAs>mpiexec.hydra -n 64 -genv I_MPI_DEBUG 5  -host neutrino,pion  vasp_ncl
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       505697   neutrino   0
[0] MPI startup(): 1       505698   neutrino   2
[0] MPI startup(): 2       505699   neutrino   4
[0] MPI startup(): 3       505700   neutrino   6
[0] MPI startup(): 4       505701   neutrino   8
[0] MPI startup(): 5       505702   neutrino   10
[0] MPI startup(): 6       505703   neutrino   12
[0] MPI startup(): 7       505704   neutrino   14
[0] MPI startup(): 8       505705   neutrino   16
[0] MPI startup(): 9       505706   neutrino   18
[0] MPI startup(): 10      505707   neutrino   20
[0] MPI startup(): 11      505708   neutrino   22
[0] MPI startup(): 12      505709   neutrino   24
[0] MPI startup(): 13      505710   neutrino   26
[0] MPI startup(): 14      505711   neutrino   28
[0] MPI startup(): 15      505712   neutrino   30
[0] MPI startup(): 16      505713   neutrino   1
[0] MPI startup(): 17      505714   neutrino   3
[0] MPI startup(): 18      505715   neutrino   5
[0] MPI startup(): 19      505716   neutrino   7
[0] MPI startup(): 20      505717   neutrino   9
[0] MPI startup(): 21      505718   neutrino   11
[0] MPI startup(): 22      505719   neutrino   13
[0] MPI startup(): 23      505720   neutrino   15
[0] MPI startup(): 24      505721   neutrino   17
[0] MPI startup(): 25      505722   neutrino   19
[0] MPI startup(): 26      505723   neutrino   21
[0] MPI startup(): 27      505724   neutrino   23
[0] MPI startup(): 28      505725   neutrino   25
[0] MPI startup(): 29      505726   neutrino   27
[0] MPI startup(): 30      505727   neutrino   29
[0] MPI startup(): 31      505728   neutrino   31
[0] MPI startup(): 32      552657   pion       1
[0] MPI startup(): 33      552658   pion       3
[0] MPI startup(): 34      552659   pion       5
[0] MPI startup(): 35      552660   pion       7
[0] MPI startup(): 36      552661   pion       9
[0] MPI startup(): 37      552662   pion       11
[0] MPI startup(): 38      552663   pion       13
[0] MPI startup(): 39      552664   pion       15
[0] MPI startup(): 40      552665   pion       17
[0] MPI startup(): 41      552666   pion       19
[0] MPI startup(): 42      552667   pion       21
[0] MPI startup(): 43      552668   pion       23
[0] MPI startup(): 44      552669   pion       25
[0] MPI startup(): 45      552670   pion       27
[0] MPI startup(): 46      552671   pion       29
[0] MPI startup(): 47      552672   pion       31
[0] MPI startup(): 48      552673   pion       0
[0] MPI startup(): 49      552674   pion       2
[0] MPI startup(): 50      552675   pion       4
[0] MPI startup(): 51      552676   pion       6
[0] MPI startup(): 52      552677   pion       8
[0] MPI startup(): 53      552678   pion       10
[0] MPI startup(): 54      552679   pion       12
[0] MPI startup(): 55      552680   pion       14
[0] MPI startup(): 56      552681   pion       16
[0] MPI startup(): 57      552682   pion       18
[0] MPI startup(): 58      552683   pion       20
[0] MPI startup(): 59      552684   pion       22
[0] MPI startup(): 60      552685   pion       24
[0] MPI startup(): 61      552686   pion       26
[0] MPI startup(): 62      552687   pion       28
[0] MPI startup(): 63      552688   pion       30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4

 

3 Replies
AishwaryaCV_Intel
Moderator

Hi,

 

Thank you for posting in Intel communities.

 

We tried a simple hello world program and were able to run it successfully across two nodes.

 

Could you please let us know the following details, so that we can reproduce the issue at our end:

1. OS and CPU details.

2. Sample reproducer and steps to reproduce.

3. Output of the lscpu command (see the example commands below).
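For example, the following commands should capture those details (a minimal sketch; adjust as needed for your distribution):

cat /etc/os-release    # OS name and version
uname -r               # kernel version
lscpu                  # CPU details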

 

Please note that to execute the program on multiple hosts you need to use the '-hosts' flag, and '-ppn' can be used to specify the number of processes per node, as demonstrated below:

 

mpiexec.hydra -n 64 -ppn 2 -hosts localhost,pion-ib -genv I_MPI_DEBUG 5 ./testf90
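For 64 ranks spread evenly across two 32-core nodes, '-ppn 32' would place 32 processes on each host, for example:

mpiexec.hydra -n 64 -ppn 32 -hosts localhost,pion-ib -genv I_MPI_DEBUG 5 ./testf90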

 

 

Thanks and regards,

Aishwarya

 

AishwaryaCV_Intel
Moderator

Hi,


We have not heard back from you. Could you please provide the information we asked for in the previous response?


Thanks and regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks and regards,

Aishwarya

