I have a cluster of six Dell R650 nodes connected by InfiniBand. I am running the latest release of the Intel oneAPI Base Toolkit and HPC Toolkit, installed using DNF. A job will not run across two of the nodes, and I would appreciate some help debugging the problem. I have a large code (VASP) compiled with the oneAPI compiler suite; it works fine and passes all of its internal tests. When I run a job on one node (neutrino), the code runs fine. The program also runs fine on the second node in question (pion). The command "mpiexec.hydra -n 64 -hosts neutrino,pion hostname" returns the hostnames for all 64 ranks, as expected. The Intel MPI test code (test.f90) also runs across both nodes. Here is the output:
(pymatgen) paulfons@neutrino:~/test>mpiexec.hydra -n 64 -host localhost,pion-ib -genv I_MPI_DEBUG 5 ./testf90
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 504495 neutrino 0
[0] MPI startup(): 1 504496 neutrino 2
[0] MPI startup(): 2 504497 neutrino 4
[0] MPI startup(): 3 504498 neutrino 6
[0] MPI startup(): 4 504499 neutrino 8
[0] MPI startup(): 5 504500 neutrino 10
[0] MPI startup(): 6 504501 neutrino 12
[0] MPI startup(): 7 504502 neutrino 14
[0] MPI startup(): 8 504503 neutrino 16
[0] MPI startup(): 9 504504 neutrino 18
[0] MPI startup(): 10 504505 neutrino 20
[0] MPI startup(): 11 504506 neutrino 22
[0] MPI startup(): 12 504507 neutrino 24
[0] MPI startup(): 13 504508 neutrino 26
[0] MPI startup(): 14 504509 neutrino 28
[0] MPI startup(): 15 504510 neutrino 30
[0] MPI startup(): 16 504511 neutrino 1
[0] MPI startup(): 17 504512 neutrino 3
[0] MPI startup(): 18 504513 neutrino 5
[0] MPI startup(): 19 504514 neutrino 7
[0] MPI startup(): 20 504515 neutrino 9
[0] MPI startup(): 21 504516 neutrino 11
[0] MPI startup(): 22 504517 neutrino 13
[0] MPI startup(): 23 504518 neutrino 15
[0] MPI startup(): 24 504519 neutrino 17
[0] MPI startup(): 25 504520 neutrino 19
[0] MPI startup(): 26 504521 neutrino 21
[0] MPI startup(): 27 504522 neutrino 23
[0] MPI startup(): 28 504523 neutrino 25
[0] MPI startup(): 29 504524 neutrino 27
[0] MPI startup(): 30 504525 neutrino 29
[0] MPI startup(): 31 504526 neutrino 31
[0] MPI startup(): 32 549528 pion 1
[0] MPI startup(): 33 549529 pion 3
[0] MPI startup(): 34 549530 pion 5
[0] MPI startup(): 35 549531 pion 7
[0] MPI startup(): 36 549532 pion 9
[0] MPI startup(): 37 549533 pion 11
[0] MPI startup(): 38 549534 pion 13
[0] MPI startup(): 39 549535 pion 15
[0] MPI startup(): 40 549536 pion 17
[0] MPI startup(): 41 549537 pion 19
[0] MPI startup(): 42 549538 pion 21
[0] MPI startup(): 43 549539 pion 23
[0] MPI startup(): 44 549540 pion 25
[0] MPI startup(): 45 549541 pion 27
[0] MPI startup(): 46 549542 pion 29
[0] MPI startup(): 47 549543 pion 31
[0] MPI startup(): 48 549544 pion 0
[0] MPI startup(): 49 549545 pion 2
[0] MPI startup(): 50 549546 pion 4
[0] MPI startup(): 51 549547 pion 6
[0] MPI startup(): 52 549548 pion 8
[0] MPI startup(): 53 549549 pion 10
[0] MPI startup(): 54 549550 pion 12
[0] MPI startup(): 55 549551 pion 14
[0] MPI startup(): 56 549552 pion 16
[0] MPI startup(): 57 549553 pion 18
[0] MPI startup(): 58 549554 pion 20
[0] MPI startup(): 59 549555 pion 22
[0] MPI startup(): 60 549556 pion 24
[0] MPI startup(): 61 549557 pion 26
[0] MPI startup(): 62 549558 pion 28
[0] MPI startup(): 63 549559 pion 30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4
Hello world: rank 0 of 64 running on
neutrino
Hello world: rank 1 of 64 running on
neutrino
Hello world: rank 2 of 64 running on
neutrino
skipped lines ...
Hello world: rank 62 of 64 running on
pion
Hello world: rank 63 of 64 running on
pion
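For completeness, the test case was built and launched roughly like this (a sketch of my setup: the setvars.sh path matches my oneAPI install, and test.f90 is the sample program shipped with Intel MPI):
source /opt/intel/oneapi/setvars.sh
mpiifort -o testf90 test.f90      # compile the Intel MPI sample test program
mpiexec.hydra -n 64 -hosts neutrino,pion-ib ./testf90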
The VASP code runs fine on neutrino. Here is the debug output (clipped just after VASP begins its normal output):
(pymatgen) paulfons@neutrino:/data/Vasp/GaAs>mpiexec.hydra -n 32 -genv I_MPI_DEBUG 5 -host pion vasp_ncl
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 552541 pion 1
[0] MPI startup(): 1 552542 pion 3
[0] MPI startup(): 2 552543 pion 5
[0] MPI startup(): 3 552544 pion 7
[0] MPI startup(): 4 552545 pion 9
[0] MPI startup(): 5 552546 pion 11
[0] MPI startup(): 6 552547 pion 13
[0] MPI startup(): 7 552548 pion 15
[0] MPI startup(): 8 552549 pion 17
[0] MPI startup(): 9 552550 pion 19
[0] MPI startup(): 10 552551 pion 21
[0] MPI startup(): 11 552552 pion 23
[0] MPI startup(): 12 552553 pion 25
[0] MPI startup(): 13 552554 pion 27
[0] MPI startup(): 14 552555 pion 29
[0] MPI startup(): 15 552556 pion 31
[0] MPI startup(): 16 552557 pion 0
[0] MPI startup(): 17 552558 pion 2
[0] MPI startup(): 18 552559 pion 4
[0] MPI startup(): 19 552560 pion 6
[0] MPI startup(): 20 552561 pion 8
[0] MPI startup(): 21 552562 pion 10
[0] MPI startup(): 22 552563 pion 12
[0] MPI startup(): 23 552564 pion 14
[0] MPI startup(): 24 552565 pion 16
[0] MPI startup(): 25 552566 pion 18
[0] MPI startup(): 26 552567 pion 20
[0] MPI startup(): 27 552568 pion 22
[0] MPI startup(): 28 552569 pion 24
[0] MPI startup(): 29 552570 pion 26
[0] MPI startup(): 30 552571 pion 28
[0] MPI startup(): 31 552572 pion 30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=1
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4
running 32 mpi-ranks, on 1 nodes
distrk: each k-point on 2 cores, 16 groups
distr: one band on 1 cores, 2 groups
vasp.6.4.2 20Jul23 (build Oct 13 2023 16:03:31) complex
POSCAR found type information on POSCAR GaAs
POSCAR found : 2 types and 2 ions
The code also runs fine on pion when launched remotely from neutrino:
(pymatgen) paulfons@neutrino:/data/Vasp/GaAs>mpiexec.hydra -n 32 -genv I_MPI_DEBUG 5 -host pion vasp_ncl
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_icx_shm-ofi_mlx.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 552415 pion 1
[0] MPI startup(): 1 552416 pion 3
[0] MPI startup(): 2 552417 pion 5
[0] MPI startup(): 3 552418 pion 7
[0] MPI startup(): 4 552419 pion 9
[0] MPI startup(): 5 552420 pion 11
[0] MPI startup(): 6 552421 pion 13
[0] MPI startup(): 7 552422 pion 15
[0] MPI startup(): 8 552423 pion 17
[0] MPI startup(): 9 552424 pion 19
[0] MPI startup(): 10 552425 pion 21
[0] MPI startup(): 11 552426 pion 23
[0] MPI startup(): 12 552427 pion 25
[0] MPI startup(): 13 552428 pion 27
[0] MPI startup(): 14 552429 pion 29
[0] MPI startup(): 15 552430 pion 31
[0] MPI startup(): 16 552431 pion 0
[0] MPI startup(): 17 552432 pion 2
[0] MPI startup(): 18 552433 pion 4
[0] MPI startup(): 19 552434 pion 6
[0] MPI startup(): 20 552435 pion 8
[0] MPI startup(): 21 552436 pion 10
[0] MPI startup(): 22 552437 pion 12
[0] MPI startup(): 23 552438 pion 14
[0] MPI startup(): 24 552439 pion 16
[0] MPI startup(): 25 552440 pion 18
[0] MPI startup(): 26 552441 pion 20
[0] MPI startup(): 27 552442 pion 22
[0] MPI startup(): 28 552443 pion 24
[0] MPI startup(): 29 552444 pion 26
[0] MPI startup(): 30 552445 pion 28
[0] MPI startup(): 31 552446 pion 30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=1
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4
running 32 mpi-ranks, on 1 nodes
distrk: each k-point on 2 cores, 16 groups
distr: one band on 1 cores, 2 groups
vasp.6.4.2 20Jul23 (build Oct 13 2023 16:03:31) complex
POSCAR found type information on POSCAR GaAs
POSCAR found : 2 types and 2 ions
When I try to run the code on both nodes using "mpiexec.hydra -n 64 -hosts localhost,pion vasp_ncl", the code apparently gets stuck. I can see 32 processes running on each of neutrino and pion, but the code seems to hang. Any idea what the cause might be and how to address it? Note that the same code runs fine between other pairs of nodes. In addition, the identical Intel oneAPI stack was installed on every node immediately after the nodes were updated last week. This is a really strange problem and has me baffled.
(pymatgen) paulfons@neutrino:/data/Vasp/GaAs>mpiexec.hydra -n 64 -genv I_MPI_DEBUG 5 -host neutrino,pion vasp_ncl
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi_mlx_56.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.10.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 505697 neutrino 0
[0] MPI startup(): 1 505698 neutrino 2
[0] MPI startup(): 2 505699 neutrino 4
[0] MPI startup(): 3 505700 neutrino 6
[0] MPI startup(): 4 505701 neutrino 8
[0] MPI startup(): 5 505702 neutrino 10
[0] MPI startup(): 6 505703 neutrino 12
[0] MPI startup(): 7 505704 neutrino 14
[0] MPI startup(): 8 505705 neutrino 16
[0] MPI startup(): 9 505706 neutrino 18
[0] MPI startup(): 10 505707 neutrino 20
[0] MPI startup(): 11 505708 neutrino 22
[0] MPI startup(): 12 505709 neutrino 24
[0] MPI startup(): 13 505710 neutrino 26
[0] MPI startup(): 14 505711 neutrino 28
[0] MPI startup(): 15 505712 neutrino 30
[0] MPI startup(): 16 505713 neutrino 1
[0] MPI startup(): 17 505714 neutrino 3
[0] MPI startup(): 18 505715 neutrino 5
[0] MPI startup(): 19 505716 neutrino 7
[0] MPI startup(): 20 505717 neutrino 9
[0] MPI startup(): 21 505718 neutrino 11
[0] MPI startup(): 22 505719 neutrino 13
[0] MPI startup(): 23 505720 neutrino 15
[0] MPI startup(): 24 505721 neutrino 17
[0] MPI startup(): 25 505722 neutrino 19
[0] MPI startup(): 26 505723 neutrino 21
[0] MPI startup(): 27 505724 neutrino 23
[0] MPI startup(): 28 505725 neutrino 25
[0] MPI startup(): 29 505726 neutrino 27
[0] MPI startup(): 30 505727 neutrino 29
[0] MPI startup(): 31 505728 neutrino 31
[0] MPI startup(): 32 552657 pion 1
[0] MPI startup(): 33 552658 pion 3
[0] MPI startup(): 34 552659 pion 5
[0] MPI startup(): 35 552660 pion 7
[0] MPI startup(): 36 552661 pion 9
[0] MPI startup(): 37 552662 pion 11
[0] MPI startup(): 38 552663 pion 13
[0] MPI startup(): 39 552664 pion 15
[0] MPI startup(): 40 552665 pion 17
[0] MPI startup(): 41 552666 pion 19
[0] MPI startup(): 42 552667 pion 21
[0] MPI startup(): 43 552668 pion 23
[0] MPI startup(): 44 552669 pion 25
[0] MPI startup(): 45 552670 pion 27
[0] MPI startup(): 46 552671 pion 29
[0] MPI startup(): 47 552672 pion 31
[0] MPI startup(): 48 552673 pion 0
[0] MPI startup(): 49 552674 pion 2
[0] MPI startup(): 50 552675 pion 4
[0] MPI startup(): 51 552676 pion 6
[0] MPI startup(): 52 552677 pion 8
[0] MPI startup(): 53 552678 pion 10
[0] MPI startup(): 54 552679 pion 12
[0] MPI startup(): 55 552680 pion 14
[0] MPI startup(): 56 552681 pion 16
[0] MPI startup(): 57 552682 pion 18
[0] MPI startup(): 58 552683 pion 20
[0] MPI startup(): 59 552684 pion 22
[0] MPI startup(): 60 552685 pion 24
[0] MPI startup(): 61 552686 pion 26
[0] MPI startup(): 62 552687 pion 28
[0] MPI startup(): 63 552688 pion 30
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.10.0
[0] MPI startup(): ONEAPI_ROOT=/opt/intel/oneapi
[0] MPI startup(): I_MPI_BIND_WIN_ALLOCATE=localalloc
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_RETURN_WIN_MEM_NUMA=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_COMPATIBILITY=4
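In case it helps, these are the additional checks I can run on the two nodes to narrow this down. This is only a rough sketch: the ibstat/ibv_devinfo tools are from rdma-core, and the I_MPI_HYDRA_DEBUG, FI_LOG_LEVEL, and FI_PROVIDER variables are taken from the Intel MPI and libfabric documentation as I understand them, so treat them as suggestions rather than a verified recipe.
# Verify the InfiniBand link state on both nodes (standard rdma-core tools)
ibstat
ibv_devinfo
# Re-run the two-node case with more verbose launcher and fabric logging
I_MPI_HYDRA_DEBUG=1 FI_LOG_LEVEL=debug mpiexec.hydra -n 64 -ppn 32 -hosts neutrino,pion-ib -genv I_MPI_DEBUG 10 vasp_ncl
# Check whether the hang is specific to the mlx/InfiniBand path by forcing the TCP provider
mpiexec.hydra -n 64 -ppn 32 -hosts neutrino,pion-ib -genv FI_PROVIDER tcp vasp_ncl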
Hi,
Thank you for posting in Intel communities.
We tried a simple hello world program and were able to run it successfully across two nodes.
Could you please share the following details so that we can reproduce the issue on our end:
1. OS and CPU details.
2. A sample reproducer and the steps to reproduce.
3. The output of the lscpu command.
Please note that to execute the program on multiple hosts you need to use the '-hosts' flag, and '-ppn' can be used to specify the number of processes per node, as demonstrated below:
mpiexec.hydra -n 64 -ppn 2 -hosts localhost,pion-ib -genv I_MPI_DEBUG 5 ./testf90
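For instance, to reproduce the 32-ranks-per-node layout shown in your logs across the two nodes, the launch would look like:
mpiexec.hydra -n 64 -ppn 32 -hosts neutrino,pion-ib -genv I_MPI_DEBUG 5 ./testf90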
Thanks and regards,
Aishwarya
Hi,
We have not heard back from you. Could you please provide the information requested in our previous response?
Thanks and regards,
Aishwarya
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks and regards,
Aishwarya