Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel oneAPI HPC MPI issue

Stevec226
Beginner

We have a couple of different versions of the Intel oneAPI HPC Toolkit (Cluster edition) installed on a CentOS 7.x cluster with an InfiniBand network interface. When we attempt to use Intel MPI with more than one node, we receive the following errors:

Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1474):
(unknown)(): Other MPI error


Has anyone else dealt with these errors? What is the solution?

Thanks,

Steve

SantoshY_Intel
Moderator

Hi,

Thank you for posting in the Intel forums.

Could you please provide the details below so that we can investigate your issue further?

  1. The version of Intel MPI you are using. To check it, run:
    mpirun --version
  2. The output of the following commands:
    fi_info -l
    lscpu
  3. A sample reproducer code.
  4. The command you used to launch the MPI program on 2 nodes.
  5. The complete debug log from the command below:
    I_MPI_DEBUG=30 FI_LOG_LEVEL=Debug mpirun -n <number-of-processes> -ppn <processes-per-node> -f <hostfile> ./myprog
  6. Confirmation that you can run the command below without any error:
    mpirun -n 4 -hosts <host1>,<host2> hostname
  7. The result of the command below:
    clck -f <nodefile>
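For convenience, the checklist above can be gathered into a single log by one script. This is a minimal sketch, assuming the oneAPI environment has already been sourced; the HOSTS and HOSTFILE defaults are placeholders to be replaced with real node names:

```shell
#!/bin/sh
# Collect the diagnostic outputs requested above into one log file.
# HOSTS and HOSTFILE are placeholder defaults -- substitute your nodes.
HOSTS=${HOSTS:-host1,host2}
HOSTFILE=${HOSTFILE:-hostfile}
LOG=${LOG:-mpi_diag.log}

run() {
    # Record the command line, then its output (or a note if it is missing).
    echo "### $*" >>"$LOG"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >>"$LOG" 2>&1 || echo "(exited nonzero)" >>"$LOG"
    else
        echo "$1: command not found" >>"$LOG"
    fi
}

: >"$LOG"
run mpirun --version
run fi_info -l
run lscpu
run mpirun -n 4 -hosts "$HOSTS" hostname
run clck -f "$HOSTFILE"
echo "diagnostics written to $LOG"
```

Commands that are absent or fail are noted in the log rather than aborting the script, so one run captures everything at once.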


Thanks & Regards,

Santosh


Stevec226
Beginner

Santosh,


Thanks for the reply. For additional context: we were able to run Intel MPI without problems until we installed the Intel oneAPI versions. We have versions 2021.1.1, 2021.4, and 2022.2, and all of them produce the same error when trying to use multiple nodes. The Intel suite for version 2020 and earlier seems to work properly.

1. % mpirun --version

Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

2.1. % fi_info -l

mlx:
    version: 1.4
psm3:
    version: 1102.0
ofi_rxm:
    version: 113.20
verbs:
    version: 113.20
tcp:
    version: 113.20
sockets:
    version: 113.20
shm:
    version: 114.0
ofi_hook_noop:
    version: 113.20

2.2. % lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-15
Off-line CPU(s) list: 16-31
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
Stepping: 2
CPU MHz: 1200.024
CPU max MHz: 3400.0000
CPU min MHz: 1200.0000
BogoMIPS: 5199.92
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

3. % more helloWorld.f90

program helloWorld
  include "mpif.h"
  integer :: error_code, processor_id, bcast_data
  call mpi_init(error_code)
  call mpi_comm_rank(mpi_comm_world, processor_id, error_code)
  bcast_data = 0 ; if ( 0 == processor_id ) bcast_data = 5
  call mpi_bcast(bcast_data, 1, mpi_integer, 0, mpi_comm_world, error_code)
  write(*,'(i8,a,i2)') processor_id, ' says, "Hello World!" 5 =', bcast_data
  call mpi_finalize(error_code)
end program helloWorld


-- Compile using "mpiifort -o hw helloWorld.f90"


4. We use PBS Pro, so I run "qsub -I -q <queue> -lselect=2".

Once on the node, I run "mpirun -np 32 ./hw".

5. % I_MPI_DEBUG=30 FI_LOG_LEVEL=Debug mpirun -np 16 ./hw

[mpiexec@k3r4n17] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on k3r4n53.ccf-beowulf.ndc.nasa.gov (pid 13004, exit code 256)
[mpiexec@k3r4n17] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@k3r4n17] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@k3r4n17] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@k3r4n17] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@k3r4n17] Possible reasons:
[mpiexec@k3r4n17] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@k3r4n17] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@k3r4n17] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@k3r4n17] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
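Reason 2 above (the bootstrap proxy failing to launch on the remote host) is often a remote-login problem. A minimal sketch of a check, assuming the remote hostname from the log and that passwordless ssh is the intended launcher:

```shell
# Check whether the launching node can reach the remote node without a
# password prompt (BatchMode makes ssh fail instead of prompting).
# k3r4n53 is the remote host named in the error log above.
check_remote() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

REMOTE=${REMOTE:-k3r4n53}
if check_remote "$REMOTE"; then
    echo "passwordless ssh to $REMOTE works"
else
    echo "cannot reach $REMOTE over ssh; fix that first, or select"
    echo "another launcher, e.g.: mpirun -bootstrap ssh -np 32 ./hw"
fi
```

The `-bootstrap` option is the alternative-launcher mechanism that reason 4 in the log itself points at.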

6.  % mpirun -n 4 -hosts k3r4n17,k3r4n53 hostname

k3r4n17
k3r4n17
k3r4n17
k3r4n17

7. % clck -f

bash: clck: command not found
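Separately, the interactive launch described in item 4 can be turned into a batch job. A minimal sketch that writes out a PBS Pro job script (the queue name and select string are placeholders; $PBS_NODEFILE is the node list PBS Pro supplies inside a job, and -ppn/-f pin 16 ranks to each of the two nodes):

```shell
# Write out a hypothetical PBS Pro job script equivalent to the
# interactive qsub/mpirun combination above. Queue name and select
# string are placeholders; adjust them to your site.
cat > run_hw.pbs <<'EOF'
#!/bin/sh
#PBS -q myqueue
#PBS -l select=2:ncpus=16:mpiprocs=16
cd "$PBS_O_WORKDIR"
source /opt/intel/oneapi/setvars.sh
# 32 ranks total, 16 per node, using the node list PBS provides
mpirun -np 32 -ppn 16 -f "$PBS_NODEFILE" ./hw
EOF
echo "submit with: qsub run_hw.pbs"
```

Passing the node file explicitly removes any ambiguity about which hosts the ranks land on, which is relevant given that the hostname test in item 6 only ever reached one node.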

SantoshY_Intel
Moderator

Hi,

Thanks for providing all the details.

>>once on node I run "mpirun -np 32 ./hw"

Please use the command below to run the MPI application on two nodes:

FI_PROVIDER=<provider> mpirun -n 32 -ppn 16 -hosts k3r4n17,k3r4n53 ./hw

Note: set <provider> to mlx, tcp, or psm3. Try these three options one at a time and let us know your findings.

If you face any error after running the above command, please provide the complete debug log from the command below:

I_MPI_DEBUG=30 FI_PROVIDER=<provider> mpirun -v -n 32 -ppn 16 -hosts k3r4n17,k3r4n53 ./hw

Also, please run the command below to test whether a simple hostname runs on the 2 nodes:

mpirun -n 4 -ppn 2 -hosts k3r4n17,k3r4n53 hostname
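The provider-by-provider attempts suggested above can be scripted in one loop. A minimal sketch, using the hostnames, rank counts, and ./hw binary from this thread; it assumes mpirun is on PATH (oneAPI environment sourced) and keeps a per-provider log:

```shell
# Try each libfabric provider in turn, logging each attempt separately.
# Hostnames and the ./hw binary are the ones used earlier in this thread.
for prov in mlx tcp psm3; do
    echo "=== FI_PROVIDER=$prov ==="
    FI_PROVIDER=$prov I_MPI_DEBUG=30 \
        mpirun -n 32 -ppn 16 -hosts k3r4n17,k3r4n53 ./hw \
        >"run_$prov.log" 2>&1 || echo "provider $prov failed (see run_$prov.log)"
done
```

Comparing the three logs shows which fabric providers can initialize at all, which narrows the fault to either the provider stack or the bootstrap step.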


>>"% clck -f

bash: clck: command not found"

1. Initialize the oneAPI environment using the below command:

source /opt/intel/oneapi/setvars.sh

2. Now create a hostfile as below:

$ cat hostfile
k3r4n17
k3r4n53

3. Now run the clck command using the below command:

clck -f hostfile
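Rather than typing the hostfile by hand, inside a PBS job it can be derived from the node list PBS provides. A minimal sketch; the printf stand-in with this thread's node names only fires when the script runs outside a PBS job:

```shell
# Under PBS Pro, $PBS_NODEFILE lists the allocated nodes, one line per
# rank slot; de-duplicating it yields the hostfile that clck -f expects.
# Outside a job, fall back to a stand-in file with this thread's nodes.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf 'k3r4n17\nk3r4n17\nk3r4n53\nk3r4n53\n' > "$PBS_NODEFILE"
fi
sort -u "$PBS_NODEFILE" > hostfile
cat hostfile
```

The resulting hostfile contains each node name exactly once, ready for `clck -f hostfile` or `mpirun -f hostfile`.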

For more information, please refer to the Intel Cluster Checker getting started guide at the link below:

https://www.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/getting...


Thanks & Regards,

Santosh


SantoshY_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide us with any updates on your issue?


Thanks & Regards,

Santosh


SantoshY_Intel
Moderator

Hi,


I assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel. 


Thanks & Regards,

Santosh

