brasier__steve
Beginner
348 Views

Intel Cluster Checker OFI problem

Hi, I'm trying to run Intel Cluster Checker (intel-clck-2019.3.5-025) and am getting an error in the hpl_cluster_performance module.

I've installed intel-mpi and intel-mkl, both at version 2019.4-070, and then sourced:

source /opt/intel/compilers_and_libraries_2019.4.243/linux/bin/compilervars.sh intel64
source /opt/intel/compilers_and_libraries_2019.4.243/linux/mkl/bin/mklvars.sh intel64
source /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpivars.sh

(as well as the relevant clckvars.sh)

If I run:

clck -f clck_nodes -l debug -F hpl_cluster_performance &> clck_debug.log

I get this:

<snip>
openhpc-compute-0: [0] MPI startup(): libfabric version: 1.7.2a-impi
openhpc-compute-0:
openhpc-compute-0:
openhpc-compute-0: stderr (540 bytes):
openhpc-compute-0: Abort(1094799) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
openhpc-compute-0: MPIR_Init_thread(666)......:
openhpc-compute-0: MPID_Init(922).............:
openhpc-compute-0: MPIDI_NM_mpi_init_hook(719): OFI addrinfo() failed (ofi_init.h:719:MPIDI_NM_mpi_init_hook:No data available)

<snip>

 

Trying

export FI_PROVIDER=sockets

or

export FI_PROVIDER=tcp

before running clck (as suggested in other threads here) still gives the same error message.
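A minimal sketch for checking what the provider settings look like in the shell that launches clck. It assumes `fi_info` (the libfabric utility) may or may not be on the PATH, and it only reports state; `show_fabric_env` and `list_providers` are hypothetical helper names, not part of Intel MPI or Cluster Checker. Whether clck forwards these variables to the compute nodes is a separate question that this sketch does not answer.

```shell
#!/bin/sh
# Print the libfabric-related variables as seen by this shell.
show_fabric_env() {
    echo "FI_PROVIDER=${FI_PROVIDER:-<unset>}"
    echo "FI_PROVIDER_PATH=${FI_PROVIDER_PATH:-<unset>}"
    echo "I_MPI_ROOT=${I_MPI_ROOT:-<unset>}"
}

# List the providers libfabric can load, if fi_info is installed.
list_providers() {
    if command -v fi_info >/dev/null 2>&1; then
        fi_info -l || echo "fi_info failed with status $?"
    else
        echo "fi_info not found on PATH"
    fi
}

show_fabric_env
list_providers
```

Running the same two functions on each compute node (for example over ssh) would show whether the environment there differs from the node where clck is started.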

Any suggestions, please?

2 Replies
brasier__steve

OK, so some more info: running fi_info returns "fi_getinfo: -61", which seems to mean "No data available" (errno 61, ENODATA), the same message as in the OFI error above.
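For reference, the numeric code can be decoded with a small lookup, sketched below. libfabric returns negated errno values, and 61 is ENODATA on Linux; `fi_errno` and `errno_name` are hypothetical helper names, and the table is deliberately partial.

```shell
#!/bin/sh
# Extract the numeric code from a line such as "fi_getinfo: -61".
fi_errno() {
    printf '%s\n' "$1" | sed -n 's/.*fi_getinfo: -\([0-9][0-9]*\).*/\1/p'
}

# Name a few common Linux errno values (partial table, for illustration).
errno_name() {
    case "$1" in
        2)  echo "ENOENT (No such file or directory)" ;;
        22) echo "EINVAL (Invalid argument)" ;;
        61) echo "ENODATA (No data available)" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

errno_name "$(fi_errno 'fi_getinfo: -61')"
```

For the message quoted above this prints `ENODATA (No data available)`, which typically means fi_getinfo found no provider matching the request.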

brasier__steve

OK, so after downgrading intel-mpi to 2018.4-057 it does now actually run the HPL benchmark, but there's an error getting results back from the compute nodes:

stderr (401 bytes):
openhpc-compute-0: [mpiexec@openhpc-compute-0.novalocal] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
openhpc-compute-0: [mpiexec@openhpc-compute-0.novalocal] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
openhpc-compute-0: openhpc-compute-0.novalocal.17592PSM2 no hfi units are available (err=23)
openhpc-compute-0: openhpc-compute-1.novalocal.25118PSM2 no hfi units are available (err=23)

Googling errors like this suggests network/connectivity problems, and at the start of this provider's output there was this:

openhpc-compute-0: Data will be sent to tcp://10.0.0.19:49152

so I've confirmed that:

- passwordless ssh works in all directions

- netstat shows that port is opened during the run
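The "all directions" ssh check can be sketched as a loop over every ordered pair of nodes. This is only an illustration: `node_pairs`, `check_ssh_pair`, and `check_all_pairs` are hypothetical helper names, and the hostnames are the two that appear in the log above.

```shell
#!/bin/sh
# Print every ordered (source, destination) pair from the node list.
node_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] || echo "$src $dst"
        done
    done
}

# Hop to src, then from src to dst, without ever prompting for a password.
# -n keeps the outer ssh from swallowing the caller's stdin.
check_ssh_pair() {
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" \
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$2" true
}

# Report ok/FAIL for each pair.
check_all_pairs() {
    node_pairs "$@" | while read -r src dst; do
        if check_ssh_pair "$src" "$dst"; then
            echo "ok   $src -> $dst"
        else
            echo "FAIL $src -> $dst"
        fi
    done
}

# Show which pairs would be tested; call check_all_pairs with the same
# arguments to actually run the ssh checks.
node_pairs openhpc-compute-0 openhpc-compute-1
```

BatchMode=yes makes ssh fail instead of prompting, so a pair that needs a password or host-key confirmation shows up as FAIL rather than hanging.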

Any other ideas?

Again, "hfi units" seems to be a fabric error, but this version of Intel MPI does not even include libfabric, so I'm not sure whether or how that's relevant.
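One thing that could be tried here, sketched under assumptions: Intel MPI 2018 reads the I_MPI_FABRICS variable, and setting it to shm:tcp asks for shared memory within a node and TCP between nodes, which should avoid probing for Omni-Path (hfi) hardware. Whether clck passes these variables through to the HPL run is not confirmed, and the PSM2 lines may be only a warning rather than the cause of the write error.

```shell
#!/bin/sh
# Assumption: Intel MPI 2018.x is in use and honors I_MPI_FABRICS.
# shm:tcp = shared memory inside a node, TCP sockets between nodes.
export I_MPI_FABRICS=shm:tcp

# Ask Intel MPI to print startup details, including the fabric it selected.
export I_MPI_DEBUG=5

echo "I_MPI_FABRICS=$I_MPI_FABRICS I_MPI_DEBUG=$I_MPI_DEBUG"
```

If the debug output then reports a TCP data transfer mode and the PSM2 lines disappear, the hfi message was coming from automatic fabric selection.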
