Hi, I'm trying to run Intel Cluster Checker (intel-clck-2019.3.5-025) and am getting an error in the hpl_cluster_performance module.
I've installed intel-mpi and intel-mkl, both at version 2019.4-070, and then sourced:
source /opt/intel/compilers_and_libraries_2019.4.243/linux/bin/compilervars.sh intel64
source /opt/intel/compilers_and_libraries_2019.4.243/linux/mkl/bin/mklvars.sh intel64
source /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpivars.sh
(as well as the relevant clckvars.sh)
If I run:
clck -f clck_nodes -l debug -F hpl_cluster_performance &> clck_debug.log
I get this:
<snip>
openhpc-compute-0:  MPI startup(): libfabric version: 1.7.2a-impi
openhpc-compute-0:
openhpc-compute-0:
openhpc-compute-0: stderr (540 bytes):
openhpc-compute-0: Abort(1094799) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
openhpc-compute-0: MPIR_Init_thread(666)......:
openhpc-compute-0: MPID_Init(922).............:
openhpc-compute-0: MPIDI_NM_mpi_init_hook(719): OFI addrinfo() failed (ofi_init.h:719:MPIDI_NM_mpi_init_hook:No data available)
<snip>
Applying the workaround suggested in other threads here before running clck still gives the same error message.
Any suggestions, please?
OK, so after downgrading intel-mpi to 2018.4-057 it does now actually run the HPL benchmark, but there's an error getting results back from the compute nodes:
stderr (401 bytes):
openhpc-compute-0: [firstname.lastname@example.org] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
openhpc-compute-0: [email@example.com] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
openhpc-compute-0: openhpc-compute-0.novalocal.17592PSM2 no hfi units are available (err=23)
openhpc-compute-0: openhpc-compute-1.novalocal.25118PSM2 no hfi units are available (err=23)
Googling errors like this suggests network/connectivity problems, and at the start of this provider's output there was this:
openhpc-compute-0: Data will be sent to tcp://10.0.0.19:49152
so I've confirmed that:
- passwordless ssh works in all directions
- netstat shows that port is opened during the run
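In case it helps anyone reproduce the ssh check, it can be scripted roughly as below. This is only a sketch: check_nodes and SSH_CMD are made-up names for illustration, and it assumes the nodefile (clck_nodes here) has one hostname per line, optionally followed by a "#" annotation. It only tests from the host it is run on, so it has to be run on each node to cover all directions.

```shell
#!/bin/sh
# Hypothetical helper: verify passwordless ssh from this host to every node
# listed in a nodefile (one hostname per line; blank lines and lines
# starting with '#' are skipped, trailing annotations are ignored).
# SSH_CMD can be overridden, e.g. to dry-run the loop without a cluster.
SSH_CMD="${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

check_nodes() {
    nodefile="$1"
    failed=0
    while read -r node _rest; do
        case "$node" in
            ''|'#'*) continue ;;
        esac
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # stdin is redirected so ssh does not swallow the rest of the nodefile.
        if $SSH_CMD "$node" true </dev/null >/dev/null 2>&1; then
            echo "ok   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done < "$nodefile"
    return "$failed"
}
```

Usage would be something like "check_nodes clck_nodes"; a non-zero return means at least one node could not be reached without a password prompt.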
Any other ideas?
Again, "hfi units" seems to be a fabric error, but this version of Intel MPI does not even include libfabric, so I'm not sure whether or how that's relevant.