Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

intel mpi at 4000 ranks

Jeon__ByoungSeon
Beginner

Hi, we're testing Intel MPI on CentOS 7.5 with InfiniBand interconnects.

Using the Intel MPI Benchmarks, small-scale tests (10 nodes, 400 MPI ranks) look OK, while a 100-node (4,000-rank) job crashes. Running with FI_LOG_LEVEL=debug yielded the following messages:

libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:verbs:fabric:fi_ibv_set_default_attr():1085<info> Ignoring provider default value for tx rma_iov_limit as it is greater than the value supported by domain: mlx5_0
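For reference, the failing run was launched roughly as sketched below (the hostfile path, per-node rank count, and benchmark selection are placeholders, not the exact command we used). FI_LOG_LEVEL, I_MPI_DEBUG, and FI_PROVIDER are standard libfabric / Intel MPI environment variables:

```shell
# Sketch of the failing launch; hostfile and rank counts are placeholders.
export FI_LOG_LEVEL=debug    # verbose libfabric logging (source of the messages above)
export I_MPI_DEBUG=5         # Intel MPI debug output: provider chosen, rank pinning, etc.
# export FI_PROVIDER=verbs   # optionally pin the provider to isolate ofi_rxm behavior

# 100 nodes x 40 ranks per node = 4000 ranks
mpirun -n 4000 -ppn 40 -hostfile ./hosts ./IMB-MPI1 Alltoall
```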

Is there any way to trace the cause of this issue? Any comments are appreciated.

Thanks,

BJ
