Jeon__ByoungSeon

Intel MPI at 4000 ranks

Hi, we're testing Intel MPI on CentOS 7.5 with InfiniBand connections.

Using the Intel MPI Benchmarks, small-scale tests (10 nodes, 400 MPI ranks) look OK, while a 100-node (4000 ranks) job crashes. Setting FI_LOG_LEVEL=debug yielded the following messages:

libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:verbs:fabric:fi_ibv_set_default_attr():1085<info> Ignoring provider default value for tx rma_iov_limit as it is greater than the value supported by domain: mlx5_0
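
For reference, the numeric codes in the log lines above can be decoded with a short script. This is only a sketch: it assumes the codes are standard Linux errno values and that the negative prov_err is a negated errno, which the log does not confirm. The helper name decode_errno is made up for illustration and should be run on the Linux cluster itself, since errno numbering differs between platforms.

```python
import errno
import os


def decode_errno(code):
    """Return (symbolic name, message) for an errno value.

    The sign is ignored, on the assumption that a negative code
    such as prov_err -28 is a negated errno.
    """
    n = abs(code)
    name = errno.errorcode.get(n, "UNKNOWN")
    return name, os.strerror(n)


# Codes copied from the log lines above.
codes = [
    ("rdma_create_ep", 22),
    ("fi_eq_readerr err", 111),
    ("fi_eq_readerr prov_err", -28),
]

for label, code in codes:
    name, message = decode_errno(code)
    print(f"{label}: {code} -> {name} ({message})")
```

If the decoded names point at connection setup or resource exhaustion, that may narrow down where to look, but it does not by itself identify the cause.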

Is there any way to trace the cause of these issues? Any comments are appreciated.

Thanks,

BJ
