Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
2159 Discussions

IMB Alltoall hang with Intel Parallel Studio 2018.0.3

4f0drlp7eyj3
Beginner
485 Views

Hi,

   When running IMB Alltoall at 32 ranks/node on 100 nodes, job stalls before printing the 0-byte data. Processes seem to be in sched_yield() when traced. With 2, 4, 8, or 16 ranks/node, job runs fine.

   Cluster is dual-socket Skylake, 18 cores/socket. ibv_devinfo shows as below. Running Centos 7.4. We've been having reproducible trouble with Intel MPI and high rank counts on our system, but are still troubleshooting whether it's a fabric or an MPI issue.

   Job launched with

srun -n 3200 --cpu-bind=verbose --ntasks-per-socket=16 src/IMB-MPI1 -npmin 3200 Alltoall

 

Thanks; Chris

 

[cchang@r4i2n26 ~]$ ibv_devinfo -v hca_id: mlx5_0 transport: InfiniBand (0) fw_ver: 12.21.1000 node_guid: 506b:4b03:002b:e41e sys_image_guid: 506b:4b03:002b:e41e vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: SGI_P0001721_X phys_port_cnt: 1 max_mr_size: 0xffffffffffffffff page_size_cap: 0xfffffffffffff000 max_qp: 262144 max_qp_wr: 32768 device_cap_flags: 0xe17e1c36 BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN XRC Unknown flags: 0xe16e0000 device_cap_exp_flags: 0x5048F8F100000000 EXP_DC_TRANSPORT EXP_CROSS_CHANNEL EXP_MR_ALLOCATE EXT_ATOMICS EXT_SEND NOP EXP_UMR EXP_ODP EXP_RX_CSUM_TCP_UDP_PKT EXP_RX_CSUM_IP_PKT EXP_DC_INFO EXP_MASKED_ATOMICS EXP_RX_TCP_UDP_PKT_TYPE EXP_PHYSICAL_RANGE_MR Unknown flags: 0x200000000000 max_sge: 30 max_sge_rd: 30 max_cq: 16777216 max_cqe: 4194303 max_mr: 16777216 max_pd: 16777216 max_qp_rd_atom: 16 max_ee_rd_atom: 0 max_res_rd_atom: 4194304 max_qp_init_rd_atom: 16 max_ee_init_rd_atom: 0 atomic_cap: ATOMIC_HCA (1) log atomic arg sizes (mask) 0x8 masked_log_atomic_arg_sizes (mask) 0x3c masked_log_atomic_arg_sizes_network_endianness (mask) 0x34 max fetch and add bit boundary 64 log max atomic inline 5 max_ee: 0 max_rdd: 0 max_mw: 16777216 max_raw_ipv6_qp: 0 max_raw_ethy_qp: 0 max_mcast_grp: 2097152 max_mcast_qp_attach: 240 max_total_mcast_qp_attach: 503316480 max_ah: 2147483647 max_fmr: 0 max_srq: 8388608 max_srq_wr: 32767 max_srq_sge: 31 max_pkeys: 128 local_ca_ack_delay: 16 hca_core_clock: 156250 max_klm_list_size: 65536 max_send_wqe_inline_klms: 20 max_umr_recursion_depth: 4 max_umr_stride_dimension: 1 general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT max_size: 0xFFFFFFFFFFFFFFFF rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ uc_odp_caps: NO SUPPORT ud_odp_caps: SUPPORT_SEND dc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ xrc_odp_caps: NO SUPPORT raw_eth_odp_caps: NO SUPPORT max_dct: 262144 max_device_ctx: 1020 Multi-Packet RQ is not supported rx_pad_end_addr_align: 64 tso_caps: max_tso: 0 packet_pacing_caps: qp_rate_limit_min: 0kbps qp_rate_limit_max: 0kbps ooo_caps: ooo_rc_caps = 0x0 ooo_xrc_caps = 0x0 ooo_dc_caps = 0x0 ooo_ud_caps = 0x0 sw_parsing_caps: supported_qp: tag matching not supported tunnel_offloads_caps: Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 2000 port_lmc: 0x00 link_layer: InfiniBand max_msg_sz: 0x40000000 port_cap_flags: 0x2651e848 max_vl_num: 4 (3) bad_pkey_cntr: 0x0 qkey_viol_cntr: 0x0 sm_sl: 0 pkey_tbl_len: 128 gid_tbl_len: 8 subnet_timeout: 18 init_type_reply: 0 active_width: 4X (2) active_speed: 25.0 Gbps (32) phys_state: LINK_UP (5) GID[ 0]: fec0:0000:0000:0000:506b:4b03:002b:e41e

0 Kudos
0 Replies
Reply