Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

IMB Alltoall hang with Intel Parallel Studio 2018.0.3

4f0drlp7eyj3

Hi,

   When running IMB Alltoall at 32 ranks/node on 100 nodes, the job stalls before printing the 0-byte result. When traced, the processes appear to be in sched_yield(). With 2, 4, 8, or 16 ranks/node, the job runs fine.

   The cluster is dual-socket Skylake with 18 cores/socket, running CentOS 7.4. ibv_devinfo output is shown below. We've been having reproducible trouble with Intel MPI at high rank counts on our system, but are still troubleshooting whether it's a fabric or an MPI issue.

   The job was launched with:

srun -n 3200 --cpu-bind=verbose --ntasks-per-socket=16 src/IMB-MPI1 -npmin 3200 Alltoall
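
For reference, the layout arithmetic behind that launch line can be sketched as below. This is only a sanity check of the numbers quoted in this post (100 nodes, 2 sockets/node, 18 cores/socket, 16 tasks/socket); the variable names are illustrative and not part of Slurm or IMB.

```shell
#!/bin/sh
# Values taken from the post; variable names are hypothetical.
nodes=100
sockets_per_node=2
cores_per_socket=18
tasks_per_socket=16   # matches --ntasks-per-socket=16

# Ranks placed on each node and across the whole job.
ranks_per_node=$((sockets_per_node * tasks_per_socket))
total_ranks=$((nodes * ranks_per_node))

# Cores left without an MPI rank on each node.
idle_cores_per_node=$((sockets_per_node * (cores_per_socket - tasks_per_socket)))

echo "ranks per node:      $ranks_per_node"
echo "total ranks:         $total_ranks"
echo "idle cores per node: $idle_cores_per_node"
```

The total matches the -n 3200 and -npmin 3200 values on the srun line, and each socket keeps 2 cores free of MPI ranks.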

 

Thanks; Chris

 

[cchang@r4i2n26 ~]$ ibv_devinfo -v
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 12.21.1000
    node_guid: 506b:4b03:002b:e41e
    sys_image_guid: 506b:4b03:002b:e41e
    vendor_id: 0x02c9
    vendor_part_id: 4115
    hw_ver: 0x0
    board_id: SGI_P0001721_X
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe17e1c36
        BAD_PKEY_CNTR
        BAD_QKEY_CNTR
        AUTO_PATH_MIG
        CHANGE_PHY_PORT
        PORT_ACTIVE_EVENT
        SYS_IMAGE_GUID
        RC_RNR_NAK_GEN
        XRC
        Unknown flags: 0xe16e0000
    device_cap_exp_flags: 0x5048F8F100000000
        EXP_DC_TRANSPORT
        EXP_CROSS_CHANNEL
        EXP_MR_ALLOCATE
        EXT_ATOMICS
        EXT_SEND NOP
        EXP_UMR
        EXP_ODP
        EXP_RX_CSUM_TCP_UDP_PKT
        EXP_RX_CSUM_IP_PKT
        EXP_DC_INFO
        EXP_MASKED_ATOMICS
        EXP_RX_TCP_UDP_PKT_TYPE
        EXP_PHYSICAL_RANGE_MR
        Unknown flags: 0x200000000000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    log atomic arg sizes (mask) 0x8
    masked_log_atomic_arg_sizes (mask) 0x3c
    masked_log_atomic_arg_sizes_network_endianness (mask) 0x34
    max fetch and add bit boundary 64
    log max atomic inline 5
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    hca_core_clock: 156250
    max_klm_list_size: 65536
    max_send_wqe_inline_klms: 20
    max_umr_recursion_depth: 4
    max_umr_stride_dimension: 1
    general_odp_caps:
        ODP_SUPPORT
        ODP_SUPPORT_IMPLICIT
    max_size: 0xFFFFFFFFFFFFFFFF
    rc_odp_caps:
        SUPPORT_SEND
        SUPPORT_RECV
        SUPPORT_WRITE
        SUPPORT_READ
    uc_odp_caps:
        NO SUPPORT
    ud_odp_caps:
        SUPPORT_SEND
    dc_odp_caps:
        SUPPORT_SEND
        SUPPORT_WRITE
        SUPPORT_READ
    xrc_odp_caps:
        NO SUPPORT
    raw_eth_odp_caps:
        NO SUPPORT
    max_dct: 262144
    max_device_ctx: 1020
    Multi-Packet RQ is not supported
    rx_pad_end_addr_align: 64
    tso_caps:
    max_tso: 0
    packet_pacing_caps:
    qp_rate_limit_min: 0kbps
    qp_rate_limit_max: 0kbps
    ooo_caps:
    ooo_rc_caps = 0x0
    ooo_xrc_caps = 0x0
    ooo_dc_caps = 0x0
    ooo_ud_caps = 0x0
    sw_parsing_caps:
    supported_qp:
    tag matching not supported
    tunnel_offloads_caps:
    Device ports:
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 2000
            port_lmc: 0x00
            link_layer: InfiniBand
            max_msg_sz: 0x40000000
            port_cap_flags: 0x2651e848
            max_vl_num: 4 (3)
            bad_pkey_cntr: 0x0
            qkey_viol_cntr: 0x0
            sm_sl: 0
            pkey_tbl_len: 128
            gid_tbl_len: 8
            subnet_timeout: 18
            init_type_reply: 0
            active_width: 4X (2)
            active_speed: 25.0 Gbps (32)
            phys_state: LINK_UP (5)
            GID[ 0]: fec0:0000:0000:0000:506b:4b03:002b:e41e
