Hello, after upgrading from oneAPI 2021.2.0 to 2021.3.0 I started receiving this error message when running the Intel Optimized HPCG benchmark:
n01:rank471.xhpcg_avx2: unknown link width 0x10
n26:rank201.xhpcg_avx2: unknown link width 0x10
n04:rank489.xhpcg_avx2: unknown link width 0x10
n03:rank480.xhpcg_avx2: unknown link width 0x10
n26:rank202.xhpcg_avx2: unknown link width 0x10
n23:rank481.xhpcg_avx2: unknown link width 0x10
n43:rank338.xhpcg_avx2: unknown link width 0x10
n42:rank337.xhpcg_avx2: unknown link width 0x10
n4:rank490.xhpcg_avx2: unknown link width 0x10
n26:rank204.xhpcg_avx2: unknown link width 0x10
I've noticed that the libfabric provider has changed:
[0] MPI startup(): libfabric provider: psm3
That is probably related, but I would like to understand what the link width error means.
The fabric is based on Mellanox ConnectX-6 adapters running at 100 Gbps HDR, using the inbox OFED rather than MLNX_OFED.
Thanks.
Hi,
Thanks for reaching out to us.
Could you please confirm your environment details, such as the OS and the OFI provider (mlx, psm2, or psm3) in use when the issue occurs?
Thanks & Regards,
Santosh
Hi, sorry for the delayed answer. The email notification went to the spam folder.
Regarding the questions, this issue only happens when PSM3 is used. I've already tried all supported modes like PSM2, MLX and VERBS. It only happens with PSM3.
Hi,
We haven't heard back from you. Is your issue resolved? If not, could you please confirm your environment details, such as the OS and the OFI provider (mlx, psm2, or psm3) in use when the issue occurs?
Thanks & regards,
Santosh
Please can you test I_MPI_OFI_PRVIDER=mlx or =verbs
Please provide the output of:
fi_info -l
ucx_info -f (if available)
Sure, here it is. Note that ucx_info is provided by the OpenHPC package.
[root@adano31 ~]# fi_info -l
psm2:
    version: 112.10
mlx:
    version: 1.4
psm3:
    version: 111.20
ofi_rxm:
    version: 111.0
verbs:
    version: 112.10
tcp:
    version: 111.0
sockets:
    version: 112.10
shm:
    version: 112.10
ofi_hook_noop:
    version: 112.10
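For a quick check of which providers and versions a node exposes, the listing above can be collapsed to one line per provider. This is only a sketch: `list_providers` is a hypothetical helper name, and it assumes the `name:` / `version:` layout shown in this post. The sample input is copied from the first three entries above.

```shell
# Collapse `fi_info -l` style output into "provider version" pairs.
# list_providers is a made-up helper name, not part of libfabric.
list_providers() {
    awk '
        # A line such as "psm3:" starts a new provider entry.
        /^[^[:space:]].*:$/ { name = $1; sub(/:$/, "", name); next }
        # A "version: X" line belongs to the most recent provider.
        $1 == "version:"    { print name, $2 }
    '
}

# Sample input copied from the listing above (first three providers only).
list_providers <<'EOF'
psm2:
    version: 112.10
mlx:
    version: 1.4
psm3:
    version: 111.20
EOF
```

On a real node the same function could be fed directly with `fi_info -l | list_providers`.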
[root@adano31 ~]# ucx_info -f | grep -v \#
UCX_LOG_LEVEL=WARN
UCX_LOG_FILE=
UCX_LOG_FILE_SIZE=inf
UCX_LOG_FILE_ROTATE=0
UCX_LOG_BUFFER=1K
UCX_LOG_DATA_SIZE=0
UCX_LOG_PRINT_ENABLE=n
UCX_HANDLE_ERRORS=bt
UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE
UCX_ERROR_MAIL_TO=
UCX_ERROR_MAIL_FOOTER=
UCX_GDB_COMMAND=gdb -quiet
UCX_DEBUG_SIGNO=HUP
UCX_LOG_LEVEL_TRIGGER=FATAL
UCX_WARN_UNUSED_ENV_VARS=n
UCX_ASYNC_MAX_EVENTS=1024
UCX_ASYNC_SIGNO=ALRM
UCX_PROFILE_MODE=
UCX_PROFILE_FILE=ucx_%h_%p.prof
UCX_PROFILE_LOG_SIZE=4M
UCX_RCACHE_CHECK_PFN=0
UCX_MODULE_DIR=/opt/ohpc/pub/mpi/ucx-ohpc/1.9.0/lib/ucx
UCX_MODULE_LOG_LEVEL=TRACE
UCX_BUILTIN_MEMCPY_MIN=auto
UCX_BUILTIN_MEMCPY_MAX=auto
UCX_MEM_LOG_LEVEL=WARN
UCX_MEM_ALLOC_ALIGN=16
UCX_MEM_EVENTS=y
UCX_MEM_MMAP_HOOK_MODE=bistro
UCX_MEM_MALLOC_HOOKS=y
UCX_MEM_MALLOC_RELOC=y
UCX_MEM_CUDA_RELOC=y
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_MEM_DLOPEN_PROCESS_RPATH=y
UCX_POSIX_HUGETLB_MODE=try
UCX_POSIX_DIR=/dev/shm
UCX_POSIX_USE_PROC_LINK=y
UCX_MM_ALLOC=md,mmap,heap
UCX_MM_FAILURE=ERROR
UCX_MM_MAX_NUM_EPS=inf
UCX_MM_BW=12179.00MBps
UCX_MM_FIFO_SIZE=64
UCX_MM_SEG_SIZE=8256
UCX_MM_FIFO_RELEASE_FACTOR=0.500
UCX_MM_RX_MAX_BUFS=-1
UCX_MM_RX_BUFS_GROW=512
UCX_MM_FIFO_HUGETLB=n
UCX_MM_FIFO_ELEM_SIZE=128
UCX_MM_FIFO_MAX_POLL=16
UCX_SYSV_HUGETLB_MODE=try
UCX_MM_ALLOC=md,mmap,heap
UCX_MM_FAILURE=ERROR
UCX_MM_MAX_NUM_EPS=inf
UCX_MM_BW=12179.00MBps
UCX_MM_FIFO_SIZE=64
UCX_MM_SEG_SIZE=8256
UCX_MM_FIFO_RELEASE_FACTOR=0.500
UCX_MM_RX_MAX_BUFS=-1
UCX_MM_RX_BUFS_GROW=512
UCX_MM_FIFO_HUGETLB=n
UCX_MM_FIFO_ELEM_SIZE=128
UCX_MM_FIFO_MAX_POLL=16
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_SELF_FAILURE=ERROR
UCX_SELF_MAX_NUM_EPS=inf
UCX_SELF_SEG_SIZE=8K
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_FAILURE=ERROR
UCX_TCP_MAX_NUM_EPS=256
UCX_TCP_TX_SEG_SIZE=8K
UCX_TCP_RX_SEG_SIZE=64K
UCX_TCP_MAX_IOV=6
UCX_TCP_SENDV_THRESH=2K
UCX_TCP_PREFER_DEFAULT=y
UCX_TCP_PUT_ENABLE=y
UCX_TCP_CONN_NB=n
UCX_TCP_MAX_POLL=16
UCX_TCP_MAX_CONN_RETRIES=25
UCX_TCP_NODELAY=y
UCX_TCP_SNDBUF=auto
UCX_TCP_RCVBUF=auto
UCX_TCP_SYN_CNT=auto
UCX_TCP_TX_MAX_BUFS=-1
UCX_TCP_TX_BUFS_GROW=8
UCX_TCP_RX_MAX_BUFS=-1
UCX_TCP_RX_BUFS_GROW=8
UCX_TCP_CM_PRIV_DATA_LEN=2K
UCX_TCP_CM_SNDBUF=auto
UCX_TCP_CM_RCVBUF=auto
UCX_TCP_CM_SYN_CNT=auto
UCX_SOCKCM_ALLOC=huge,thp,md,mmap,heap
UCX_SOCKCM_FAILURE=ERROR
UCX_SOCKCM_MAX_NUM_EPS=inf
UCX_SOCKCM_BACKLOG=1024
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_TLS_PRIORITY=rdmacm,sockcm
UCX_SOCKADDR_AUX_TLS=ud
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MULTI_LANE_MAX_RATIO=10.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=2
UCX_RNDV_SCHEME=auto
UCX_RKEY_PTR_SEG_SIZE=512K
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=auto
UCX_ATOMIC_MODE=guess
UCX_ADDRESS_DEBUG_INFO=n
UCX_MAX_WORKER_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8K
UCX_TM_THRESH=1K
UCX_TM_MAX_BB_SIZE=1K
UCX_TM_FORCE_THRESH=8K
UCX_TM_SW_RNDV=n
UCX_NUM_EPS=auto
UCX_NUM_PPN=auto
UCX_RNDV_FRAG_SIZE=512K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
UCX_SOCKADDR_CM_ENABLE=n
UCX_PROTO_ENABLE=n
UCX_IB_REG_METHODS=rcache,odp,direct
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=0.18us
UCX_IB_RCACHE_ADDR_ALIGN=16
UCX_IB_MEM_REG_OVERHEAD=16.00us
UCX_IB_MEM_REG_GROWTH=0.00us
UCX_IB_FORK_INIT=try
UCX_IB_ASYNC_EVENTS=n
UCX_IB_ETH_PAUSE_ON=y
UCX_IB_ODP_NUMA_POLICY=preferred
UCX_IB_ODP_PREFETCH=n
UCX_IB_ODP_MAX_SIZE=auto
UCX_IB_DEVICE_SPECS=
UCX_IB_PREFER_NEAREST_DEVICE=y
UCX_IB_INDIRECT_ATOMIC=y
UCX_IB_GID_INDEX=auto
UCX_IB_SUBNET_PREFIX=
UCX_IB_GPU_DIRECT_RDMA=try
UCX_IB_PCI_BW=
UCX_IB_MLX5_DEVX=try
UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq
UCX_IB_REG_MT_THRESH=4G
UCX_IB_REG_MT_CHUNK=2G
UCX_IB_REG_MT_BIND=n
UCX_IB_PCI_RELAXED_ORDERING=auto
UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_RC_VERBS_FAILURE=ERROR
UCX_RC_VERBS_MAX_NUM_EPS=256
UCX_RC_VERBS_SEG_SIZE=8256
UCX_RC_VERBS_TX_QUEUE_LEN=256
UCX_RC_VERBS_TX_MAX_BATCH=16
UCX_RC_VERBS_TX_MAX_POLL=16
UCX_RC_VERBS_TX_MIN_INLINE=64
UCX_RC_VERBS_TX_INLINE_RESP=64
UCX_RC_VERBS_TX_MIN_SGE=3
UCX_RC_VERBS_TX_MAX_BUFS=-1
UCX_RC_VERBS_TX_BUFS_GROW=1024
UCX_RC_VERBS_RX_QUEUE_LEN=4095
UCX_RC_VERBS_RX_MAX_BATCH=16
UCX_RC_VERBS_RX_MAX_POLL=16
UCX_RC_VERBS_RX_INLINE=64
UCX_RC_VERBS_RX_MAX_BUFS=-1
UCX_RC_VERBS_RX_BUFS_GROW=0
UCX_RC_VERBS_ADDR_TYPE=auto
UCX_RC_VERBS_IS_GLOBAL=n
UCX_RC_VERBS_SL=0
UCX_RC_VERBS_TRAFFIC_CLASS=auto
UCX_RC_VERBS_HOP_LIMIT=255
UCX_RC_VERBS_NUM_PATHS=auto
UCX_RC_VERBS_ROCE_PATH_FACTOR=1
UCX_RC_VERBS_LID_PATH_BITS=0
UCX_RC_VERBS_PKEY=auto
UCX_RC_VERBS_PATH_MTU=default
UCX_RC_VERBS_ENABLE_CUDA_AFFINITY=y
UCX_RC_VERBS_MAX_RD_ATOMIC=4
UCX_RC_VERBS_TIMEOUT=1000000.00us
UCX_RC_VERBS_RETRY_COUNT=7
UCX_RC_VERBS_RNR_TIMEOUT=1000.00us
UCX_RC_VERBS_RNR_RETRY_COUNT=7
UCX_RC_VERBS_FC_ENABLE=y
UCX_RC_VERBS_FC_WND_SIZE=512
UCX_RC_VERBS_FC_HARD_THRESH=0.250
UCX_RC_VERBS_FENCE=auto
UCX_RC_VERBS_MAX_GET_ZCOPY=auto
UCX_RC_VERBS_TX_NUM_GET_BYTES=inf
UCX_RC_VERBS_FC_SOFT_THRESH=0.500
UCX_RC_VERBS_TX_CQ_MODERATION=64
UCX_RC_VERBS_TX_CQ_LEN=4096
UCX_RC_VERBS_MAX_AM_HDR=128
UCX_RC_VERBS_TX_MAX_WR=inf
UCX_RC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_RC_MLX5_FAILURE=ERROR
UCX_RC_MLX5_MAX_NUM_EPS=256
UCX_RC_MLX5_SEG_SIZE=8256
UCX_RC_MLX5_TX_QUEUE_LEN=256
UCX_RC_MLX5_TX_MAX_BATCH=16
UCX_RC_MLX5_TX_MAX_POLL=16
UCX_RC_MLX5_TX_MIN_INLINE=64
UCX_RC_MLX5_TX_INLINE_RESP=64
UCX_RC_MLX5_TX_MIN_SGE=3
UCX_RC_MLX5_TX_MAX_BUFS=-1
UCX_RC_MLX5_TX_BUFS_GROW=1024
UCX_RC_MLX5_RX_QUEUE_LEN=4095
UCX_RC_MLX5_RX_MAX_BATCH=16
UCX_RC_MLX5_RX_MAX_POLL=16
UCX_RC_MLX5_RX_INLINE=64
UCX_RC_MLX5_RX_MAX_BUFS=-1
UCX_RC_MLX5_RX_BUFS_GROW=0
UCX_RC_MLX5_ADDR_TYPE=auto
UCX_RC_MLX5_IS_GLOBAL=n
UCX_RC_MLX5_SL=0
UCX_RC_MLX5_TRAFFIC_CLASS=auto
UCX_RC_MLX5_HOP_LIMIT=255
UCX_RC_MLX5_NUM_PATHS=auto
UCX_RC_MLX5_ROCE_PATH_FACTOR=1
UCX_RC_MLX5_LID_PATH_BITS=0
UCX_RC_MLX5_PKEY=auto
UCX_RC_MLX5_PATH_MTU=default
UCX_RC_MLX5_ENABLE_CUDA_AFFINITY=y
UCX_RC_MLX5_MAX_RD_ATOMIC=4
UCX_RC_MLX5_TIMEOUT=1000000.00us
UCX_RC_MLX5_RETRY_COUNT=7
UCX_RC_MLX5_RNR_TIMEOUT=1000.00us
UCX_RC_MLX5_RNR_RETRY_COUNT=7
UCX_RC_MLX5_FC_ENABLE=y
UCX_RC_MLX5_FC_WND_SIZE=512
UCX_RC_MLX5_FC_HARD_THRESH=0.250
UCX_RC_MLX5_FENCE=auto
UCX_RC_MLX5_MAX_GET_ZCOPY=auto
UCX_RC_MLX5_TX_NUM_GET_BYTES=inf
UCX_RC_MLX5_FC_SOFT_THRESH=0.500
UCX_RC_MLX5_TX_CQ_MODERATION=64
UCX_RC_MLX5_TX_CQ_LEN=4096
UCX_RC_MLX5_DM_SIZE=2K
UCX_RC_MLX5_DM_COUNT=1
UCX_RC_MLX5_MMIO_MODE=auto
UCX_RC_MLX5_TX_MAX_BB=inf
UCX_RC_MLX5_TM_ENABLE=n
UCX_RC_MLX5_TM_LIST_SIZE=1024
UCX_RC_MLX5_TM_SEG_SIZE=48K
UCX_RC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_RC_MLX5_TM_MP_NUM_STRIDES=8
UCX_RC_MLX5_EXP_BACKOFF=0
UCX_RC_MLX5_CYCLIC_SRQ_ENABLE=try
UCX_DC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_DC_MLX5_FAILURE=ERROR
UCX_DC_MLX5_MAX_NUM_EPS=inf
UCX_DC_MLX5_SEG_SIZE=8256
UCX_DC_MLX5_TX_QUEUE_LEN=128
UCX_DC_MLX5_TX_MAX_BATCH=16
UCX_DC_MLX5_TX_MAX_POLL=16
UCX_DC_MLX5_TX_MIN_INLINE=64
UCX_DC_MLX5_TX_INLINE_RESP=64
UCX_DC_MLX5_TX_MIN_SGE=3
UCX_DC_MLX5_TX_MAX_BUFS=-1
UCX_DC_MLX5_TX_BUFS_GROW=1024
UCX_DC_MLX5_RX_QUEUE_LEN=4095
UCX_DC_MLX5_RX_MAX_BATCH=16
UCX_DC_MLX5_RX_MAX_POLL=16
UCX_DC_MLX5_RX_INLINE=64
UCX_DC_MLX5_RX_MAX_BUFS=-1
UCX_DC_MLX5_RX_BUFS_GROW=0
UCX_DC_MLX5_ADDR_TYPE=auto
UCX_DC_MLX5_IS_GLOBAL=n
UCX_DC_MLX5_SL=0
UCX_DC_MLX5_TRAFFIC_CLASS=auto
UCX_DC_MLX5_HOP_LIMIT=255
UCX_DC_MLX5_NUM_PATHS=auto
UCX_DC_MLX5_ROCE_PATH_FACTOR=1
UCX_DC_MLX5_LID_PATH_BITS=0
UCX_DC_MLX5_PKEY=auto
UCX_DC_MLX5_PATH_MTU=default
UCX_DC_MLX5_ENABLE_CUDA_AFFINITY=y
UCX_DC_MLX5_MAX_RD_ATOMIC=4
UCX_DC_MLX5_TIMEOUT=1000000.00us
UCX_DC_MLX5_RETRY_COUNT=7
UCX_DC_MLX5_RNR_TIMEOUT=1000.00us
UCX_DC_MLX5_RNR_RETRY_COUNT=7
UCX_DC_MLX5_FC_ENABLE=y
UCX_DC_MLX5_FC_WND_SIZE=512
UCX_DC_MLX5_FC_HARD_THRESH=0.250
UCX_DC_MLX5_FENCE=auto
UCX_DC_MLX5_MAX_GET_ZCOPY=auto
UCX_DC_MLX5_TX_NUM_GET_BYTES=inf
UCX_DC_MLX5_DM_SIZE=2K
UCX_DC_MLX5_DM_COUNT=1
UCX_DC_MLX5_MMIO_MODE=auto
UCX_DC_MLX5_TX_MAX_BB=inf
UCX_DC_MLX5_TM_ENABLE=n
UCX_DC_MLX5_TM_LIST_SIZE=1024
UCX_DC_MLX5_TM_SEG_SIZE=48K
UCX_DC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_DC_MLX5_TM_MP_NUM_STRIDES=8
UCX_DC_MLX5_EXP_BACKOFF=0
UCX_DC_MLX5_CYCLIC_SRQ_ENABLE=try
UCX_DC_MLX5_RX_QUEUE_LEN_INIT=128
UCX_DC_MLX5_NUM_DCI=8
UCX_DC_MLX5_TX_POLICY=dcs_quota
UCX_DC_MLX5_RAND_DCI_SEED=0
UCX_DC_MLX5_QUOTA=32
UCX_DC_MLX5_COMPACT_AV=y
UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_UD_VERBS_FAILURE=ERROR
UCX_UD_VERBS_MAX_NUM_EPS=inf
UCX_UD_VERBS_SEG_SIZE=8K
UCX_UD_VERBS_TX_QUEUE_LEN=256
UCX_UD_VERBS_TX_MAX_BATCH=16
UCX_UD_VERBS_TX_MAX_POLL=16
UCX_UD_VERBS_TX_MIN_INLINE=64
UCX_UD_VERBS_TX_INLINE_RESP=0
UCX_UD_VERBS_TX_MIN_SGE=3
UCX_UD_VERBS_TX_MAX_BUFS=-1
UCX_UD_VERBS_TX_BUFS_GROW=1024
UCX_UD_VERBS_RX_QUEUE_LEN=4096
UCX_UD_VERBS_RX_MAX_BATCH=16
UCX_UD_VERBS_RX_MAX_POLL=16
UCX_UD_VERBS_RX_INLINE=0
UCX_UD_VERBS_RX_MAX_BUFS=-1
UCX_UD_VERBS_RX_BUFS_GROW=0
UCX_UD_VERBS_ADDR_TYPE=auto
UCX_UD_VERBS_IS_GLOBAL=n
UCX_UD_VERBS_SL=0
UCX_UD_VERBS_TRAFFIC_CLASS=auto
UCX_UD_VERBS_HOP_LIMIT=255
UCX_UD_VERBS_NUM_PATHS=auto
UCX_UD_VERBS_ROCE_PATH_FACTOR=1
UCX_UD_VERBS_LID_PATH_BITS=0
UCX_UD_VERBS_PKEY=auto
UCX_UD_VERBS_PATH_MTU=default
UCX_UD_VERBS_ENABLE_CUDA_AFFINITY=y
UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128
UCX_UD_VERBS_TIMEOUT=300000000.00us
UCX_UD_VERBS_TIMER_TICK=10000.00us
UCX_UD_VERBS_TIMER_BACKOFF=2.000
UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us
UCX_UD_VERBS_ETH_DGID_CHECK=y
UCX_UD_VERBS_MAX_WINDOW=1025
UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_UD_MLX5_FAILURE=ERROR
UCX_UD_MLX5_MAX_NUM_EPS=inf
UCX_UD_MLX5_SEG_SIZE=8K
UCX_UD_MLX5_TX_QUEUE_LEN=256
UCX_UD_MLX5_TX_MAX_BATCH=16
UCX_UD_MLX5_TX_MAX_POLL=16
UCX_UD_MLX5_TX_MIN_INLINE=64
UCX_UD_MLX5_TX_INLINE_RESP=0
UCX_UD_MLX5_TX_MIN_SGE=3
UCX_UD_MLX5_TX_MAX_BUFS=-1
UCX_UD_MLX5_TX_BUFS_GROW=1024
UCX_UD_MLX5_RX_QUEUE_LEN=4096
UCX_UD_MLX5_RX_MAX_BATCH=16
UCX_UD_MLX5_RX_MAX_POLL=16
UCX_UD_MLX5_RX_INLINE=0
UCX_UD_MLX5_RX_MAX_BUFS=-1
UCX_UD_MLX5_RX_BUFS_GROW=0
UCX_UD_MLX5_ADDR_TYPE=auto
UCX_UD_MLX5_IS_GLOBAL=n
UCX_UD_MLX5_SL=0
UCX_UD_MLX5_TRAFFIC_CLASS=auto
UCX_UD_MLX5_HOP_LIMIT=255
UCX_UD_MLX5_NUM_PATHS=auto
UCX_UD_MLX5_ROCE_PATH_FACTOR=1
UCX_UD_MLX5_LID_PATH_BITS=0
UCX_UD_MLX5_PKEY=auto
UCX_UD_MLX5_PATH_MTU=default
UCX_UD_MLX5_ENABLE_CUDA_AFFINITY=y
UCX_UD_MLX5_RX_QUEUE_LEN_INIT=128
UCX_UD_MLX5_TIMEOUT=300000000.00us
UCX_UD_MLX5_TIMER_TICK=10000.00us
UCX_UD_MLX5_TIMER_BACKOFF=2.000
UCX_UD_MLX5_ASYNC_TIMER_TICK=100000.00us
UCX_UD_MLX5_ETH_DGID_CHECK=y
UCX_UD_MLX5_MAX_WINDOW=1025
UCX_UD_MLX5_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_DM_SIZE=2K
UCX_UD_MLX5_DM_COUNT=1
UCX_UD_MLX5_MMIO_MODE=auto
UCX_UD_MLX5_COMPACT_AV=y
Obviously there was a typo in my previous post: the variable is I_MPI_OFI_PROVIDER (=mlx or =verbs).
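A minimal sketch of how the providers could be compared side by side. It only prints the launch lines instead of running them. The rank count, the binary path, and the `provider_cmd` helper are placeholders made up for illustration, and `I_MPI_DEBUG=5` is added on the assumption that you want the `MPI startup(): libfabric provider:` line in the output of each run.

```shell
# Print (do not run) one Intel MPI launch line per libfabric provider.
# ASSUMPTIONS: HPCG_BIN and NRANKS are placeholders; adjust to your job.
HPCG_BIN="${HPCG_BIN:-./xhpcg_avx2}"
NRANKS="${NRANKS:-2}"

# provider_cmd is a hypothetical helper that formats the command line.
provider_cmd() {
    printf 'I_MPI_OFI_PROVIDER=%s I_MPI_DEBUG=5 mpirun -n %s %s\n' \
        "$1" "$NRANKS" "$HPCG_BIN"
}

for prov in mlx verbs psm3; do
    provider_cmd "$prov"
done
```

Printing the commands first makes it easy to confirm the variable name is spelled correctly before submitting the real jobs.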
Thanks for the information. You wrote
"... this issue only happens when PSM3 is used. I've already tried all supported modes like PSM2, MLX and VERBS. It only happens with PSM3."
I understand that MLX and VERBS both work well.
Does the PSM3 run continue, or does it fail?
Engineering reports that the message "n01:rank471.xhpcg_avx2: unknown link width 0x10" comes from the PSM3 layer. PSM3 tries to determine the speed of the NIC from the port's active width. For some reason your NIC reports a width value that is not covered by the switch-case calculation on the PSM3 side. PSM3 should not fail in that case, though; it simply assumes 100 Gbps if the width is unknown. Here is a link to that part of the code: https://github.com/intel/eth-psm3-fi/blob/master/psm3/psm_verbs_ep.c#L2124
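The value in the message can also be decoded by hand. The sketch below is not the PSM3 code; it assumes the reported width uses the same encoding as the Linux kernel's `enum ib_port_width` (1 = 1X, 2 = 4X, 4 = 8X, 8 = 12X, 16 = 2X), and `decode_ib_width` is a hypothetical helper name. Under that assumption 0x10 is a 2X link, which would be consistent with an HDR port running at 100 Gbps over two lanes.

```shell
# Map an InfiniBand "port active width" code to a lane count.
# ASSUMPTION: codes follow the Linux kernel's enum ib_port_width.
decode_ib_width() {
    # $(( ... )) accepts hex input such as 0x10 and yields decimal.
    case $(( $1 )) in
        1)  echo "1X"  ;;
        2)  echo "4X"  ;;
        4)  echo "8X"  ;;
        8)  echo "12X" ;;
        16) echo "2X"  ;;
        *)  echo "unknown" ;;
    esac
}

decode_ib_width 0x10
```

A lookup that lacks the 16 entry would fall through to its default branch for this NIC, which matches the behaviour described above.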