Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

oneAPI MPI 2021.3.0: unknown link width 0x10

Ferrao__Vinicius

Hello, after upgrading from oneAPI 2021.2.0 to 2021.3.0 I started receiving this message when running the Intel Optimized HPCG benchmark:

 

n01:rank471.xhpcg_avx2: unknown link width 0x10
n26:rank201.xhpcg_avx2: unknown link width 0x10
n04:rank489.xhpcg_avx2: unknown link width 0x10
n03:rank480.xhpcg_avx2: unknown link width 0x10
n26:rank202.xhpcg_avx2: unknown link width 0x10
n23:rank481.xhpcg_avx2: unknown link width 0x10
n43:rank338.xhpcg_avx2: unknown link width 0x10
n42:rank337.xhpcg_avx2: unknown link width 0x10
n4:rank490.xhpcg_avx2: unknown link width 0x10
n26:rank204.xhpcg_avx2: unknown link width 0x10

 

I've observed that the libfabric provider has changed:

[0] MPI startup(): libfabric provider: psm3

It's probably related, but I'd like to understand the link width error itself.
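A minimal sketch, in Python, of how the startup line and the warnings quoted above could be summarized from a captured job log. The regular expressions are assumptions based only on the lines shown in this post, and the sample text is a shortened copy of them.

```python
import re
from collections import Counter

# Patterns inferred from the lines quoted in this post (an assumption,
# not a documented log format).
STARTUP_RE = re.compile(r"MPI startup\(\): libfabric provider: (\S+)")
WIDTH_RE = re.compile(
    r"^(?P<host>[^:\s]+):rank(?P<rank>\d+)\.(?P<exe>\S+): "
    r"unknown link width (?P<code>0x[0-9a-fA-F]+)$"
)

def summarize(log_text):
    """Return (provider, per-host warning counts, set of width codes)."""
    provider = None
    hosts = Counter()
    codes = set()
    for line in log_text.splitlines():
        line = line.strip()
        startup = STARTUP_RE.search(line)
        if startup:
            provider = startup.group(1)
            continue
        warning = WIDTH_RE.match(line)
        if warning:
            hosts[warning.group("host")] += 1
            codes.add(int(warning.group("code"), 16))
    return provider, hosts, codes

# Shortened copy of the output quoted above.
sample = """\
[0] MPI startup(): libfabric provider: psm3
n01:rank471.xhpcg_avx2: unknown link width 0x10
n26:rank201.xhpcg_avx2: unknown link width 0x10
n26:rank202.xhpcg_avx2: unknown link width 0x10
"""

provider, hosts, codes = summarize(sample)
print(provider, dict(hosts), sorted(codes))
```

Grouping by host shows whether every node reports the same width code, which helps tell a fabric-wide property from a single misbehaving port.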

The fabric is based on Mellanox ConnectX-6 adapters running HDR at 100Gbps, with the inbox OFED stack rather than MLNX_OFED.

 

Thanks.

7 Replies
SantoshY_Intel
Moderator

Hi,

 

Thanks for reaching out to us.

Could you please share your environment details, such as the OS and the OFI provider (Mellanox, psm2, or psm3) in use when the issue occurs?

 

Thanks & Regards,

Santosh

 

Ferrao__Vinicius

Hi, sorry for the delayed answer. The email notification went to the spam folder.

But regarding the questions, this issue only happens when PSM3 is used. I've already tried all supported modes like PSM2, MLX and VERBS. It only happens with PSM3.

 

SantoshY_Intel
Moderator

Hi,

 

We haven't heard back from you. Is your issue resolved? If not, could you please share your environment details, such as the OS and the OFI provider (Mellanox, psm2, or psm3) in use when the issue occurs?

 

Thanks & regards,

Santosh

 

Klaus-Dieter_O_Intel

Please can you test I_MPI_OFI_PRVIDER=mlx or =verbs


Please provide the output of:

fi_info -l

ucx_info -f (if available)


Ferrao__Vinicius

Sure, here it is. ucx_info is provided by the OpenHPC package.

 

[root@adano31 ~]# fi_info -l
psm2:
    version: 112.10
mlx:
    version: 1.4
psm3:
    version: 111.20
ofi_rxm:
    version: 111.0
verbs:
    version: 112.10
tcp:
    version: 111.0
sockets:
    version: 112.10
shm:
    version: 112.10
ofi_hook_noop:
    version: 112.10

[root@adano31 ~]# ucx_info -f | grep -v \#

 

 

UCX_LOG_LEVEL=WARN

 

UCX_LOG_FILE=

 

UCX_LOG_FILE_SIZE=inf

 

UCX_LOG_FILE_ROTATE=0

 

UCX_LOG_BUFFER=1K

 

UCX_LOG_DATA_SIZE=0

 

UCX_LOG_PRINT_ENABLE=n

 

UCX_HANDLE_ERRORS=bt

 

UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE

 

UCX_ERROR_MAIL_TO=

 

UCX_ERROR_MAIL_FOOTER=

 

UCX_GDB_COMMAND=gdb -quiet

 

UCX_DEBUG_SIGNO=HUP

 

UCX_LOG_LEVEL_TRIGGER=FATAL

 

UCX_WARN_UNUSED_ENV_VARS=n

 

UCX_ASYNC_MAX_EVENTS=1024

 

UCX_ASYNC_SIGNO=ALRM

 

UCX_PROFILE_MODE=

 

UCX_PROFILE_FILE=ucx_%h_%p.prof

 

UCX_PROFILE_LOG_SIZE=4M

 

UCX_RCACHE_CHECK_PFN=0

 

UCX_MODULE_DIR=/opt/ohpc/pub/mpi/ucx-ohpc/1.9.0/lib/ucx

 

UCX_MODULE_LOG_LEVEL=TRACE

 

UCX_BUILTIN_MEMCPY_MIN=auto

 

UCX_BUILTIN_MEMCPY_MAX=auto

 

 

 

 

UCX_MEM_LOG_LEVEL=WARN

 

UCX_MEM_ALLOC_ALIGN=16

 

UCX_MEM_EVENTS=y

 

UCX_MEM_MMAP_HOOK_MODE=bistro

 

UCX_MEM_MALLOC_HOOKS=y

 

UCX_MEM_MALLOC_RELOC=y

 

UCX_MEM_CUDA_RELOC=y

 

UCX_MEM_DYNAMIC_MMAP_THRESH=y

 

UCX_MEM_DLOPEN_PROCESS_RPATH=y

 

 

 

 

UCX_POSIX_HUGETLB_MODE=try

 

UCX_POSIX_DIR=/dev/shm

 

UCX_POSIX_USE_PROC_LINK=y

 

 

 

 

UCX_MM_ALLOC=md,mmap,heap

 

UCX_MM_FAILURE=ERROR

 

UCX_MM_MAX_NUM_EPS=inf

 

UCX_MM_BW=12179.00MBps

 

UCX_MM_FIFO_SIZE=64

 

UCX_MM_SEG_SIZE=8256

 

UCX_MM_FIFO_RELEASE_FACTOR=0.500

 

UCX_MM_RX_MAX_BUFS=-1

 

UCX_MM_RX_BUFS_GROW=512

 

UCX_MM_FIFO_HUGETLB=n

 

UCX_MM_FIFO_ELEM_SIZE=128

 

UCX_MM_FIFO_MAX_POLL=16

 

 

 

 

UCX_SYSV_HUGETLB_MODE=try

 

 

 

 

UCX_MM_ALLOC=md,mmap,heap

 

UCX_MM_FAILURE=ERROR

 

UCX_MM_MAX_NUM_EPS=inf

 

UCX_MM_BW=12179.00MBps

 

UCX_MM_FIFO_SIZE=64

 

UCX_MM_SEG_SIZE=8256

 

UCX_MM_FIFO_RELEASE_FACTOR=0.500

 

UCX_MM_RX_MAX_BUFS=-1

 

UCX_MM_RX_BUFS_GROW=512

 

UCX_MM_FIFO_HUGETLB=n

 

UCX_MM_FIFO_ELEM_SIZE=128

 

UCX_MM_FIFO_MAX_POLL=16

 

 

 

 

UCX_SELF_ALLOC=huge,thp,md,mmap,heap

 

UCX_SELF_FAILURE=ERROR

 

UCX_SELF_MAX_NUM_EPS=inf

 

UCX_SELF_SEG_SIZE=8K

 

 

 

 

UCX_TCP_ALLOC=huge,thp,md,mmap,heap

 

UCX_TCP_FAILURE=ERROR

 

UCX_TCP_MAX_NUM_EPS=256

 

UCX_TCP_TX_SEG_SIZE=8K

 

UCX_TCP_RX_SEG_SIZE=64K

 

UCX_TCP_MAX_IOV=6

 

UCX_TCP_SENDV_THRESH=2K

 

UCX_TCP_PREFER_DEFAULT=y

 

UCX_TCP_PUT_ENABLE=y

 

UCX_TCP_CONN_NB=n

 

UCX_TCP_MAX_POLL=16

 

UCX_TCP_MAX_CONN_RETRIES=25

 

UCX_TCP_NODELAY=y

 

UCX_TCP_SNDBUF=auto

 

UCX_TCP_RCVBUF=auto

 

UCX_TCP_SYN_CNT=auto

 

UCX_TCP_TX_MAX_BUFS=-1

 

UCX_TCP_TX_BUFS_GROW=8

 

UCX_TCP_RX_MAX_BUFS=-1

 

UCX_TCP_RX_BUFS_GROW=8

 

 

 

 

UCX_TCP_CM_PRIV_DATA_LEN=2K

 

UCX_TCP_CM_SNDBUF=auto

 

UCX_TCP_CM_RCVBUF=auto

 

UCX_TCP_CM_SYN_CNT=auto

 

 

 

 

UCX_SOCKCM_ALLOC=huge,thp,md,mmap,heap

 

UCX_SOCKCM_FAILURE=ERROR

 

UCX_SOCKCM_MAX_NUM_EPS=inf

 

UCX_SOCKCM_BACKLOG=1024

 

 

 

 

 

 

 

UCX_NET_DEVICES=all

 

UCX_SHM_DEVICES=all

 

UCX_ACC_DEVICES=all

 

UCX_SELF_DEVICES=all

 

UCX_TLS=all

 

UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap

 

UCX_SOCKADDR_TLS_PRIORITY=rdmacm,sockcm

 

UCX_SOCKADDR_AUX_TLS=ud

 

UCX_WARN_INVALID_CONFIG=y

 

UCX_BCOPY_THRESH=0

 

UCX_RNDV_THRESH=auto

 

UCX_RNDV_SEND_NBR_THRESH=256K

 

UCX_RNDV_THRESH_FALLBACK=inf

 

UCX_RNDV_PERF_DIFF=1.000

 

UCX_MULTI_LANE_MAX_RATIO=10.000

 

UCX_MAX_EAGER_RAILS=1

 

UCX_MAX_RNDV_RAILS=2

 

UCX_RNDV_SCHEME=auto

 

UCX_RKEY_PTR_SEG_SIZE=512K

 

UCX_ZCOPY_THRESH=auto

 

UCX_BCOPY_BW=auto

 

UCX_ATOMIC_MODE=guess

 

UCX_ADDRESS_DEBUG_INFO=n

 

UCX_MAX_WORKER_NAME=32

 

UCX_USE_MT_MUTEX=n

 

UCX_ADAPTIVE_PROGRESS=y

 

UCX_SEG_SIZE=8K

 

UCX_TM_THRESH=1K

 

UCX_TM_MAX_BB_SIZE=1K

 

UCX_TM_FORCE_THRESH=8K

 

UCX_TM_SW_RNDV=n

 

UCX_NUM_EPS=auto

 

UCX_NUM_PPN=auto

 

UCX_RNDV_FRAG_SIZE=512K

 

UCX_RNDV_PIPELINE_SEND_THRESH=inf

 

UCX_MEMTYPE_CACHE=y

 

UCX_FLUSH_WORKER_EPS=y

 

UCX_UNIFIED_MODE=n

 

UCX_SOCKADDR_CM_ENABLE=n

 

UCX_PROTO_ENABLE=n

 

 

 

 

UCX_IB_REG_METHODS=rcache,odp,direct

 

UCX_IB_RCACHE_MEM_PRIO=1000

 

UCX_IB_RCACHE_OVERHEAD=0.18us

 

UCX_IB_RCACHE_ADDR_ALIGN=16

 

UCX_IB_MEM_REG_OVERHEAD=16.00us

 

UCX_IB_MEM_REG_GROWTH=0.00us

 

UCX_IB_FORK_INIT=try

 

UCX_IB_ASYNC_EVENTS=n

 

UCX_IB_ETH_PAUSE_ON=y

 

UCX_IB_ODP_NUMA_POLICY=preferred

 

UCX_IB_ODP_PREFETCH=n

 

UCX_IB_ODP_MAX_SIZE=auto

 

UCX_IB_DEVICE_SPECS=

 

UCX_IB_PREFER_NEAREST_DEVICE=y

 

UCX_IB_INDIRECT_ATOMIC=y

 

UCX_IB_GID_INDEX=auto

 

UCX_IB_SUBNET_PREFIX=

 

UCX_IB_GPU_DIRECT_RDMA=try

 

UCX_IB_PCI_BW=

 

UCX_IB_MLX5_DEVX=try

 

UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq

 

UCX_IB_REG_MT_THRESH=4G

 

UCX_IB_REG_MT_CHUNK=2G

 

UCX_IB_REG_MT_BIND=n

 

UCX_IB_PCI_RELAXED_ORDERING=auto

 

 

 

 

UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap

 

UCX_RC_VERBS_FAILURE=ERROR

 

UCX_RC_VERBS_MAX_NUM_EPS=256

 

UCX_RC_VERBS_SEG_SIZE=8256

 

UCX_RC_VERBS_TX_QUEUE_LEN=256

 

UCX_RC_VERBS_TX_MAX_BATCH=16

 

UCX_RC_VERBS_TX_MAX_POLL=16

 

UCX_RC_VERBS_TX_MIN_INLINE=64

 

UCX_RC_VERBS_TX_INLINE_RESP=64

 

UCX_RC_VERBS_TX_MIN_SGE=3

 

UCX_RC_VERBS_TX_MAX_BUFS=-1

 

UCX_RC_VERBS_TX_BUFS_GROW=1024

 

UCX_RC_VERBS_RX_QUEUE_LEN=4095

 

UCX_RC_VERBS_RX_MAX_BATCH=16

 

UCX_RC_VERBS_RX_MAX_POLL=16

 

UCX_RC_VERBS_RX_INLINE=64

 

UCX_RC_VERBS_RX_MAX_BUFS=-1

 

UCX_RC_VERBS_RX_BUFS_GROW=0

 

UCX_RC_VERBS_ADDR_TYPE=auto

 

UCX_RC_VERBS_IS_GLOBAL=n

 

UCX_RC_VERBS_SL=0

 

UCX_RC_VERBS_TRAFFIC_CLASS=auto

 

UCX_RC_VERBS_HOP_LIMIT=255

 

UCX_RC_VERBS_NUM_PATHS=auto

 

UCX_RC_VERBS_ROCE_PATH_FACTOR=1

 

UCX_RC_VERBS_LID_PATH_BITS=0

 

UCX_RC_VERBS_PKEY=auto

 

UCX_RC_VERBS_PATH_MTU=default

 

UCX_RC_VERBS_ENABLE_CUDA_AFFINITY=y

 

UCX_RC_VERBS_MAX_RD_ATOMIC=4

 

UCX_RC_VERBS_TIMEOUT=1000000.00us

 

UCX_RC_VERBS_RETRY_COUNT=7

 

UCX_RC_VERBS_RNR_TIMEOUT=1000.00us

 

UCX_RC_VERBS_RNR_RETRY_COUNT=7

 

UCX_RC_VERBS_FC_ENABLE=y

 

UCX_RC_VERBS_FC_WND_SIZE=512

 

UCX_RC_VERBS_FC_HARD_THRESH=0.250

 

UCX_RC_VERBS_FENCE=auto

 

UCX_RC_VERBS_MAX_GET_ZCOPY=auto

 

UCX_RC_VERBS_TX_NUM_GET_BYTES=inf

 

UCX_RC_VERBS_FC_SOFT_THRESH=0.500

 

UCX_RC_VERBS_TX_CQ_MODERATION=64

 

UCX_RC_VERBS_TX_CQ_LEN=4096

 

UCX_RC_VERBS_MAX_AM_HDR=128

 

UCX_RC_VERBS_TX_MAX_WR=inf

 

 

 

 

UCX_RC_MLX5_ALLOC=huge,thp,md,mmap,heap

 

UCX_RC_MLX5_FAILURE=ERROR

 

UCX_RC_MLX5_MAX_NUM_EPS=256

 

UCX_RC_MLX5_SEG_SIZE=8256

 

UCX_RC_MLX5_TX_QUEUE_LEN=256

 

UCX_RC_MLX5_TX_MAX_BATCH=16

 

UCX_RC_MLX5_TX_MAX_POLL=16

 

UCX_RC_MLX5_TX_MIN_INLINE=64

 

UCX_RC_MLX5_TX_INLINE_RESP=64

 

UCX_RC_MLX5_TX_MIN_SGE=3

 

UCX_RC_MLX5_TX_MAX_BUFS=-1

 

UCX_RC_MLX5_TX_BUFS_GROW=1024

 

UCX_RC_MLX5_RX_QUEUE_LEN=4095

 

UCX_RC_MLX5_RX_MAX_BATCH=16

 

UCX_RC_MLX5_RX_MAX_POLL=16

 

UCX_RC_MLX5_RX_INLINE=64

 

UCX_RC_MLX5_RX_MAX_BUFS=-1

 

UCX_RC_MLX5_RX_BUFS_GROW=0

 

UCX_RC_MLX5_ADDR_TYPE=auto

 

UCX_RC_MLX5_IS_GLOBAL=n

 

UCX_RC_MLX5_SL=0

 

UCX_RC_MLX5_TRAFFIC_CLASS=auto

 

UCX_RC_MLX5_HOP_LIMIT=255

 

UCX_RC_MLX5_NUM_PATHS=auto

 

UCX_RC_MLX5_ROCE_PATH_FACTOR=1

 

UCX_RC_MLX5_LID_PATH_BITS=0

 

UCX_RC_MLX5_PKEY=auto

 

UCX_RC_MLX5_PATH_MTU=default

 

UCX_RC_MLX5_ENABLE_CUDA_AFFINITY=y

 

UCX_RC_MLX5_MAX_RD_ATOMIC=4

 

UCX_RC_MLX5_TIMEOUT=1000000.00us

 

UCX_RC_MLX5_RETRY_COUNT=7

 

UCX_RC_MLX5_RNR_TIMEOUT=1000.00us

 

UCX_RC_MLX5_RNR_RETRY_COUNT=7

 

UCX_RC_MLX5_FC_ENABLE=y

 

UCX_RC_MLX5_FC_WND_SIZE=512

 

UCX_RC_MLX5_FC_HARD_THRESH=0.250

 

UCX_RC_MLX5_FENCE=auto

 

UCX_RC_MLX5_MAX_GET_ZCOPY=auto

 

UCX_RC_MLX5_TX_NUM_GET_BYTES=inf

 

UCX_RC_MLX5_FC_SOFT_THRESH=0.500

 

UCX_RC_MLX5_TX_CQ_MODERATION=64

 

UCX_RC_MLX5_TX_CQ_LEN=4096

 

UCX_RC_MLX5_DM_SIZE=2K

 

UCX_RC_MLX5_DM_COUNT=1

 

UCX_RC_MLX5_MMIO_MODE=auto

 

UCX_RC_MLX5_TX_MAX_BB=inf

 

UCX_RC_MLX5_TM_ENABLE=n

 

UCX_RC_MLX5_TM_LIST_SIZE=1024

 

UCX_RC_MLX5_TM_SEG_SIZE=48K

 

UCX_RC_MLX5_TM_MP_SRQ_ENABLE=try

 

UCX_RC_MLX5_TM_MP_NUM_STRIDES=8

 

UCX_RC_MLX5_EXP_BACKOFF=0

 

UCX_RC_MLX5_CYCLIC_SRQ_ENABLE=try

 

 

 

 

UCX_DC_MLX5_ALLOC=huge,thp,md,mmap,heap

 

UCX_DC_MLX5_FAILURE=ERROR

 

UCX_DC_MLX5_MAX_NUM_EPS=inf

 

UCX_DC_MLX5_SEG_SIZE=8256

 

UCX_DC_MLX5_TX_QUEUE_LEN=128

 

UCX_DC_MLX5_TX_MAX_BATCH=16

 

UCX_DC_MLX5_TX_MAX_POLL=16

 

UCX_DC_MLX5_TX_MIN_INLINE=64

 

UCX_DC_MLX5_TX_INLINE_RESP=64

 

UCX_DC_MLX5_TX_MIN_SGE=3

 

UCX_DC_MLX5_TX_MAX_BUFS=-1

 

UCX_DC_MLX5_TX_BUFS_GROW=1024

 

UCX_DC_MLX5_RX_QUEUE_LEN=4095

 

UCX_DC_MLX5_RX_MAX_BATCH=16

 

UCX_DC_MLX5_RX_MAX_POLL=16

 

UCX_DC_MLX5_RX_INLINE=64

 

UCX_DC_MLX5_RX_MAX_BUFS=-1

 

UCX_DC_MLX5_RX_BUFS_GROW=0

 

UCX_DC_MLX5_ADDR_TYPE=auto

 

UCX_DC_MLX5_IS_GLOBAL=n

 

UCX_DC_MLX5_SL=0

 

UCX_DC_MLX5_TRAFFIC_CLASS=auto

 

UCX_DC_MLX5_HOP_LIMIT=255

 

UCX_DC_MLX5_NUM_PATHS=auto

 

UCX_DC_MLX5_ROCE_PATH_FACTOR=1

 

UCX_DC_MLX5_LID_PATH_BITS=0

 

UCX_DC_MLX5_PKEY=auto

 

UCX_DC_MLX5_PATH_MTU=default

 

UCX_DC_MLX5_ENABLE_CUDA_AFFINITY=y

 

UCX_DC_MLX5_MAX_RD_ATOMIC=4

 

UCX_DC_MLX5_TIMEOUT=1000000.00us

 

UCX_DC_MLX5_RETRY_COUNT=7

 

UCX_DC_MLX5_RNR_TIMEOUT=1000.00us

 

UCX_DC_MLX5_RNR_RETRY_COUNT=7

 

UCX_DC_MLX5_FC_ENABLE=y

 

UCX_DC_MLX5_FC_WND_SIZE=512

 

UCX_DC_MLX5_FC_HARD_THRESH=0.250

 

UCX_DC_MLX5_FENCE=auto

 

UCX_DC_MLX5_MAX_GET_ZCOPY=auto

 

UCX_DC_MLX5_TX_NUM_GET_BYTES=inf

 

UCX_DC_MLX5_DM_SIZE=2K

 

UCX_DC_MLX5_DM_COUNT=1

 

UCX_DC_MLX5_MMIO_MODE=auto

 

UCX_DC_MLX5_TX_MAX_BB=inf

 

UCX_DC_MLX5_TM_ENABLE=n

 

UCX_DC_MLX5_TM_LIST_SIZE=1024

 

UCX_DC_MLX5_TM_SEG_SIZE=48K

 

UCX_DC_MLX5_TM_MP_SRQ_ENABLE=try

 

UCX_DC_MLX5_TM_MP_NUM_STRIDES=8

 

UCX_DC_MLX5_EXP_BACKOFF=0

 

UCX_DC_MLX5_CYCLIC_SRQ_ENABLE=try

 

UCX_DC_MLX5_RX_QUEUE_LEN_INIT=128

 

UCX_DC_MLX5_NUM_DCI=8

 

UCX_DC_MLX5_TX_POLICY=dcs_quota

 

UCX_DC_MLX5_RAND_DCI_SEED=0

 

UCX_DC_MLX5_QUOTA=32

 

UCX_DC_MLX5_COMPACT_AV=y

 

 

 

 

UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap

 

UCX_UD_VERBS_FAILURE=ERROR

 

UCX_UD_VERBS_MAX_NUM_EPS=inf

 

UCX_UD_VERBS_SEG_SIZE=8K

 

UCX_UD_VERBS_TX_QUEUE_LEN=256

 

UCX_UD_VERBS_TX_MAX_BATCH=16

 

UCX_UD_VERBS_TX_MAX_POLL=16

 

UCX_UD_VERBS_TX_MIN_INLINE=64

 

UCX_UD_VERBS_TX_INLINE_RESP=0

 

UCX_UD_VERBS_TX_MIN_SGE=3

 

UCX_UD_VERBS_TX_MAX_BUFS=-1

 

UCX_UD_VERBS_TX_BUFS_GROW=1024

 

UCX_UD_VERBS_RX_QUEUE_LEN=4096

 

UCX_UD_VERBS_RX_MAX_BATCH=16

 

UCX_UD_VERBS_RX_MAX_POLL=16

 

UCX_UD_VERBS_RX_INLINE=0

 

UCX_UD_VERBS_RX_MAX_BUFS=-1

 

UCX_UD_VERBS_RX_BUFS_GROW=0

 

UCX_UD_VERBS_ADDR_TYPE=auto

 

UCX_UD_VERBS_IS_GLOBAL=n

 

UCX_UD_VERBS_SL=0

 

UCX_UD_VERBS_TRAFFIC_CLASS=auto

 

UCX_UD_VERBS_HOP_LIMIT=255

 

UCX_UD_VERBS_NUM_PATHS=auto

 

UCX_UD_VERBS_ROCE_PATH_FACTOR=1

 

UCX_UD_VERBS_LID_PATH_BITS=0

 

UCX_UD_VERBS_PKEY=auto

 

UCX_UD_VERBS_PATH_MTU=default

 

UCX_UD_VERBS_ENABLE_CUDA_AFFINITY=y

 

UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128

 

UCX_UD_VERBS_TIMEOUT=300000000.00us

 

UCX_UD_VERBS_TIMER_TICK=10000.00us

 

UCX_UD_VERBS_TIMER_BACKOFF=2.000

 

UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us

 

UCX_UD_VERBS_ETH_DGID_CHECK=y

 

UCX_UD_VERBS_MAX_WINDOW=1025

 

UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64

 

 

 

 

UCX_UD_MLX5_ALLOC=huge,thp,md,mmap,heap

 

UCX_UD_MLX5_FAILURE=ERROR

 

UCX_UD_MLX5_MAX_NUM_EPS=inf

 

UCX_UD_MLX5_SEG_SIZE=8K

 

UCX_UD_MLX5_TX_QUEUE_LEN=256

 

UCX_UD_MLX5_TX_MAX_BATCH=16

 

UCX_UD_MLX5_TX_MAX_POLL=16

 

UCX_UD_MLX5_TX_MIN_INLINE=64

 

UCX_UD_MLX5_TX_INLINE_RESP=0

 

UCX_UD_MLX5_TX_MIN_SGE=3

 

UCX_UD_MLX5_TX_MAX_BUFS=-1

 

UCX_UD_MLX5_TX_BUFS_GROW=1024

 

UCX_UD_MLX5_RX_QUEUE_LEN=4096

 

UCX_UD_MLX5_RX_MAX_BATCH=16

 

UCX_UD_MLX5_RX_MAX_POLL=16

 

UCX_UD_MLX5_RX_INLINE=0

 

UCX_UD_MLX5_RX_MAX_BUFS=-1

 

UCX_UD_MLX5_RX_BUFS_GROW=0

 

UCX_UD_MLX5_ADDR_TYPE=auto

 

UCX_UD_MLX5_IS_GLOBAL=n

 

UCX_UD_MLX5_SL=0

 

UCX_UD_MLX5_TRAFFIC_CLASS=auto

 

UCX_UD_MLX5_HOP_LIMIT=255

 

UCX_UD_MLX5_NUM_PATHS=auto

 

UCX_UD_MLX5_ROCE_PATH_FACTOR=1

 

UCX_UD_MLX5_LID_PATH_BITS=0

 

UCX_UD_MLX5_PKEY=auto

 

UCX_UD_MLX5_PATH_MTU=default

 

UCX_UD_MLX5_ENABLE_CUDA_AFFINITY=y

 

UCX_UD_MLX5_RX_QUEUE_LEN_INIT=128

 

UCX_UD_MLX5_TIMEOUT=300000000.00us

 

UCX_UD_MLX5_TIMER_TICK=10000.00us

 

UCX_UD_MLX5_TIMER_BACKOFF=2.000

 

UCX_UD_MLX5_ASYNC_TIMER_TICK=100000.00us

 

UCX_UD_MLX5_ETH_DGID_CHECK=y

 

UCX_UD_MLX5_MAX_WINDOW=1025

 

UCX_UD_MLX5_RX_ASYNC_MAX_POLL=64

 

UCX_UD_MLX5_DM_SIZE=2K

 

UCX_UD_MLX5_DM_COUNT=1

 

UCX_UD_MLX5_MMIO_MODE=auto

 

UCX_UD_MLX5_COMPACT_AV=y

Klaus-Dieter_O_Intel

Obviously a typo: I_MPI_OFI_PROVIDER=mlx or =verbs
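
A minimal sketch of re-running the job with an explicit provider, written in Python for illustration. I_MPI_OFI_PROVIDER is the variable named above. I_MPI_DEBUG is assumed here as the way to get the "MPI startup()" lines printed, and the rank count and the ./xhpcg_avx2 path are placeholders rather than values from this thread.

```python
import os
import shutil
import subprocess

def provider_env(provider, debug_level=5, base_env=None):
    """Copy an environment and pin the libfabric provider for Intel MPI."""
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_OFI_PROVIDER"] = provider   # variable named in this thread
    env["I_MPI_DEBUG"] = str(debug_level)  # assumed to print the startup lines
    return env

# Hypothetical launch line: rank count and binary path are placeholders.
cmd = ["mpirun", "-n", "4", "./xhpcg_avx2"]

for provider in ("mlx", "verbs"):
    env = provider_env(provider)
    if shutil.which("mpirun") and os.path.exists(cmd[-1]):
        # Only launch when both the launcher and the binary are present.
        subprocess.run(cmd, env=env, check=False)
    else:
        print(f"would run {' '.join(cmd)} with I_MPI_OFI_PROVIDER={provider}")
```

After each run, the "libfabric provider:" startup line should confirm which provider was actually selected.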


Klaus-Dieter_O_Intel

Thanks for the information. You wrote

"... this issue only happens when PSM3 is used. I've already tried all supported modes like PSM2, MLX and VERBS. It only happens with PSM3."


I understand that MLX and VERBS both work well.

Does the PSM3 run continue, or does it fail?


Engineering reports that the message "n01:rank471.xhpcg_avx2: unknown link width 0x10" comes from the PSM3 layer. PSM3 tries to determine the NIC speed from the port's active width. For some reason your NIC reports a width value that is not covered by the switch-case on the PSM3 side. PSM3 should not fail in that case, though: it simply assumes 100Gbps when the width is unknown. Here is a link to that part of the code: https://github.com/intel/eth-psm3-fi/blob/master/psm3/psm_verbs_ep.c#L2124
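
The behaviour described above can be sketched roughly as follows, in Python for illustration. The width codes follow the libibverbs encoding of active_width as I understand it, in which 0x10 denotes a 2x link, which would fit an HDR port running at 100Gbps over two lanes. The LEGACY_WIDTH_LANES table is an assumption standing in for the switch-case and is not copied from the PSM3 source. The 50Gbps per-lane figure is a nominal value used only for the example.

```python
import sys

# Width codes as encoded in ibv_port_attr.active_width (to my understanding).
IB_WIDTH_LANES = {0x01: 1, 0x02: 4, 0x04: 8, 0x08: 12, 0x10: 2}

# Assumption: a table without the 2x code, standing in for the switch-case
# described above. Not taken from the PSM3 source.
LEGACY_WIDTH_LANES = {
    code: lanes for code, lanes in IB_WIDTH_LANES.items() if code != 0x10
}

FALLBACK_GBPS = 100  # the reply states PSM3 assumes 100Gbps for unknown widths

def link_gbps(width_code, lane_gbps, table):
    """Return lanes * per-lane rate, or the fallback if the code is unknown."""
    lanes = table.get(width_code)
    if lanes is None:
        # Mirrors the warning seen in the job output; the run continues.
        print(f"unknown link width {width_code:#x}", file=sys.stderr)
        return FALLBACK_GBPS
    return lanes * lane_gbps

# Table without 0x10: warning on stderr, fallback value returned.
print(link_gbps(0x10, 50, LEGACY_WIDTH_LANES))
# Table that knows 0x10 as 2x: no warning.
print(link_gbps(0x10, 50, IB_WIDTH_LANES))
```

Under these assumptions the fallback and the decoded value agree for this fabric, which would make the message cosmetic here.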

