Intel® MPI Library

hello_world_mpi: multiple errors potentially centered around "failed on ibv_cmd_create_qp"

kkessler
Beginner

Hello,

We are running the FEM software OpenGeoSys on our machines. All machines are set up in essentially the same way, but one machine is throwing MPI-related errors. To make sure this has nothing to do with the software itself, I followed this tutorial.
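For completeness, the example was built roughly as follows (a sketch; the compiler wrapper and source file name are assumptions based on the tutorial, not copied from the actual session):

# Build sketch: load the oneAPI environment, then compile with the MPI wrapper
# (assumed file name mpi_hello_world.c and GCC-based wrapper mpicc)
source /opt/intel/oneapi/setvars.sh
mpicc -o mpi_hello_world mpi_hello_world.c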

I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -n 4 ./mpi_hello_world

This results in the following (excerpt; the complete logs are attached). The hostname is replaced by <hostname_removed>:

...
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 12
libfabric:61673:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Cannot allocate memory(12)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 12
libfabric:61674:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Cannot allocate memory(12)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
libfabric:61673:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Invalid argument(22)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 12
libfabric:61671:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Cannot allocate memory(12)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
libfabric:61671:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Invalid argument(22)
libfabric:61671:ofi_mrail:core:fi_param_get_():281<info> variable config=<not set>
libfabric:61671:ofi_mrail:core:fi_param_get_():281<info> variable addr=<not set>
libfabric:61671:ofi_mrail:core:fi_param_get_():281<info> variable addr_strc=<not set>
libfabric:61671:ofi_mrail:core:mrail_parse_env_vars():116<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:61671:core:core:ofi_register_provider():465<info> registering provider: ofi_mrail (113.20)

...

[qelr_create_cq:260]create cq: failed with rc = 22
<hostname_removed>:rank0: PSM3 can't open nic unit: 0 (err=23)
libfabric:61671:psm3:core:psmx3_trx_ctxt_alloc():309<warn> psm2_ep_open returns 23, errno=22
libfabric:61671:psm3:domain:psmx3_domain_close():185<info> refcnt=0
libfabric:61671:psm3:core:psmx3_fabric_close():48<info> refcnt=0
<hostname_removed>:rank0.mpi_hello_world: Unable to create send CQ of size 5080 on qedr0: Invalid argument
<hostname_removed>:rank0.mpi_hello_world: Unable to initialize verbs
libfabric:61671:psm3:core:psmx3_fabric():89<info>
libfabric:61671:core:core:fi_fabric_():1264<info> Opened fabric: psm3
libfabric:61671:psm3:domain:psmx3_domain_open():307<info>
libfabric:61671:psm3:core:fi_param_get_():281<info> variable lock_level=<not set>
libfabric:61671:psm3:core:psmx3_init_tag_layout():147<info> tag layout already set opened domain.
libfabric:61671:psm3:core:psmx3_init_tag_layout():196<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:61671:psm3:av:psmx3_av_open():1065<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:61671:psm3:core:psmx3_trx_ctxt_alloc():298<info> uuid: 548581BC-F40F-6482-555B-144C79CBFC40
libfabric:61671:psm3:core:psmx3_trx_ctxt_alloc():303<info> ep_open_opts: unit=1 port=0
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           <hostname_removed>
  Local device:         qedr2
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------

...

 

Some more details which might be helpful: we are running MPI on a single node only. We do not use InfiniBand, and there is no InfiniBand hardware installed in our machines.
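Since everything runs on a single node, I would expect forcing shared memory (or a non-verbs libfabric provider) to sidestep the failing verbs/PSM3 path entirely; roughly along these lines (standard Intel MPI / libfabric variables, shown only as a sketch and not verified to change the outcome on this machine):

# Restrict Intel MPI to shared-memory transport on a single node (sketch, unverified here)
I_MPI_FABRICS=shm mpirun -n 4 ./mpi_hello_world
# Or pin libfabric to the TCP provider instead of verbs/psm3
FI_PROVIDER=tcp I_MPI_DEBUG=30 mpirun -n 4 ./mpi_hello_world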

Output of some commands which might be helpful:

mpirun -n 3 hostname
<hostname_removed>
<hostname_removed>
<hostname_removed>
lscpu
Architecture:           x86_64
  CPU op-mode(s):       32-bit, 64-bit
  Address sizes:        46 bits physical, 48 bits virtual
  Byte Order:           Little Endian
CPU(s):                 64
  On-line CPU(s) list:  0-63
Vendor ID:              GenuineIntel
  Model name:           Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
    CPU family:         6
    Model:              85
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s):          4
    Stepping:           7
    BogoMIPS:           6600.00
    Flags:              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pc
                        lmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd
                        mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_
                        total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Caches (sum of all):
  L1d:                  1 MiB (32 instances)
  L1i:                  1 MiB (32 instances)
  L2:                   32 MiB (32 instances)
  L3:                   99 MiB (4 instances)
NUMA:
  NUMA node(s):         4
  NUMA node0 CPU(s):    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60
  NUMA node1 CPU(s):    1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
  NUMA node2 CPU(s):    2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62
  NUMA node3 CPU(s):    3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63
Vulnerabilities:
  Itlb multihit:        KVM: Mitigation: VMX unsupported
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:             Mitigation; Enhanced IBRS
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                Not affected
  Tsx async abort:      Mitigation; TSX disabled
cat /etc/os-release
NAME="SLES"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"

 Thanks a lot!

RabiyaSK_Intel
Employee

Hi,

 

Thanks for posting in Intel Communities.

 

Could you please provide the following details, so that we can reproduce your issue at our end:

1. The Intel oneAPI toolkit and Intel MPI library version

2. Which libfabric provider are you using?

 

>>>One machine is throwing MPI-related errors.

Could you please describe your MPI error in more detail?

 

We tried this on Ubuntu 20.04 LTS and were able to receive output. Please check the screenshot below along with the log file.

RabiyaSK_Intel_0-1685526909092.png

 

Thanks & Regards,

Shaik Rabiya

 

kkessler
Beginner

Hi,

 

Sorry for a bit of confusion: I accidentally posted the openmpi4 logs. I was experimenting with different MPI versions, but they all result in the same error. Attached are the correct logs.

 

>>> 1. The Intel oneAPI toolkit and Intel MPI library version

- The Intel oneAPI toolkit version: none (at the moment we are only using MPI and, unrelated, MKL from the toolkit)

- Intel MPI library version: 2021.9.0

 

>>> 2. Which libfabric provider are you using?

echo $FI_PROVIDER_PATH
/opt/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov:/usr/lib64/libfabric
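I have not set a provider explicitly, so I am not certain which one is actually selected at runtime; this is how I would check (a sketch, assuming the fi_info utility shipped with Intel MPI is on the PATH):

# List the libfabric providers visible to the runtime (sketch)
fi_info -l
# The I_MPI_DEBUG startup lines should also name the provider that gets picked
I_MPI_DEBUG=5 mpirun -n 2 ./mpi_hello_world 2>&1 | grep -i provider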

>>>Could you please describe your MPI error in more detail?

I am not quite sure where to start, so let us start with the error I understand best:

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    <hostname_removed>
  OMPI source:   btl_openib.c:799
  Function:      opal_free_list_init()
  Device:        qedr0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------

The stated memlock limit should not apply: the user executing mpirun has unlimited memlock, and on the other machines MPI does not throw this error. Even when I run the hello world example as root, the error occurs.
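For reference, this is how I verified the limit (a sketch; the limits.conf entries show the usual way memlock is raised and are included only for completeness):

# Locked-memory limit for the user running mpirun (reports "unlimited" here)
ulimit -l
# Typical /etc/security/limits.conf entries for raising it system-wide
#   *   soft   memlock   unlimited
#   *   hard   memlock   unlimited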

 

Maybe this is the wrong place to discuss this error, as it is not only occurring when using Intel MPI. It could just be a security configuration issue or incompatible hardware. Even a clean installation of the OS did not fix it.

Let me know if this is something you can help with, or perhaps you can point me in the right direction.

 

Thanks.

RabiyaSK_Intel
Employee

Hi,


We have informed the concerned development team. We will get back to you soon.


Thanks & Regards,

Shaik Rabiya


RabiyaSK_Intel
Employee

Hi,

 

Thank you for your patience. Could you please try compiling the sample with the Intel compilers (i.e., mpiicc), as the latest log file you provided was executed with Open MPI?
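For example, something along these lines (a sketch; please adjust the source file name and installation paths to your setup):

# Compile with the Intel compiler wrapper and rerun with debug output (sketch)
source /opt/intel/oneapi/setvars.sh
mpiicc -o mpi_hello_world mpi_hello_world.c
I_MPI_DEBUG=30 mpirun -n 4 ./mpi_hello_world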

 

Also, if you are using any job scheduler, could you please provide its details?

 

Thanks & Regards,

Shaik Rabiya


RabiyaSK_Intel
Employee

Hi,


We haven't heard back from you. Could you please respond to my previous reply?


Thanks & Regards,

Shaik Rabiya


RabiyaSK_Intel
Employee

Hi,


We haven't heard back from you. If you need any further information, you can post a new question on the community, as this thread will no longer be monitored by Intel.


Thanks & Regards,

Shaik Rabiya

