Hello,
we are running the FEM software OpenGeoSys on our machines. For the most part, all machines are set up in exactly the same way, yet one machine is throwing MPI-related errors. To make sure this has nothing to do with the software itself, I followed this tutorial and tested a plain MPI hello world.
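For reference, the test program is the classic MPI hello world from that tutorial; a minimal sketch of it (the tutorial's exact code may differ slightly):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Initialize the MPI runtime */
    MPI_Init(&argc, &argv);

    /* Query the communicator size and this process's rank */
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Identify the host we are running on */
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from %s, rank %d out of %d\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

Running it with debug logging enabled,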
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -n 4 ./mpi_hello_world
results in the following (excerpt; the complete logs are attached; the hostname is replaced by <hostname_removed>):
...
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 12
libfabric:61673:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Cannot allocate memory(12)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 12
libfabric:61674:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Cannot allocate memory(12)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
libfabric:61673:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Invalid argument(22)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 12
libfabric:61671:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Cannot allocate memory(12)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
libfabric:61671:verbs:fabric:vrb_get_qp_cap():479<info> ibv_create_qp: Invalid argument(22)
libfabric:61671:ofi_mrail:core:fi_param_get_():281<info> variable config=<not set>
libfabric:61671:ofi_mrail:core:fi_param_get_():281<info> variable addr=<not set>
libfabric:61671:ofi_mrail:core:fi_param_get_():281<info> variable addr_strc=<not set>
libfabric:61671:ofi_mrail:core:mrail_parse_env_vars():116<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:61671:core:core:ofi_register_provider():465<info> registering provider: ofi_mrail (113.20)
...
[qelr_create_cq:260]create cq: failed with rc = 22
<hostname_removed>:rank0: PSM3 can't open nic unit: 0 (err=23)
libfabric:61671:psm3:core:psmx3_trx_ctxt_alloc():309<warn> psm2_ep_open returns 23, errno=22
libfabric:61671:psm3:domain:psmx3_domain_close():185<info> refcnt=0
libfabric:61671:psm3:core:psmx3_fabric_close():48<info> refcnt=0
<hostname_removed>:rank0.mpi_hello_world: Unable to create send CQ of size 5080 on qedr0: Invalid argument
<hostname_removed>:rank0.mpi_hello_world: Unable to initialize verbs
libfabric:61671:psm3:core:psmx3_fabric():89<info>
libfabric:61671:core:core:fi_fabric_():1264<info> Opened fabric: psm3
libfabric:61671:psm3:domain:psmx3_domain_open():307<info>
libfabric:61671:psm3:core:fi_param_get_():281<info> variable lock_level=<not set>
libfabric:61671:psm3:core:psmx3_init_tag_layout():147<info> tag layout already set opened domain.
libfabric:61671:psm3:core:psmx3_init_tag_layout():196<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:61671:psm3:av:psmx3_av_open():1065<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:61671:psm3:core:psmx3_trx_ctxt_alloc():298<info> uuid: 548581BC-F40F-6482-555B-144C79CBFC40
libfabric:61671:psm3:core:psmx3_trx_ctxt_alloc():303<info> ep_open_opts: unit=1 port=0
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: <hostname_removed>
Local device: qedr2
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
...
Some more details which might be helpful: we are running MPI on a single node only. We do not use InfiniBand, and there is no InfiniBand hardware installed in our machines.
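Since everything runs on a single node, forcing libfabric's shared-memory provider might be a useful data point, as it bypasses the RDMA hardware entirely (a sketch; I have not yet verified that this changes anything on the failing machine):
FI_PROVIDER=shm I_MPI_DEBUG=30 mpirun -n 4 ./mpi_hello_world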
Output of some commands which might be helpful:
mpirun -n 3 hostname
<hostname_removed>
<hostname_removed>
<hostname_removed>
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 4
Stepping: 7
BogoMIPS: 6600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Caches (sum of all):
L1d: 1 MiB (32 instances)
L1i: 1 MiB (32 instances)
L2: 32 MiB (32 instances)
L3: 99 MiB (4 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60
NUMA node1 CPU(s): 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
NUMA node2 CPU(s): 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62
NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; Enhanced IBRS
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Mitigation; TSX disabled
cat /etc/os-release
NAME="SLES"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"
Thanks a lot!
Hi,
Thanks for posting in Intel Communities.
Could you please provide the following details so that we can reproduce your issue at our end:
1. The Intel oneAPI toolkit and Intel MPI library version
2. Which libfabric provider are you using?
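For item 1, the installed Intel MPI library version can be printed with, for example:
mpirun -V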
>>> One machine is throwing MPI-related errors.
Could you please describe your MPI error in detail?
We tried this on Ubuntu 20.04 LTS and were able to get the expected output. Please check the screenshot below along with the log file.
Thanks & Regards,
Shaik Rabiya
Hi,
sorry for the confusion: I accidentally posted the openmpi4 logs. I was experimenting with different MPI versions, but they all result in the same error. Attached are the correct logs.
>>> 1. The Intel oneAPI toolkit and Intel MPI library version
- The Intel oneAPI toolkit version: none (at the moment we only use MPI and, unrelated, MKL from the toolkit)
- Intel MPI library version: 2021.9.0
>>> 2. Which libfabric provider are you using?
echo $FI_PROVIDER_PATH
/opt/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov:/usr/lib64/libfabric
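As far as I can tell, FI_PROVIDER itself is not set, so libfabric picks a provider on its own (the attached debug log shows it probing verbs, psm3, and mrail). If it helps, the providers available on the machine can be listed with:
fi_info -l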
>>> Could you please describe your MPI error in detail?
I am not quite sure where to start, so let me start with the error I understand best:
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: <hostname_removed>
OMPI source: btl_openib.c:799
Function: opal_free_list_init()
Device: qedr0
Memlock limit: 65536
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
The stated memlock limit should not apply: the user executing mpirun has an unlimited memlock limit, and the other machines do not throw this error. The error also occurs when I run the hello world example as root.
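For reference, ulimit -l reports unlimited for the user in question; the limit is raised the usual way, via entries along the lines of the following in /etc/security/limits.conf:
* soft memlock unlimited
* hard memlock unlimited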
Maybe this is the wrong place to discuss this error, as it does not only occur with Intel MPI. It could be a security configuration issue or incompatible hardware. Even a clean installation of the OS did not fix it.
Let me know if this is something you can help with, or perhaps you can point me in the right direction.
Thanks.
Hi,
We have informed the relevant development team. We will get back to you soon.
Thanks & Regards,
Shaik Rabiya
Hi,
Thank you for your patience. Could you please try compiling the sample with the Intel compilers (i.e., mpiicc), as the latest log file you provided was produced with Open MPI?
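For example (assuming the source file is named mpi_hello_world.c):
mpiicc -o mpi_hello_world mpi_hello_world.c
mpirun -n 4 ./mpi_hello_world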
Also, if you are using a job scheduler, could you please provide its details?
Thanks & Regards,
Shaik Rabiya
Hi,
We haven't heard back from you. Could you please respond to my previous reply?
Thanks & Regards,
Shaik Rabiya
Hi,
We haven't heard back from you. If you need any further information, you can post a new question in the community, as this thread will no longer be monitored by Intel.
Thanks & Regards,
Shaik Rabiya
