Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Assert in MPI

frizwi
Beginner

I'm getting an assert from the Intel MPI library (2021.6.0) as follows:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x151d52c6abcc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x151d52644df1]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x14e4f01f8bcc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x14cb51d87bcc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x151d52313eb9]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x14e4efbd2df1]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x176602) [0x151d521d8602]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x14e4ef8a1eb9]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x14cb51761df1]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x1ab82d) [0x151d5220d82d]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x176602) [0x14e4ef766602]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x19d1cc) [0x151d521ff1cc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x1ab82d) [0x14e4ef79b82d]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x14cb51430eb9]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x151d521d37ec]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x19d1cc) [0x14e4ef78d1cc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x176602) [0x14cb512f5602]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b389f) [0x151d5231589f]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x14e4ef7617ec]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x1ab82d) [0x14cb5132a82d]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x6d8895) [0x151d5273a895]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b389f) [0x14e4ef8a389f]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x6d7c10) [0x151d52739c10]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x6d8895) [0x14e4efcc8895]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x15145f730bcc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x19d1cc) [0x14cb5131c1cc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x14cb512f07ec]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x6d7c10) [0x14e4efcc7c10]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x29457d) [0x14e4ef88457d]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b389f) [0x14cb5143289f]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x15145f10adf1]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b2b52) [0x14e4ef8a2b52]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x6d8895) [0x14cb51857895]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(MPI_Win_create+0x3dc) [0x14e4efe969dc]
/apps/intel-mpi/2021.5.1/lib/release/libmpi.so.12(+0x6d7c10) [0x14cb51856c10]

This seems to originate from the MPI_Win_create call. I'm not exactly sure what triggers it, but it seems to be related to creating windows that do not expose any memory. There doesn't seem to be any limit on creating windows with distinct pointers and sizes, but creating a few hundred windows that expose no memory triggers the assertion above.

So, firstly, are the following valid in Intel MPI (they are with Open MPI)?

  1. MPI_Win_create(NULL, 0, sizeof(int) ....)
  2. MPI_Win_create(&dummy, 0, sizeof(int) ...)
  3. MPI_Win_create(&dummy, 1*sizeof(int), sizeof(int) ...)

Where dummy is defined as a global int.

For cases where a process does not expose any memory, I was using 1. (I've also tried the constant MPI_BOTTOM but get the same assert), which causes the assert. I then tried 2. to trick it, but that also trips the assert, so I am now using 3., which does not cause the asserts but doesn't seem right.

Thanks for any advice

HemanthCH_Intel
Moderator

Hi,


Thank you for posting in Intel Communities.


Could you please provide the following details so that we can investigate your issue further?

1. OS details and CPU details.

2. Complete reproducer code and steps to reproduce your issue.

3. MPI library version (2021.6 or 2021.5.1). You can find the MPI library version using the command below:

mpirun --version

4. Provide the complete debug log using the command below:

I_MPI_DEBUG=10 mpirun -n <number of processes> -ppn <processes per node> ./a.out



Thanks & Regards,

Hemanth



frizwi
Beginner

Hi Hemanth,

This will do it:

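#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

/* Number of windows to create */
#define NVARS 600

int main(int argc, char **argv)
{
  int i;
  MPI_Win *mpi_wins;

  MPI_Init(&argc, &argv);

  mpi_wins = (MPI_Win *)malloc(NVARS * sizeof(MPI_Win));

  /* Every rank creates NVARS windows that expose no memory (NULL base, size 0) */
  for (i=0; i<NVARS; i++)
    MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &mpi_wins[i]);

  /* The windows are not freed before MPI_Finalize in this reproducer */
  MPI_Finalize();

  return(0);
}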

[ffr599@gadi-login-08 ems-sim-w2w]$ uname -a
Linux gadi-login-08.gadi.nci.org.au 4.18.0-348.20.1.el8.nci.x86_64 #1 SMP Wed Mar 16 11:37:35 AEDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Running on this HPC cluster: https://nci.org.au/our-systems/hpc-systems

OS: Rocky Linux 8

CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz

[ffr599@gadi-login-08 ems-sim-w2w]$ icc --version
icc (ICC) 2021.6.0 20220226
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

[ffr599@gadi-login-08 ems-sim-w2w]$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

Output of the following command:

I_MPI_DEBUG=10 mpirun -np 4 ./test_win

[0] MPI startup(): Intel(R) MPI Library, Version 2021.6  Build 20220227 (id: 28877f3f32)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1452 MB per rank) * (4 local ranks) = 5811 MB total
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/apps/intel-mpi/2021.6.0/etc/tuning_clx-ap_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/apps/intel-mpi/2021.6.0/etc/tuning_clx-ap_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 20 (TAG_UB value: 1048575) 
[0] MPI startup(): source bits available: 21 (Maximal number of rank: 2097151) 
[0] MPI startup(): Rank    Pid      Node name                      Pin cpu
[0] MPI startup(): 0       2184546  gadi-login-08.gadi.nci.org.au  {0,1,2,3,7,8,12,13,14,18,19,20}
[0] MPI startup(): 1       2184547  gadi-login-08.gadi.nci.org.au  {4,5,6,9,10,11,15,16,17,21,22,23}
[0] MPI startup(): 2       2184548  gadi-login-08.gadi.nci.org.au  {24,25,26,27,31,32,36,37,38,42,43,44}
[0] MPI startup(): 3       2184549  gadi-login-08.gadi.nci.org.au  {28,29,30,33,34,35,39,40,41,45,46,47}
[0] MPI startup(): I_MPI_LIBRARY_KIND=release
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/apps/intel-mpi/2021.6.0
[0] MPI startup(): I_MPI_LINK=opt
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/default/bin/pbs_tmrsh
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_BRANCH_COUNT=0
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=rsh
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2279: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2279: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2279: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2279: comm->shm_numa_layout[my_numa_node].base_addr
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f938647952c]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f9385dfdc91]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x264cc6) [0x7f9385b38cc6]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x16a7a2) [0x7f9385a3e7a2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x19e9cd) [0x7f9385a729cd]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x190598) [0x7f9385a64598]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x165780) [0x7f9385a39780]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x26672f) [0x7f9385b3a72f]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x632df8) [0x7f9385f06df8]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x247467) [0x7f9385b1b467]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x2659e2) [0x7f9385b399e2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPI_Win_create+0x3dc) [0x7f9386070a0c]
./test_win() [0x400f14]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7f93847ef493]
./test_win() [0x400dde]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7fb745c4e52c]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7fb7455d2c91]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x264cc6) [0x7fb74530dcc6]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x16a7a2) [0x7fb7452137a2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x19e9cd) [0x7fb7452479cd]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x190598) [0x7fb745239598]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x165780) [0x7fb74520e780]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x26672f) [0x7fb74530f72f]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x632df8) [0x7fb7456dbdf8]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x247467) [0x7fb7452f0467]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x2659e2) [0x7fb74530e9e2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPI_Win_create+0x3dc) [0x7fb745845a0c]
./test_win() [0x400f14]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7fb743fc4493]
./test_win() [0x400dde]
Abort(1) on node 3: Internal error
Abort(1) on node 0: Internal error
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f9d4dcda52c]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f9d4d65ec91]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x264cc6) [0x7f9d4d399cc6]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x16a7a2) [0x7f9d4d29f7a2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x19e9cd) [0x7f9d4d2d39cd]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x190598) [0x7f9d4d2c5598]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x165780) [0x7f9d4d29a780]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x26672f) [0x7f9d4d39b72f]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x632df8) [0x7f9d4d767df8]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x247467) [0x7f9d4d37c467]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x2659e2) [0x7f9d4d39a9e2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPI_Win_create+0x3dc) [0x7f9d4d8d1a0c]
./test_win() [0x400f14]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7f9d4c050493]
./test_win() [0x400dde]
Abort(1) on node 1: Internal error
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7fc5eacc752c]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7fc5ea64bc91]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x264cc6) [0x7fc5ea386cc6]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x16a7a2) [0x7fc5ea28c7a2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x19e9cd) [0x7fc5ea2c09cd]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x190598) [0x7fc5ea2b2598]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x165780) [0x7fc5ea287780]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x26672f) [0x7fc5ea38872f]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x632df8) [0x7fc5ea754df8]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x247467) [0x7fc5ea369467]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(+0x2659e2) [0x7fc5ea3879e2]
/apps/intel-mpi/2021.6.0/lib/release/libmpi.so.12(MPI_Win_create+0x3dc) [0x7fc5ea8bea0c]
./test_win() [0x400f14]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7fc5e903d493]
./test_win() [0x400dde]
Abort(1) on node 2: Internal error
[ffr599@gadi-login-08 ems-sim-w2w]$ 

 

frizwi
Beginner

Hi @HemanthCH_Intel, any updates on this?

Thanks!

HemanthCH_Intel
Moderator

Hi, 


We tried running the code on our end with the configuration below, but we couldn't reproduce your issue.

OS details: Rocky Linux 8

MPI version: 2021.6

Job Scheduler: Slurm

FI_Provider: MLX


Could you please confirm whether you are using the PBS job scheduler?


Thanks & Regards,

Hemanth



Ben3
Beginner

Hi Hemanth,

I'm one of the admins for the cluster in question here. I can reproduce this issue in a plain SSH session, no batch job required. It also only seems to occur for -np 3 or -np 4 -- otherwise, it runs fine with either more or fewer ranks.

It's also not the first call to MPI_Win_create that generates the assertion failure -- it's only when i=454. And if I change the test code to actually expose something rather than just passing a NULL address and 0 for the size, then it works for all rank counts.

Given the particular assertion that's failing, I'm wondering if it's hardware dependent, e.g. the association of the ranks with cores and NUMA domains, and that's why you can't reproduce it on your end.

Let us know if there's any other information that would be helpful to debug this, or if it would be better submitted via IPS.

Thanks,
Ben

HemanthCH_Intel
Moderator

Hi,

 

Thanks for the information.



Could you please provide the CPU information using the command below:

lscpu

 

Thanks & Regards,

Hemanth

 

frizwi
Beginner

Hi Hemanth,

Here it is:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
Stepping:            7
CPU MHz:             2900.000
CPU max MHz:         3900.0000
CPU min MHz:         1200.0000
BogoMIPS:            5800.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-3,7,8,12-14,18-20
NUMA node1 CPU(s):   4-6,9-11,15-17,21-23
NUMA node2 CPU(s):   24-27,31-33,37-39,43,44
NUMA node3 CPU(s):   28-30,34-36,40-42,45-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
HemanthCH_Intel
Moderator

Hi,

 

We are working on your issue and will get back to you soon.

 

Thanks & Regards,

Hemanth

 

HemanthCH_Intel
Moderator

Hi,


Could you please provide the OS details (the minor version of the OS) using the command below:


$ cat /etc/os-release


Thanks & Regards,

Hemanth


frizwi
Beginner

Hi Hemanth.

Here it is.

Cheers,

-Farhan

 

[ffr599@gadi-login-07 ~]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.6 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.6 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
HemanthCH_Intel
Moderator

Hi,


We are working on your issue internally and will get back to you soon.


Thanks & Regards,

Hemanth


DrAmarpal_K_Intel

Hi frizwi/Ben,


Can you please rerun your application with the environment variable below set and share your findings?

$ export I_MPI_SHM_HEAP_VSIZE=0




frizwi
Beginner

With I_MPI_SHM_HEAP_VSIZE=0, it works with no errors on my test program.

 

Next I'll try my full application to see how that goes. What are the consequences of setting this environment variable?

DrAmarpal_K_Intel

Hello frizwi.


With I_MPI_SHM_HEAP_VSIZE=0, the shared memory allocator is disabled. Some implementations of collective operations rely on the SHM heap. This setting therefore disables those algorithms, which might result in a performance hit. The performance hit, if any, will depend on the nature of your application.




DrAmarpal_K_Intel

Hi frizwi,


Just wanted to check whether there is anything else we can help you with before we close this thread.


DrAmarpal_K_Intel

In light of the workaround provided (& confirmed) and subsequent inactivity on this thread, this issue is assumed to be resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

 
