Intel® MPI Library

Assertion failed with intel-hpckit

Abdelghany
Novice

Hi, I recently installed intel-hpckit and it worked well with some codes. However, I now hit an assertion failure when I try to run BerkeleyGW.
I tried different numbers of CPUs (4, 9, 16, 19) and get the same error each time.

 

 

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f2aab0e306c]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f2aaaa8cf01]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x38e694) [0x7f2aaa7c7694]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x221c66) [0x7f2aaa65ac66]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x256d8c) [0x7f2aaa68fd8c]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x26d930) [0x7f2aaa6a6930]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x23e7a1) [0x7f2aaa6777a1]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7fddb4e2406c]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7fddb47cdf01]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x38e694) [0x7fddb4508694]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x221c66) [0x7fddb439bc66]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x256d8c) [0x7fddb43d0d8c]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x26d930) [0x7fddb43e7930]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x23e7a1) [0x7fddb43b87a1]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x21cce3) [0x7fddb4396ce3]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x392daa) [0x7fddb450cdaa]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(MPI_Bcast+0x417) [0x7fddb42fd917]
epsilon.cplx.x() [0x2f13112]
epsilon.cplx.x() [0x6550c7]
epsilon.cplx.x() [0x5e1cc6]
epsilon.cplx.x() [0x5ad5a8]
epsilon.cplx.x() [0x436ae4]
epsilon.cplx.x() [0x504e66]
epsilon.cplx.x() [0x4f6a04]
epsilon.cplx.x() [0x409922]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x21cce3) [0x7f2aaa655ce3]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(+0x392daa) [0x7f2aaa7cbdaa]
/opt/intel/oneapi/mpi/2021.11/lib/libmpi.so.12(MPI_Bcast+0x417) [0x7f2aaa5bc917]
epsilon.cplx.x() [0x2f13112]
epsilon.cplx.x() [0x6550c7]
epsilon.cplx.x() [0x5e1cc6]
epsilon.cplx.x() [0x5ad5a8]
epsilon.cplx.x() [0x436ae4]
epsilon.cplx.x() [0x504e66]
epsilon.cplx.x() [0x4f6a04]
epsilon.cplx.x() [0x409922]
/lib64/libc.so.6(__libc_start_main+0xe5) [0x7f2aa985bd85]
epsilon.cplx.x() [0x409829]
/lib64/libc.so.6(__libc_start_main+0xe5) [0x7fddb359cd85]
epsilon.cplx.x() [0x409829]
Abort(1) on node 2: Internal error
Abort(1) on node 0: Internal error

 

Below are the CPU information and OS details:

 

 

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  1
Core(s) per socket:  12
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               151
Model name:          12th Gen Intel(R) Core(TM) i7-12700
BIOS Model name:     12th Gen Intel(R) Core(TM) i7-12700
Stepping:            2
CPU MHz:             2100.000
CPU max MHz:         4900.0000
CPU min MHz:         800.0000
BogoMIPS:            4224.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            25600K
NUMA node0 CPU(s):   0-19
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr flush_l1d arch_capabilities

 

 

 

NAME="Rocky Linux"
VERSION="8.8 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

 

 

I saw some similar problems and tried some of the suggested solutions, such as updating the toolkit (sudo dnf upgrade intel-hpckit) and setting the environment variable I_MPI_SHM_HEAP_VSIZE=0, but I still get the same error.
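For readers trying the same workaround, here is a minimal sketch of one way to make sure the I_MPI_SHM_HEAP_VSIZE=0 setting mentioned above actually reaches the launched command. The function name run_with_impi_env is hypothetical, and adding I_MPI_DEBUG=10 (to get more startup diagnostics from Intel MPI) is an assumption about what might be useful here, not something the original poster reported using.

```shell
# Hypothetical helper: runs the given command with the Intel MPI
# settings placed in that command's environment only.
# I_MPI_SHM_HEAP_VSIZE=0 is the setting tried in the post above;
# I_MPI_DEBUG=10 is an added assumption to get more verbose startup output.
run_with_impi_env() {
    I_MPI_SHM_HEAP_VSIZE=0 I_MPI_DEBUG=10 "$@"
}

# An alternative that should be equivalent with Intel MPI's launcher
# is passing the variable through mpirun itself, for example:
#   mpirun -genv I_MPI_SHM_HEAP_VSIZE 0 -np 16 epsilon.cplx.x
```

It could be used as run_with_impi_env mpirun -np 16 epsilon.cplx.x < epsilon.inp > epsilon.out, which keeps the settings scoped to that single run instead of the whole shell session.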

 

Thanks, and regards,

 

Ragab

 

 

 

 

VeenaJ_Intel
Moderator

Hi,

 

Thanks for posting in Intel communities!

 

Could you please provide detailed steps to reproduce this issue?

 

Regards,

Veena

 

Abdelghany
Novice

Hi Veena,

 

Thank you for your reply.

I am trying to do GW calculations using the Quantum ESPRESSO and BerkeleyGW codes. To do so, I executed the following commands:

 

1- mpirun -np 16 pw.x < scf.in > scf.out
2- mpirun -np 16 pw.x < WFN.in > WFN.out
3- mpirun -np 16 pw.x < WFNq.in > WFNq.out
4- mpirun -np 16 pw.x < WFN_co.in > WFN_co.out
5- mpirun -np 16 epsilon.cplx.x < epsilon.inp > epsilon.out
6- mpirun -np 16 sigma.cplx.x < sigma.inp > sigma.out
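The command sequence above can be sketched as a small script that stops at the first failing step, which makes it easier to see exactly which stage breaks. This is only a sketch: the function names run_step and run_gw_workflow and the LAUNCHER and NP variables are hypothetical, while the executables and file names are the ones listed above.

```shell
# Hypothetical knobs: override in the environment to change the
# launcher or the number of MPI processes.
LAUNCHER=${LAUNCHER:-mpirun}
NP=${NP:-16}

# run_step EXE INPUT OUTPUT
# Runs one stage with INPUT on stdin and OUTPUT capturing stdout.
# Returns non-zero if the launcher fails or INPUT cannot be read.
run_step() {
    printf 'running %s (%s -> %s)\n' "$1" "$2" "$3" >&2
    if ! "$LAUNCHER" -np "$NP" "$1" < "$2" > "$3"; then
        printf 'step failed: %s with input %s\n' "$1" "$2" >&2
        return 1
    fi
}

# Runs the six stages in order; the && chain stops at the first failure.
run_gw_workflow() {
    run_step pw.x           scf.in      scf.out     &&
    run_step pw.x           WFN.in      WFN.out     &&
    run_step pw.x           WFNq.in     WFNq.out    &&
    run_step pw.x           WFN_co.in   WFN_co.out  &&
    run_step epsilon.cplx.x epsilon.inp epsilon.out &&
    run_step sigma.cplx.x   sigma.inp   sigma.out
}
```

Calling run_gw_workflow from the directory that holds the input files would run the whole chain; setting NP=1 reproduces the single-process case.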

 

All steps completed correctly except the last two, which failed with the assertion failure mentioned above after a few iterations. I have attached the epsilon.out file.

Note:

  • The epsilon and sigma output files contain the following warning:

WARNING: The number of cpus does not divide evenly in the optimal number of pools.

1cpus are doing no work

  • The epsilon and sigma calculations went as expected when I used one CPU:
    mpirun -np 1 epsilon.cplx.x < epsilon.inp > epsilon.out
TobiasK
Moderator

@Abdelghany


The calculation crashes after quite some time, so please make sure that you are not simply running out of memory. You can look at the kernel logs and check for OOM killer entries.

Please also reach out to the developers of the codes, who may be able to help triage your issue. With the information you provided, we cannot do much here.
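The OOM check suggested above can be sketched as a small filter over kernel log text. The function name find_oom_lines is hypothetical, and the patterns are the phrases the Linux kernel typically prints when the OOM killer acts; the exact wording can vary between kernel versions, so treat the pattern list as an assumption.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints only
# the lines that look like OOM killer activity (case-insensitive).
find_oom_lines() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Typical ways to feed it (may need root, depending on system settings):
#   dmesg -T | find_oom_lines
#   journalctl -k | find_oom_lines
```

If nothing is printed after a failed run, the kernel log holds no sign of the OOM killer, which would point away from memory exhaustion as the cause.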


Best

Tobias

