Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Help troubleshooting MPI bad termination

fanselm
New Contributor I

We have a user of our software who cannot get our program to run with MPI on his cluster.

The program is a customized version of Python, which we compile, link, and ship with Intel MPI 2018.1.163.

We asked the user to run the most basic test, submitted through SLURM:

#!/bin/bash
#SBATCH -p RM
#SBATCH --nodes=1 # node count
#SBATCH --ntasks-per-node=2 # number of tasks per node
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --time=2:00
#SBATCH --job-name=chk
export I_MPI_DEBUG=5
/path/to/our_program/libexec/mpiexec.hydra -n 2 /path/to/our_program/bin/atkpython -c "print('Working!')"

which should just print "Working!" twice, once for each process. This works fine on his laptop and on literally hundreds, if not thousands, of other computers and clusters. However, when he runs it on his cluster, he simply gets:

[0] MPI startup(): Multi-threaded optimized library

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 51582 RUNNING AT r153
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 51582 RUNNING AT r153
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

 

As you can see, the error hints at a segfault (exit code 139 is 128 + 11, i.e. SIGSEGV). But I would be extremely surprised if this simple "Hello world" example itself segfaulted, since it has been proven to work on many other machines. I suspect instead that there is some problem with the machine or its configuration. One thing that surprises me is that even with I_MPI_DEBUG=5 there is almost no information from the MPI startup phase. Normally I would expect to see more output, and I do see it on both my laptop and on other clusters, where it runs fine. Our program also normally prints some startup text, which is likewise missing, which makes me think that atkpython is not even being started, or at least does not get that far.
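In case it helps, this is roughly the follow-up job I am planning to ask him to run. It is a sketch only: the paths are the same as in the script above, and the extra I_MPI_* variables are the ones I believe Intel MPI 2018 honors for more verbose startup logging and for forcing the simplest fabric, so treat them as a suggestion rather than a verified recipe.

#!/bin/bash
#SBATCH -p RM
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=5:00
#SBATCH --job-name=chk_debug

# More verbose startup logging from the library and the Hydra launcher.
export I_MPI_DEBUG=6
export I_MPI_HYDRA_DEBUG=1
# Force the simplest single-node fabric to rule out fabric selection.
export I_MPI_FABRICS=shm:tcp

# 1) Can the shipped launcher start a trivial non-MPI program?
#    If this also terminates badly, the launcher/runtime on this machine
#    is the problem, not atkpython.
/path/to/our_program/libexec/mpiexec.hydra -n 2 hostname

# 2) Does atkpython start at all outside of MPI?
#    If this segfaults on its own, MPI is not the culprit.
/path/to/our_program/bin/atkpython -c "print('Working!')"

# 3) The original failing case again, now with the extra debug output.
/path/to/our_program/libexec/mpiexec.hydra -n 2 /path/to/our_program/bin/atkpython -c "print('Working!')"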

 

I got this information about his system:

 r149 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             3219.040
BogoMIPS:            4491.91
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
 
Kernel info:

r149 ~]$ uname -srv
Linux 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Thu Oct 22 00:20:22 UTC 2020

Distribution: CentOS 8

Network:

r149 ~]$ lspci | grep -i network
24:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

 

Can someone help me troubleshoot this problem? I honestly don't even know where to start, and unfortunately I don't have access to the machine myself.
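In the meantime, this is the additional information I intend to ask him to collect on a compute node. Again, just a rough sketch; the binary path is the same placeholder as above, and some of the commands (e.g. dmesg) may need extra privileges on his system.

# Which glibc and which MPI library the binary actually picks up.
ldd --version | head -1
ldd /path/to/our_program/bin/atkpython | grep -i mpi

# Whether the segfault left a trace in the kernel log (may require privileges).
dmesg | grep -i segfault | tail

# Allow core dumps inside the job so a backtrace can be taken afterwards.
ulimit -c unlimited

# How SLURM is configured to launch MPI jobs on this cluster.
srun --mpi=list
scontrol show config | grep -i -e mpi -e launch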

2 Replies
JyotsnaK_Intel
Moderator

Hi,

Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI products support. These include processors in the Intel® Core™ family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others, as listed in the Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, and Intel® oneAPI IoT Toolkit System Requirements.

If you wish to use oneAPI on hardware that is not listed at one of the sites above, we encourage you to visit and contribute to the open oneAPI specification: https://www.oneapi.io/spec/

JyotsnaK_Intel
Moderator

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

