mpiexec.hydra 2019u4 crashes on AMD Zen2

Hello,

The mpiexec.hydra binary from Intel 2019U4 crashes on Zen2 and Zen1 platforms.

 

 

user@Zen1[pts/0]stream $ mpirun -np 2   /vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1

/vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpirun: line 103:  7399 Floating point exception(core dumped) mpiexec.hydra "$@" 0<&0

 

user@Zen2[pts/1]demo $ mpirun -np 2   /vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1

/vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpirun: line 103: 121108 Floating point exception(core dumped) mpiexec.hydra "$@" 0<&0

An strace reveals that mpiexec.hydra crashes while trying to parse the processor configuration. I believe the cpuinfo binary suffers from the same symptoms.

...

openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, [{d_ino=37, d_off=1, d_reclen=24, d_name=".", d_type=DT_DIR}, {d_ino=9, d_off=2690600, d_reclen=24, d_name="..", d_type=DT_DIR}, {d_ino=170171, d_off=25909499, d_reclen=24, d_name="smt", d_type=DT_DIR}, {d_ino=90582, d_off=25909675, d_reclen=24, d_name="cpu0", d_type=DT_DIR}, {d_ino=90600, d_off=25909851, d_reclen=24, d_name="cpu1", d_type=DT_DIR}, {d_ino=90619, d_off=25910027, d_reclen=24, d_name="cpu2", d_type=DT_DIR}, {d_ino=90638, d_off=25910203, d_reclen=24, d_name="cpu3", d_type=DT_DIR}, {d_ino=90657, d_off=25910379, d_reclen=24, d_name="cpu4", d_type=DT_DIR}, {d_ino=90676, d_off=25910555, d_reclen=24, d_name="cpu5", d_type=DT_DIR}, {d_ino=90695, d_off=25910731, d_reclen=24, d_name="cpu6", d_type=DT_DIR}, {d_ino=90714, d_off=25910907, d_reclen=24, d_name="cpu7", d_type=DT_DIR}, {d_ino=90733, d_off=25911083, d_reclen=24, d_name="cpu8", d_type=DT_DIR}, {d_ino=90752, d_off=141151836, d_reclen=24, d_name="cpu9", d_type=DT_DIR}, {d_ino=222492, d_off=141566558, d_reclen=32, d_name="cpufreq", d_type=DT_DIR}, {d_ino=82070, d_off=285014906, d_reclen=32, d_name="cpuidle", d_type=DT_DIR}, {d_ino=90771, d_off=285015082, d_reclen=32, d_name="cpu10", d_type=DT_DIR}, {d_ino=90790, d_off=285015258, d_reclen=32, d_name="cpu11", d_type=DT_DIR}, {d_ino=90809, d_off=285015434, d_reclen=32, d_name="cpu12", d_type=DT_DIR}, {d_ino=90828, d_off=285015610, d_reclen=32, d_name="cpu13", d_type=DT_DIR}, {d_ino=90847, d_off=285015786, d_reclen=32, d_name="cpu14", d_type=DT_DIR}, {d_ino=90866, d_off=285015962, d_reclen=32, d_name="cpu15", d_type=DT_DIR}, {d_ino=90885, d_off=285016138, d_reclen=32, d_name="cpu16", d_type=DT_DIR}, {d_ino=90904, d_off=285016314, d_reclen=32, d_name="cpu17", d_type=DT_DIR}, {d_ino=90923, d_off=285016490, d_reclen=32, d_name="cpu18", d_type=DT_DIR}, {d_ino=90942, d_off=285016842, d_reclen=32, d_name="cpu19", d_type=DT_DIR}, {d_ino=90961, d_off=285017018, d_reclen=32, d_name="cpu20", d_type=DT_DIR}, {d_ino=90980, d_off=285017194, d_reclen=32, 
d_name="cpu21", d_type=DT_DIR}, {d_ino=90999, d_off=285017370, d_reclen=32, d_name="cpu22", d_type=DT_DIR}, {d_ino=91018, d_off=285017546, d_reclen=32, d_name="cpu23", d_type=DT_DIR}, {d_ino=91037, d_off=285017722, d_reclen=32, d_name="cpu24", d_type=DT_DIR}, {d_ino=91056, d_off=285017898, d_reclen=32, d_name="cpu25", d_type=DT_DIR}, {d_ino=91075, d_off=285018074, d_reclen=32, d_name="cpu26", d_type=DT_DIR}, {d_ino=91094, d_off=285018250, d_reclen=32, d_name="cpu27", d_type=DT_DIR}, {d_ino=91113, d_off=285018426, d_reclen=32, d_name="cpu28", d_type=DT_DIR}, {d_ino=91132, d_off=285018778, d_reclen=32, d_name="cpu29", d_type=DT_DIR}, {d_ino=91151, d_off=285018954, d_reclen=32, d_name="cpu30", d_type=DT_DIR}, {d_ino=91170, d_off=285019130, d_reclen=32, d_name="cpu31", d_type=DT_DIR}, {d_ino=91189, d_off=285019306, d_reclen=32, d_name="cpu32", d_type=DT_DIR}, {d_ino=91208, d_off=285019482, d_reclen=32, d_name="cpu33", d_type=DT_DIR}, {d_ino=91227, d_off=285019658, d_reclen=32, d_name="cpu34", d_type=DT_DIR}, {d_ino=91246, d_off=285019834, d_reclen=32, d_name="cpu35", d_type=DT_DIR}, {d_ino=91265, d_off=285020010, d_reclen=32, d_name="cpu36", d_type=DT_DIR}, {d_ino=91284, d_off=285020186, d_reclen=32, d_name="cpu37", d_type=DT_DIR}, {d_ino=91303, d_off=285020362, d_reclen=32, d_name="cpu38", d_type=DT_DIR}, {d_ino=91322, d_off=285020714, d_reclen=32, d_name="cpu39", d_type=DT_DIR}, {d_ino=91341, d_off=285020890, d_reclen=32, d_name="cpu40", d_type=DT_DIR}, {d_ino=91360, d_off=285021066, d_reclen=32, d_name="cpu41", d_type=DT_DIR}, {d_ino=91379, d_off=285021242, d_reclen=32, d_name="cpu42", d_type=DT_DIR}, {d_ino=91398, d_off=285021418, d_reclen=32, d_name="cpu43", d_type=DT_DIR}, {d_ino=91417, d_off=285021594, d_reclen=32, d_name="cpu44", d_type=DT_DIR}, {d_ino=91436, d_off=285021770, d_reclen=32, d_name="cpu45", d_type=DT_DIR}, {d_ino=91455, d_off=285021946, d_reclen=32, d_name="cpu46", d_type=DT_DIR}, {d_ino=91474, d_off=285022122, d_reclen=32, d_name="cpu47", 
d_type=DT_DIR}, {d_ino=91493, d_off=285022298, d_reclen=32, d_name="cpu48", d_type=DT_DIR}, {d_ino=91512, d_off=285022650, d_reclen=32, d_name="cpu49", d_type=DT_DIR}, {d_ino=91531, d_off=285022826, d_reclen=32, d_name="cpu50", d_type=DT_DIR}, {d_ino=91550, d_off=285023002, d_reclen=32, d_name="cpu51", d_type=DT_DIR}, {d_ino=91569, d_off=285023178, d_reclen=32, d_name="cpu52", d_type=DT_DIR}, {d_ino=91588, d_off=285023354, d_reclen=32, d_name="cpu53", d_type=DT_DIR}, {d_ino=91607, d_off=285023530, d_reclen=32, d_name="cpu54", d_type=DT_DIR}, {d_ino=91626, d_off=285023706, d_reclen=32, d_name="cpu55", d_type=DT_DIR}, {d_ino=91645, d_off=285023882, d_reclen=32, d_name="cpu56", d_type=DT_DIR}, {d_ino=91664, d_off=285024058, d_reclen=32, d_name="cpu57", d_type=DT_DIR}, {d_ino=91683, d_off=285024234, d_reclen=32, d_name="cpu58", d_type=DT_DIR}, {d_ino=91702, d_off=285024586, d_reclen=32, d_name="cpu59", d_type=DT_DIR}, {d_ino=91721, d_off=285024762, d_reclen=32, d_name="cpu60", d_type=DT_DIR}, {d_ino=91740, d_off=285024938, d_reclen=32, d_name="cpu61", d_type=DT_DIR}, {d_ino=91759, d_off=285025114, d_reclen=32, d_name="cpu62", d_type=DT_DIR}, {d_ino=91778, d_off=318580955, d_reclen=32, d_name="cpu63", d_type=DT_DIR}, {d_ino=47, d_off=385790491, d_reclen=32, d_name="power", d_type=DT_DIR}, {d_ino=57, d_off=661204875, d_reclen=40, d_name="vulnerabilities", d_type=DT_DIR}, {d_ino=46, d_off=718872595, d_reclen=32, d_name="modalias", d_type=DT_REG}, {d_ino=42, d_off=900028725, d_reclen=32, d_name="kernel_max", d_type=DT_REG}, {d_ino=40, d_off=1321717208, d_reclen=32, d_name="possible", d_type=DT_REG}, {d_ino=39, d_off=1412398250, d_reclen=32, d_name="online", d_type=DT_REG}, {d_ino=43, d_off=1431608070, d_reclen=32, d_name="offline", d_type=DT_REG}, {d_ino=44, d_off=1472641949, d_reclen=32, d_name="isolated", d_type=DT_REG}, {d_ino=38, d_off=1826905203, d_reclen=32, d_name="uevent", d_type=DT_REG}, {d_ino=45, d_off=1905639739, d_reclen=32, d_name="nohz_full", d_type=DT_REG}, 
{d_ino=197551, d_off=2084586514, d_reclen=32, d_name="microcode", d_type=DT_DIR}, {d_ino=41, d_off=2147483647, d_reclen=32, d_name="present", d_type=DT_REG}], 32768) = 2496
getdents(3, [], 32768)                  = 0
close(3)                                = 0
uname({sysname="Linux", nodename="SERVER", release="3.10.0-1062.1.2.el7.x86_64", version="#1 SMP Mon Sep 30 14:19:46 UTC 2019", machine="x86_64", domainname="houston"}) = 0
sched_getaffinity(0, 128, [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]) = 128
--- SIGFPE {si_signo=SIGFPE, si_code=FPE_INTDIV, si_addr=0x44d325} ---
+++ killed by SIGFPE (core dumped) +++
Floating point exception (core dumped)
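
The trace ends with SIGFPE and si_code=FPE_INTDIV, which is the code for an integer divide fault (typically a division by zero), right after sched_getaffinity returns. To pull that spot out of a long trace, something like the following may help. It is only a sketch: the function name `calls_before_signal` is made up for illustration, and it assumes the trace was saved to a file with `strace -o`.

```shell
#!/bin/sh
# Hypothetical helper: print the last N lines of a saved strace log that
# precede the first signal record ("--- SIG... ---"), plus that record.
calls_before_signal() {
    log=$1          # path to a saved strace log
    n=${2:-3}       # number of preceding lines to show (default 3)
    # Line number of the first signal-delivery record, if there is one.
    line=$(grep -n '^--- SIG' "$log" | head -n 1 | cut -d: -f1)
    [ -n "$line" ] || return 1      # no signal record found
    start=$((line - n))
    [ "$start" -ge 1 ] || start=1   # clamp to the start of the file
    sed -n "${start},${line}p" "$log"
}
```

For example, after `strace -o trace.txt mpiexec.hydra -np 2 hostname`, running `calls_before_signal trace.txt 2` would print the two calls before the signal and the signal line itself.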

 

 

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    1
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7502 32-Core Processor
Stepping:              0
CPU MHz:               1500.000
CPU max MHz:           2500.0000
CPU min MHz:           1500.0000
BogoMIPS:              5000.07
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca
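
As a sanity check on the topology the launcher has to parse, the counts reported by lscpu above should multiply out: 2 sockets x 32 cores x 1 thread = 64 CPUs. The sketch below checks that from lscpu text on stdin. The function name `check_topology` is hypothetical, and the computation that actually fails inside mpiexec.hydra is not visible in this thread, so this only illustrates the kind of topology arithmetic involved.

```shell
#!/bin/sh
# Hypothetical helper: read lscpu output on stdin and verify that
# sockets * cores-per-socket * threads-per-core equals the CPU count.
check_topology() {
    awk -F: '
        # "$2 + 0" converts the padded value field to a number.
        $1 == "CPU(s)"             { cpus    = $2 + 0 }
        $1 == "Thread(s) per core" { threads = $2 + 0 }
        $1 == "Core(s) per socket" { cores   = $2 + 0 }
        $1 == "Socket(s)"          { sockets = $2 + 0 }
        END {
            product = sockets * cores * threads
            printf "%d sockets x %d cores x %d threads = %d (lscpu reports %d CPUs)\n",
                   sockets, cores, threads, product, cpus
            # Exit status 0 only when the counts are consistent and non-zero.
            exit (product == cpus && cpus > 0) ? 0 : 1
        }
    '
}
```

Usage: `lscpu | check_topology`. A non-zero exit status means one of the fields is missing or zero, or the counts do not multiply out.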
 

 

 


Hello,

 

Please check Intel MPI 2019 Update 5 or Update 6. This issue should be fixed there.

--

Best regards, Anatoliy

Employee

Hello Mike,

We hope you have tried the updated version of Intel MPI. Please let us know if this solves your issue.

Regards,

Neeraj

Novice

Thank you for the response!

We are installing Intel 2019U5. I will let you know if the issues have been addressed.

 

regards

Michael 

 

Novice

We tried Intel MPI 2019U5 and mpiexec.hydra does not crash upon start. Update 6 is not out yet. Any ETA?

 

Thank you for the suggestions!

Michael

Moderator

Hi Michael,

IMPI 2019 U6 is already released and available for download. Please try it and let us know. 

Best regards,

Jyotsna

Jyotsna Khemka
Novice

Hi Jyotsna,

Yes, 2019 Update 6 does not crash. Unfortunately, there are still performance issues: the proper provider, 'mlx', does not work at all, and 'verbs', which does work, sustains performance 6x below wire speed.

 

Thanks

Michael


Hi,

Do you get any error messages with the mlx provider?

--

Best regards, Anatoliy
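
One possible way to collect such messages is to raise the logging verbosity before launching the job. The sketch below only prints the environment assignments. It assumes the Intel MPI variable I_MPI_DEBUG and the libfabric variables FI_PROVIDER and FI_LOG_LEVEL behave as documented for the installed versions; the function name `provider_debug_env` and the level 5 are illustrative choices, not something stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print environment assignments that request verbose
# startup and provider logging for the given libfabric provider.
provider_debug_env() {
    provider=${1:-mlx}   # provider to select; defaults to mlx
    printf 'I_MPI_DEBUG=5 FI_PROVIDER=%s FI_LOG_LEVEL=debug\n' "$provider"
}
```

Usage could look like `env $(provider_debug_env mlx) mpirun -np 2 ./IMB-MPI1 2>&1 | tee mlx.log`, then the same with `verbs` to compare the two logs.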

 
