Intel® MPI Library

parallel_studio_xe_2019_update3_cluster_edition.tgz (mpiexec.hydra - floating point exception)

Matteo_Guglielmi
Beginner

After installing:

parallel_studio_xe_2019_update3_cluster_edition.tgz

on any of the following OSes:

CentOS 7.6 / RHEL 7.6 / RHEL 8.0

and sourcing the corresponding env file:

source /opt/intel/bin/compilervars.sh -arch intel64 -platform linux

the following simple mpirun command:

mpirun -ppn 1 -n 1 -hosts localhost hostname

fails with the following error:

/opt/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin/mpirun: line 103: 63486 Floating point exception mpiexec.hydra "$@" 0<&0

 

Is anybody else experiencing the same issue?

 

Thank you.

3 Replies
Maksim_B_Intel
Employee

The hwloc library may be crashing on your system. Please try again with I_MPI_HYDRA_TOPOLIB=ipl.
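To try this for a single run without changing the rest of the session, the variable can be passed through env. This is only a minimal sketch, assuming a POSIX shell and that compilervars.sh has already been sourced; the helper name run_with_topolib is made up for illustration, and the run is skipped when mpirun is not on PATH:

```shell
# Hypothetical helper: run a command with I_MPI_HYDRA_TOPOLIB set
# only in that command's environment (first argument is the value).
run_with_topolib() {
    topolib="$1"
    shift
    env I_MPI_HYDRA_TOPOLIB="$topolib" "$@"
}

if command -v mpirun >/dev/null 2>&1; then
    # Same test command as in the original post, with the ipl topology library selected.
    run_with_topolib ipl mpirun -ppn 1 -n 1 -hosts localhost hostname \
        || echo "mpirun exited with status $?" >&2
else
    echo "mpirun not found on PATH; source compilervars.sh first" >&2
fi
```

If the run still dies with the same signal, the crash is probably not specific to the hwloc code path.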

Matteo_Guglielmi
Beginner

Setting I_MPI_HYDRA_TOPOLIB to ipl:

export I_MPI_HYDRA_TOPOLIB=ipl

does not change the outcome. The strace output still ends in SIGFPE:

strace mpiexec.hydra -ppn 1 -n 1 -hosts localhost hostname

...

open("/home/dalco/.mpiexec.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/dalco/mpiexec.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 145 entries */, 32768)   = 4544
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
uname({sysname="Linux", nodename="dalcosrv", ...}) = 0
sched_getaffinity(0, 128, [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127]) = 128
--- SIGFPE {si_signo=SIGFPE, si_code=FPE_INTDIV, si_addr=0x4429b6} ---
+++ killed by SIGFPE +++
Floating point exception
 

On a lab machine provided by Intel for testing:

cat /etc/os-release 
NAME="Clear Linux OS"
VERSION=1
ID=clear-linux-os
ID_LIKE=clear-linux-os
VERSION_ID=29400
PRETTY_NAME="Clear Linux OS"
ANSI_COLOR="1;35"
HOME_URL="https://clearlinux.org"
SUPPORT_URL="https://clearlinux.org"
BUG_REPORT_URL="mailto:dev@lists.clearlinux.org"
PRIVACY_POLICY_URL="http://www.intel.com/privacy"
 

the same Parallel Studio installation runs smoothly:

mpiexec.hydra -ppn 1 -n 1 -hosts localhost hostname
clxap1.lab.internal
 

Here is the strace output from the successful run on the lab machine:

getdents64(3</sys/devices/system/cpu>, /* 211 entries */, 32768) = 6656
getdents64(3</sys/devices/system/cpu>, /* 0 entries */, 32768) = 0
close(3</sys/devices/system/cpu>)       = 0
uname({sysname="Linux", nodename="clxap1.lab.internal", ...}) = 0
sched_getaffinity(0, 128, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]) = 40
openat(AT_FDCWD, "/", O_RDONLY|O_DIRECTORY) = 3</>
fcntl(3</>, F_GETFD)                    = 0
fcntl(3</>, F_SETFD, FD_CLOEXEC)        = 0
faccessat(3</>, "sys/bus/cpu/devices/cpu0/topology/thread_siblings", R_OK) = 0
faccessat(3</>, "sys/bus/node/devices/node0/cpumap", R_OK) = 0
uname({sysname="Linux", nodename="clxap1.lab.internal", ...}) = 0
openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 4</sys/devices/system/cpu/online>
...

Maksim_B_Intel
Employee

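OK, start the command under gdb (--args is needed so that gdb passes the options to mpiexec.hydra instead of parsing them itself):

gdb --args mpiexec.hydra -ppn 1 -n 1 -hosts localhost hostname

Type run to start the command, and when gdb reports the floating-point exception, type bt.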

What is the output?
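The same backtrace can also be captured non-interactively, which makes it easier to paste into a reply. A rough sketch, assuming gdb and mpiexec.hydra are both on PATH; the helper name gdb_backtrace and the output file gdb-backtrace.txt are hypothetical:

```shell
# Hypothetical helper: run a program under gdb without an interactive session.
# -batch makes gdb exit once its commands finish, each -ex runs one gdb command,
# and --args hands everything after the program name to the program, not to gdb.
gdb_backtrace() {
    gdb -batch -ex run -ex bt --args "$@"
}

if command -v gdb >/dev/null 2>&1 && command -v mpiexec.hydra >/dev/null 2>&1; then
    # Keep a copy of the output (file name chosen for illustration only).
    gdb_backtrace mpiexec.hydra -ppn 1 -n 1 -hosts localhost hostname 2>&1 \
        | tee gdb-backtrace.txt
else
    echo "gdb or mpiexec.hydra not found on PATH; skipping" >&2
fi
```

Without debug symbols the backtrace may show only raw addresses, but those can still be compared with the si_addr value that strace reports.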

Also, I didn't see any mention of the hardware it fails on.
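One possible way to collect that hardware information on the failing machine is sketched below. It assumes a Linux system with the usual sysfs layout; the function name report_cpu_info is made up for illustration, and lscpu is used only if it is installed:

```shell
# Hypothetical helper: print basic CPU details for a bug report.
report_cpu_info() {
    echo "kernel: $(uname -r)"
    # Kernel-exported CPU ranges from the directory that mpiexec.hydra
    # opens in the strace output above.
    for f in online possible present; do
        path="/sys/devices/system/cpu/$f"
        if [ -r "$path" ]; then
            echo "cpu/$f: $(cat "$path")"
        fi
    done
    # lscpu (from util-linux) summarizes model, sockets, cores, and threads.
    if command -v lscpu >/dev/null 2>&1; then
        lscpu
    fi
}

report_cpu_info
```

Comparing the online and possible ranges between the failing server and the working lab machine may show whether an unusual CPU layout is involved.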
