
Change in Traceback Output in Recent Intel Fortran?

Matt_Thompson
Novice

All,

Maybe this is just in my head, but did the Intel Fortran traceback output change recently? It seems like we used to get tracebacks that looked like the ones shown on this page:

https://software.intel.com/content/www/us/en/develop/documentation/fortran-compiler-developer-guide-and-reference/top/compiler-reference/error-handling/handling-run-time-errors/run-time-message-display-and-format.html

a la:

forrtl: error (72): floating overflow
Image        PC         Routine       Line      Source
ovf.exe      08049E4A   MAIN__            14    ovf.f90
ovf.exe      08049F08   Unknown       Unknown   Unknown
ovf.exe      400B3507   Unknown       Unknown   Unknown

but instead we are seeing a lot of:

[borgr168:11828:0:11828] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x1935c) [0x2aab7485b35c]
    1  /usr/lib64/libucs.so.0(+0x19613) [0x2aab7485b613]
    2  /gpfsm/dswdev/bmauer/models/GEOSadas-5_12_4_p23_SLES12_M2-OPS/GEOSadas/Linux/bin/GEOSgcm.x() [0x430f16a]
    3  /gpfsm/dswdev/bmauer/models/GEOSadas-5_12_4_p23_SLES12_M2-OPS/GEOSadas/Linux/bin/GEOSgcm.x() [0x40f4d33]
    4  /gpfsm/dswdev/bmauer/models/GEOSadas-5_12_4_p23_SLES12_M2-OPS/GEOSadas/Linux/bin/GEOSgcm.x() [0x3fefafc]
    5  /gpfsm/dswdev/bmauer/models/GEOSadas-5_12_4_p23_SLES12_M2-OPS/GEOSadas/Linux/bin/GEOSgcm.x() [0xd724458]
    6  /gpfsm/dswdev/bmauer/models/GEOSadas-5_12_4_p23_SLES12_M2-OPS/GEOSadas/Linux/bin/GEOSgcm.x() [0xd726699]
    7  /gpfsm/dswdev/bmauer/models/GEOSadas-5_12_4_p23_SLES12_M2-OPS/GEOSadas/Linux/bin/GEOSgcm.x() [0xd954e7b]

Now, it is entirely possible this is due to a system change (we recently upgraded from SLES 11 to SLES 12), so perhaps a system library is missing? I'm not sure, though, as I can build the toy example from that webpage above and get the "desired" output on both OSs.
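
For reference, the kind of toy test I mean is along these lines (a sketch from memory, not the exact ovf.f90 source from that page):

program ovf
  implicit none
  real :: big
  integer :: i
  big = 1.0e30
  do i = 1, 10
    ! the first multiply already exceeds huge(big); with -fpe0 this
    ! becomes the "error (72): floating overflow" abort shown above
    big = big * big
  end do
  print *, big
end program ovf

Compiled with something like ifort -g -traceback -fpe0 ovf.f90 -o ovf.exe, it gives the forrtl-style traceback shown above.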

So perhaps something in the way we are using Intel Fortran/Intel MPI is causing this? We are generally running Intel 18.0.5 with Intel MPI 19.1.0 or even Intel 19.1.0 with Intel MPI 19.1.0. Or maybe a module we also have loaded (say, for gcc 6.5) might cause it?

Any ideas?

Steve_Lionel
Honored Contributor III

That's a gcc stack trace, not ifort.

Matt_Thompson
Novice

Steve,

Huh. Do you have any idea why gcc would be hijacking the trace? I do have a gcc module loaded, but only because (I think?) we need it for icpc or icc. But when we build, we are definitely Intel all the way:
 

-- The Fortran compiler identification is Intel 18.0.5.20180823
-- The CXX compiler identification is Intel 18.0.5.20180823
-- The C compiler identification is Intel 18.0.5.20180823
-- Check for working Fortran compiler: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/ifort
-- Check for working Fortran compiler: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/ifort - works
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Checking whether /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/ifort supports Fortran 90
-- Checking whether /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/ifort supports Fortran 90 - yes
-- Check for working CXX compiler: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/icpc
-- Check for working CXX compiler: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/icpc - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/icc
-- Check for working C compiler: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/bin/intel64/icc - works

 

Steve_Lionel
Honored Contributor III

I am very weak when it comes to Linux, but I'm guessing that some library routine you called (what is libucs?) is where the error occurred, and that it had code to set up gcc's error handling.

Matt_Thompson
Novice

It looks like UCX (http://www.openucx.org) provides it. Which, I guess, means MPI had a bad time, but when an MPI application crashes, MPI usually fails with it...

We are using Intel MPI, so it must be finding this...but I'm also running on an Omnipath system where libfabric should not care about the Mellanox-oriented UCX... grah. 

Steve_Lionel
Honored Contributor III

Floating invalid suggests that some previous operation resulted in a NaN. Try compiling with -fpe0 and see if anything shakes out.
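
For illustration, a hypothetical snippet like this (not from your application) will quietly produce a NaN by default, but with -fpe0 it aborts right at the offending operation, and -traceback gives you the line:

program invalid
  implicit none
  real :: x, y
  character(len=32) :: arg
  ! take the input from the command line so nothing is folded at compile time
  call get_command_argument(1, arg)
  read (arg, *) x
  y = sqrt(x)   ! for negative x this is a floating invalid operation -> NaN
  print *, y
end program invalid

Built with ifort -g -traceback -fpe0 and run with an argument of -1.0, it should die with a forrtl floating invalid error pointing at the sqrt line.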

Matt_Thompson
Novice

Note: it looks like I was able to trace this to Intel MPI. To the toy example I added:

use MPI

call MPI_Init(ierror)
call MPI_Finalize(ierror)

(with ierror declared as an integer), and if I use Intel MPI 19.0.5 or newer, the traceback moves from the Intel format to the GNU one. I have opened a ticket with Support.
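
For completeness, the whole test is essentially this (a sketch; the real ovf_with_MPI.F90 is just the overflow toy example wrapped in MPI_Init/MPI_Finalize):

program ovf_with_mpi
  use mpi
  implicit none
  integer :: ierror, i
  real :: big

  call MPI_Init(ierror)

  big = 1.0e30
  do i = 1, 10
    big = big * big   ! floating overflow; trapped by -fpe0
  end do
  print *, big

  call MPI_Finalize(ierror)
end program ovf_with_mpi

With that compiled and linked via mpiifort -g -traceback -fpe0, the older Intel MPI stacks still give me the forrtl-style traceback, while 19.0.5 and newer give the GNU-style backtrace.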

Ron_Green
Moderator

It is indeed in MPI somewhere.

But just to be sure: you are compiling AND LINKING with "-g -traceback" along with your other options, and using ifort as your linker rather than plain 'ld', correct?

That said, with the crash inside libucs.so, it is in MPI somewhere.

Ron

Matt_Thompson
Novice

@Ron_Green Indeed, I am compiling with -g -traceback -fpe0 in all tests, as:

mpiifort -g -traceback -fpe0 ovf_with_MPI.F90 -o ovf_with_MPI.x

I assume this would use ifort as the linker?

My support ticket has a lot more info, but I can see things like: at a certain point, our cluster's MPI installs moved from debug_mt to release_mt in the library paths, etc. However, all my efforts at using debug_mt with the newer MPI stacks and the test case have been unsuccessful (I_MPI_LIBRARY_KIND, -link_mpi, even hard-coding it in a copy of mpiifort).

Matt_Thompson
Novice

More updates: working with the Intel MPI team, there is an unsatisfactory but possible way forward. To wit, the default FI_PROVIDER_PATH for the MPI stacks that are "bad" (GNU traceback) for me contains:

-rwxr-xr-x 1 swmgr k3000 243900 Aug 6 2019 libmlx-fi.so
-rwxr-xr-x 1 swmgr k3000 542240 Aug 6 2019 libpsmx2-fi.so
-rwxr-xr-x 1 swmgr k3000 399284 Aug 6 2019 librxm-fi.so
-rwxr-xr-x 1 swmgr k3000 380926 Aug 6 2019 libsockets-fi.so
-rwxr-xr-x 1 swmgr k3000 255854 Aug 6 2019 libtcp-fi.so
-rwxr-xr-x 1 swmgr k3000 354867 Aug 6 2019 libverbs-fi.so

while the "good" ones  have:

-rwxr-xr-x 1 swmgr k3000 542280 Apr 30 2019 libpsmx2-fi.so
-rwxr-xr-x 1 swmgr k3000 399324 Apr 30 2019 librxm-fi.so
-rwxr-xr-x 1 swmgr k3000 380966 Apr 30 2019 libsockets-fi.so
-rwxr-xr-x 1 swmgr k3000 255894 Apr 30 2019 libtcp-fi.so
-rwxr-xr-x 1 swmgr k3000 354907 Apr 30 2019 libverbs-fi.so

The difference is libmlx-fi. If I create a new directory, copy over all the .so files except the libmlx one, and set FI_PROVIDER_PATH to that new directory, I get the Intel traceback back.

At present, about the only thing I can think of doing is creating a new directory for every (>=19.0.5) version of Intel MPI installed on our system, copying over all the .so files except the libmlx one, and creating a new modulefile that overrides FI_PROVIDER_PATH, then asking users to load that module if they have a crash and want the "good" traceback. Not perfect, but I guess it's something. (We have both InfiniBand and Omni-Path clusters on our system, so asking the admins to remove the MLX FI provider seems a bit harsh as, well, I'm guessing it performs better than verbs?)

I'm hoping the Intel MPI folks might come up with something a bit more "elegant" but for now I wanted to update this.
