Intel® MPI Library

NEMO MPI crash when attempting to access one of the elements of a variable

Simon_Senneville
Beginner

Hi,

This is my first post here, so I am not sure which details are needed in this message. Do not hesitate to make a suggestion.

I am compiling NEMO (ocean model) in a singularity container with MPI.

As far as I know, the declarations and allocations are done properly. Compilation goes well. The same code compiles and runs as intended on a different computer with an older IFORT compiler (not in a container).

In the Singularity container, when I run the model on one or more CPUs, I get an error when I try to access a specific index of a particular variable (a one-dimensional array).

For example, if I write the whole variable to the output file ( WRITE(numout, * ) gdept_1d ), I see the values, but if I try to access any particular index of this array ( WRITE(numout, * ) gdept_1d(1) ), the model crashes.
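To make it concrete, here is a minimal, self-contained sketch of the two statements (a local allocatable array stands in for the real gdept_1d and unit 6 stands in for numout; both WRITE statements are standard Fortran):

PROGRAM show_access
   IMPLICIT NONE
   REAL(8), ALLOCATABLE :: gdept_1d(:)   ! stand-in for the 1-D depth array used in NEMO
   INTEGER :: numout
   numout = 6                            ! stand-in for the ocean.output unit
   ALLOCATE( gdept_1d(10) )
   gdept_1d = 0.0d0
   WRITE(numout,*) gdept_1d              ! writing the whole array works
   WRITE(numout,*) gdept_1d(1)           ! accessing a single element is what crashes in the container
END PROGRAM show_access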

 

forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image              PC                Routine            Line        Source

nemo.exe           000000000145D99A  Unknown               Unknown  Unknown

libpthread-2.27.s  00007F8ED11C0980  Unknown               Unknown  Unknown

nemo.exe           00000000008A0C5B  iom_mp_iom_init_          190  iom.f90

nemo.exe           000000000045E4EE  step_mp_stp_              148  step.f90

nemo.exe           000000000041BC9C  nemogcm_mp_nemo_g         145  nemogcm.f90

nemo.exe           000000000041BBE7  MAIN__                     18  nemo.f90

nemo.exe           000000000041BB82  Unknown               Unknown  Unknown

libc-2.27.so       00007F8ED0629BF7  __libc_start_main     Unknown  Unknown

nemo.exe           000000000041BA6A  Unknown               Unknown  Unknown

 

Thank you for any suggestions on how to solve this kind of problem.

 

Simon

VarshaS_Intel
Moderator

Hi,


Thanks for posting in Intel Communities.


Could you please let us know the OS details and the Intel MPI Library and Intel Fortran Compiler versions you are using?


Could you please provide us with the complete steps (the GitHub link of NEMO and the steps you followed to build the NEMO application), and also the steps to reproduce your issue?


>>Compilation goes well. The same code compiles and runs as intended on a different computer with an older IFORT compiler

Could you please let us know the older IFORT and MPI versions with which you are able to run successfully?


Thanks & Regards,

Varsha


Simon_Senneville
Beginner

Hello Varsha,

Here is the information needed to reproduce the problem; I hope I did not forget anything. I think the simplest way is to give you access to the Singularity .sif file. Here is the link to download it with wget.

https://srwebpolr01.uqar.ca/polr/nemo_forum.sif

To get the directory:
singularity build --sandbox nemo_forum nemo_forum.sif

Use the directory:
singularity shell --writable nemo_forum

In singularity:
Singularity> cd /NEMO/NEMOGCM/CONFIG/MY_GYRE_new/EXP00/

There you have the result of running ./opa in this directory. The problem comes from the access to a specific element of the variable "gdept_1d". The line where it crashes is line 606 of the file "/NEMO/NEMOGCM/CONFIG/MY_GYRE_new/MY_SRC/istate.F90". If you erase the files produced by the executable, you will be able to run opa and get the same error.

Singularity> rm ocean.output nemo_status output.namelist.dyn mesh_mask.nc layout.dat
Singularity> ./opa

You can recompile the code by going to

Singularity> cd /NEMO/NEMOGCM/CONFIG/
Singularity> ./makenemo -n MY_GYRE_new -m mpiifort_linux
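If it helps the investigation, the mpiifort_linux arch file could presumably be switched to a debug build before recompiling, to get a line-accurate traceback and runtime bounds checking; for example, replacing the optimization flag with something like (untested, standard ifort options):

-O0 -g -traceback -check bounds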


Do not hesitate to tell me if you need any other information; I really appreciate your help.


For the old system where it works:

[old system ~]$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.164 Build 20150121
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.

Here are our compilation options.

fcm arch file:

%NCDF_INC -I/share/apps/netcdf/ifort/include
%NCDF_LIB -L/share/apps/netcdf/ifort/lib -lnetcdff -lnetcdf
%XIOS_HOME /share/apps/xios/1.0
%XIOS_INC -I%XIOS_HOME/inc
%XIOS_LIB -L%XIOS_HOME/lib -lxios
%FC ifort
%FCFLAGS -r8 -O3 -traceback -openmp %NCDF_INC
%FFLAGS -r8 -O3 -traceback -openmp %NCDF_INC
%LD ifort
%CICE_FPP ${CICECMC_FPP}
%FPPFLAGS -P -C -traditional %CICE_FPP
%LDFLAGS -L/share/apps/intel/impi/5.0.3.048/intel64/lib %XIOS_LIB %NCDF_INC %NCDF_LIB -lstdc++ -openmp -L/usr -L/usr/lib64
%AR ar
%ARFLAGS -r
%MK gmake
%USER_INC %XIOS_INC %NCDF_INC
%USER_LIB %XIOS_LIB %NCDF_LIB
%CPP cpp

[old system : EXP00]# ldd opa

linux-vdso.so.1 => (0x00007fffb35ef000)
libnetcdff.so.6 => not found
libnetcdf.so.7 => not found
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003348000000)
libmpi.so.12 => /share/apps/intel/impi/5.0.3.048/intel64/lib/release_mt/libmpi.so.12 (0x00002ba44812a000)
libmpifort.so.12 => /share/apps/intel/impi/5.0.3.048/intel64/lib/libmpifort.so.12 (0x00002ba4488b6000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003346c00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003347400000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003347000000)
libm.so.6 => /lib64/libm.so.6 (0x0000003346800000)
libiomp5.so => not found
libc.so.6 => /lib64/libc.so.6 (0x0000003346400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003347c00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003346000000)

readelf -h

ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x413a00
Start of program headers: 64 (bytes into file)
Start of section headers: 36329544 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 8
Size of section headers: 64 (bytes)
Number of section headers: 33
Section header string table index: 30

 

VarshaS_Intel
Moderator

Hi,


Thanks for providing the details.


Could you please let us know the OS details and the Intel MPI version with which your application crashed?


Also, could you please confirm whether you are able to get the expected results with Intel MPI when not using the Singularity container?


Could you please provide us with the Singularity container file you are using to run the NEMO application, so that we can investigate your issue further?


Thanks & Regards,

Varsha



VarshaS_Intel
Moderator

Hi,


We have not heard back from you. Could you please provide us with the details mentioned in the previous reply so that we can investigate your issue further?


Thanks & Regards,

Varsha


VarshaS_Intel
Moderator

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need additional information, please post a new question.


Thanks & Regards,

Varsha


Simon_Senneville
Beginner

Sorry about the delay; we were trying to answer your last questions.

 

Since we were able to run NEMO on a virtual machine, we tried to build a new singularity container instead of using your oneapi-hpckit container.

 

NEMO is compiling and running in this new “home-made” container.

 

For the moment, we do not know exactly what is not working when we use your container.

 

Here is the recipe to build our container.

 

docker run -it rockylinux

 

yum update

yum -y install cmake pkgconfig

yum -y groupinstall "Development Tools"

which cmake pkg-config make gcc g++

 

tee > /tmp/oneAPI.repo << EOF

[oneAPI]

name=Intel® oneAPI repository

baseurl=https://yum.repos.intel.com/oneapi

enabled=1

gpgcheck=1

repo_gpgcheck=1

gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB

EOF

 

 

cat /tmp/oneAPI.repo

mv /tmp/oneAPI.repo /etc/yum.repos.d

yum -y install intel-hpckit

 

. /opt/intel/oneapi/setvars.sh

wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.2/src/hdf5-1.12.2.tar.gz
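(the archive then needs to be extracted and the source directory entered before running configure, e.g.:)

tar zxvf hdf5-1.12.2.tar.gz
cd hdf5-1.12.2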

./configure --enable-hl --enable-parallel FC=mpiifort CXX=mpiicpc CC=mpiicc

make

make install

 

vim /etc/ld.so.conf.d/uqar.conf

add the line /usr/local/lib

ldconfig
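(equivalently, without opening an editor:)

echo "/usr/local/lib" > /etc/ld.so.conf.d/uqar.conf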

 

cp hdf5-1.12/hdf/include/* /usr/local/include 

 yum install  libxml2-devel

 

wget https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.0.tar.gz

./configure FC=mpiifort CXX=mpiicpc CC=mpiicc 

 

wget https://github.com/Unidata/netcdf-fortran/archive/refs/tags/v4.5.4.tar.gz

./configure FC=mpiifort CXX=mpiicpc CC=mpiicc
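(as with HDF5, each archive needs to be extracted before running ./configure, and each build is finished with make and make install; for example for netcdf-c, and likewise for netcdf-fortran-4.5.4:)

tar zxvf v4.9.0.tar.gz
cd netcdf-c-4.9.0
./configure FC=mpiifort CXX=mpiicpc CC=mpiicc
make
make install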

 

yum install perl-URI

yum install  perl-Text-Balanced

yum install libcurl-devel

 

Thanks for your time; I hope this can help someone else.

 

Simon
