Can't run job with impi

xiong__wang · 03-12-2020

Dear all:

I am currently using the Intel compiler and Intel MPI (IMPI) to run the CICE numerical model, but the run fails every single time. According to the run log, the model returns the error message "rank 0 in job 2 node01_44414 caused collective abort of all ranks exit status of rank 0: killed by signal 11". I tried searching for this error message, but unfortunately I do not know much about MPI and have no idea how to debug the error. One thing I am fairly sure of is that CICE is a widely used numerical model, so I don't think a major bug in the code is causing this error.

Could anyone provide some insight into this error, or tell me what information I should provide to help locate it?

THANKS

Hearns__John · 03-16-2020

Hello Wang. I think the problem is that the CICE code is not available on your compute nodes.

Please do the following:

Log into the cluster login node or head node

Run the command which CICE then ldd `which CICE`

The ldd command will list the libraries which an executable needs. If any libraries are unavailable, we need to investigate.

Now log into node01 or any compute node. Run which CICE and ldd `which CICE`

Do you have the code available on node01 and are all the libraries available?
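
A minimal sketch of the check described above, written as a POSIX shell helper. The function name `filter_missing` is made up for illustration; it only keeps the "not found" lines that ldd prints for libraries it cannot resolve. The `$(...)` form used in the comment is equivalent to the backticks in the post. Note that plain single quotes (ldd 'which CICE') are not equivalent: they pass the literal words to ldd instead of the path.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print only the
# names of shared libraries that the dynamic loader could not resolve.
filter_missing() {
    awk '/not found/ { print $1 }'
}

# Intended use on the login node and again on node01 (the executable name
# "cice" is an assumption; use whatever name your build produced):
#   ldd "$(command -v cice)" | filter_missing
# No output means every library was resolved on that node.
```

If `command -v cice` prints nothing, the executable is not on PATH on that node, which is the first thing to fix.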

Hearns__John · 03-16-2020

Also please tell us a little about the HPC cluster which you are using.

I think the answer is to install CICE on a shared storage area which you have access to.

Running codes from the /home directory is normally a bad idea on HPC clusters.

(This depends of course - if the /home is on a fast parallel filesystem what I said does not apply)

xiong__wang · 03-16-2020

Hearns, John wrote:
Hello Wang. I think the problem is that the CICE code is not available on your compute nodes.
Please do the following:
Log into the cluster login node or head node
Run the command which CICE then ldd `which CICE`
The ldd command will list the libraries which an executable needs. If any libraries are unavailable, we need to investigate.
Now log into node01 or any compute node. Run which CICE and ldd `which CICE`
Do you have the code available on node01 and are all the libraries available?

Hello John. Thank you for your reply.

I logged into node01 and changed to the directory where CICE is located. I typed "which CICE" and got "no CICE in (...... a lot of path)". Then I typed "ldd 'which CICE'" and got "which : ldd: ./which: No such file or directory, CICE: ldd: ./CICE: No such file or directory". I then added the CICE path to PATH with "export PATH=/home/wangxiong/CICE/mycase8:$PATH" (/home/wangxiong/CICE/mycase8 is where CICE is located). After that, I typed "which cice" and got "~/CICE/mycase8/cice".

Then I typed "ldd 'which cice'" and got

"cice:
linux-vdso.so.1 => (0x00007ffc77fe8000)
libnetcdf.so.15 => /usr/local/netcdf-intel-mpi/netcdf/lib/libnetcdf.so.15 (0x00007f77a18a9000)
libnetcdff.so.7 => /usr/local/netcdf-intel-mpi/netcdf/lib/libnetcdff.so.7 (0x00007f77a1409000)
libpnetcdf.so.4 => /usr/local/netcdf-intel-mpi/pnetcdf/lib/libpnetcdf.so.4 (0x00007f77a0a3b000)
libmpifort.so.12 => /opt/impi/5.0.1.035/intel64/lib/libmpifort.so.12 (0x00007f77a07af000)
libmpi.so.12 => /opt/impi/5.0.1.035/intel64/lib/libmpi.so.12 (0x00007f779fde3000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000318ac00000)
librt.so.1 => /lib64/librt.so.1 (0x000000318b800000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000318a800000)
libm.so.6 => /lib64/libm.so.6 (0x000000318b000000)
libc.so.6 => /lib64/libc.so.6 (0x000000318a400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000318d800000)
libhdf5_hl.so.8 => /usr/local/netcdf-intel-mpi/hdf5/lib/libhdf5_hl.so.8 (0x00007f779fb71000)
libhdf5.so.8 => /usr/local/netcdf-intel-mpi/hdf5/lib/libhdf5.so.8 (0x00007f779f595000)
libsz.so.2 => /usr/local/netcdf-intel-mpi/szip/lib/libsz.so.2 (0x00007f779f376000)
libz.so.1 => /lib64/libz.so.1 (0x000000318b400000)
libifport.so.5 => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libifport.so.5 (0x00007f779f147000)
libifcore.so.5 => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libifcore.so.5 (0x00007f779ee10000)
libimf.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00007f779e954000)
libsvml.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00007f779df8a000)
libirc.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libirc.so (0x00007f779dd3b000)
libirng.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00007f779db34000)
libintlc.so.5 => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00007f779d8e6000)
/lib64/ld-linux-x86-64.so.2 (0x0000003189c00000)
"

After that, I recompiled and reran the model. Again I got the error message "rank 3 in job 11 node01_44414 caused collective abort of all ranks exit status of rank 3: killed by signal 11". The error appears to occur right where the calculations should begin.

FYI: I don't know if this information is helpful, but I still want to mention it. About two to three years ago, this server was used to run MITgcm (another numerical model) successfully, also with the Intel compiler and IMPI. Since then, the server has not been used to run models in parallel. I switched to the MITgcm directory and typed "which mitgcmuv". This time I got "./mitgcmuv". When I typed "ldd 'which mitgcmuv'", I also got results like "

mitgcmuv:
linux-vdso.so.1 => (0x00007ffd1a797000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000318ac00000)
libmpi_dbg.so.4 => /opt/impi/5.0.1.035/intel64/lib/libmpi_dbg.so.4 (0x00007fef40ee5000)
libmpigf.so.4 => /opt/impi/5.0.1.035/intel64/lib/libmpigf.so.4 (0x00007fef40c59000)

libpthread.so.0 => /lib64/libpthread.so.0 (0x000000318a800000)
librt.so.1 => /lib64/librt.so.1 (0x000000318b800000)
libm.so.6 => /lib64/libm.so.6 (0x000000318b000000)
libc.so.6 => /lib64/libc.so.6 (0x000000318a400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000318d800000)
/lib64/ld-linux-x86-64.so.2 (0x0000003189c00000)
"

I even tried rerunning the MITgcm case and it succeeded. But I just can't get CICE to run successfully, which is really frustrating.

Hearns__John · 03-17-2020

I would try running the code with 2 processes but only on the login node, i.e. run it in an interactive session with -np 2.

Also set I_MPI_DEBUG=5 before you run the code.

The next step, if that works, is to run between two compute nodes.
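
The two-process test could look roughly like the sketch below. The wrapper name `run_with_mpi_debug` and the executable path `./cice` are placeholders; `I_MPI_DEBUG` and `mpirun -np` are the settings mentioned in this thread. The function body runs in a subshell so the debug setting does not leak into the rest of the session.

```shell
#!/bin/sh
# Hypothetical wrapper: launch a program under mpirun with the Intel MPI
# debug level set for that launch only.
# usage: run_with_mpi_debug LEVEL NPROCS PROGRAM [ARGS...]
run_with_mpi_debug() (
    level=$1
    nprocs=$2
    shift 2
    I_MPI_DEBUG=$level
    export I_MPI_DEBUG
    mpirun -np "$nprocs" "$@"
)
```

From the run directory, something like `run_with_mpi_debug 5 2 ./cice 2>&1 | tee run.log` would keep a copy of the output for posting here (the log file name is arbitrary).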

PrasanthD_intel · 03-17-2020

Hi Wang,

Could you please provide the GitHub link of the CICE code you are working with, so that we can reproduce the issue on our side?

Also, please provide the steps you are following and the environment details (such as OS, compiler, hardware) of your system.

Thanks

Prasanth
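
One possible way to gather the requested environment details in a single step is sketched below. The function name `report_env` is invented for this example, and the list of tools (ifort, icc, mpirun) is an assumption about what an Intel compiler plus IMPI setup would have on PATH. It prints install paths rather than calling version flags, since on this kind of system the paths usually carry the version numbers.

```shell
#!/bin/sh
# Hypothetical helper: print basic environment details for a bug report.
# Each tool is only looked up on PATH, never executed.
report_env() {
    echo "kernel: $(uname -r)"
    # /etc/redhat-release exists on RHEL/CentOS-style systems only.
    [ -r /etc/redhat-release ] && echo "os: $(cat /etc/redhat-release)"
    for tool in ifort icc mpirun; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: $(command -v "$tool")"
        else
            echo "$tool: not found on PATH"
        fi
    done
    # CPU model line, if lscpu is available on this system.
    lscpu 2>/dev/null | grep 'Model name' || true
}
```

Running `report_env > env.txt` on the node where the job fails gives a file that can be attached to the thread.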

xiong__wang · 03-19-2020

Hearns, John wrote:
Also please tell us a little about the HPC cluster which you are using.
I think the answer is to install CICE on a shared storage area which you have access to.
Running codes from the /home directory is normally a bad idea on HPC clusters.
(This depends of course - if the /home is on a fast parallel filesystem what I said does not apply)

Hey John.

What I am currently using is a small server. And here is what I got from 'lscpu'

"

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E7-4830 v4 @ 2.00GHz
Stepping: 1
CPU MHz: 2001.000
BogoMIPS: 3999.91
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,56-69
NUMA node1 CPU(s): 14-27,70-83
NUMA node2 CPU(s): 28-41,84-97
NUMA node3 CPU(s): 42-55,98-111

"

Below is what I got from 'pbsnodes'

"

node01
state = free
np = 112
ntype = cluster
status = rectime=1584627510,varattr=,jobs=,state=free,netload=1049588601,gres=,loadave=62.00,ncpus=112,physmem=264605084kb,availmem=309521516kb,totmem=331724648kb,idletime=687823,nusers=4,nsessions=8,sessions=11365 11369 42732 42765 42843 59303 90588 101768,uname=Linux node01 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003

node02
state = free
np = 36
ntype = cluster
status = rectime=1584627470,varattr=,jobs=,state=free,netload=465920164,gres=,loadave=22.01,ncpus=36,physmem=132250420kb,availmem=192346396kb,totmem=199369952kb,idletime=1483530,nusers=3,nsessions=4,sessions=5401 5405 25630 30630,uname=Linux node02 2.6.32-573.el6.x86_64 #1 SMP Thu Jul 23 15:44:03 UTC 2015 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003

"

xiong__wang · 03-19-2020

Hearns, John wrote:
I would try running the code with 2 processes but only on the login node, i.e. run it in an interactive session with -np 2.
Also set I_MPI_DEBUG=5 before you run the code.

The next step, if that works, is to run between two compute nodes.

Hello John.

I followed your advice and tried to run the job in an interactive session with 2 processes.

First I set "I_MPI_DEBUG=5", but I got almost the same result as in the previous run (which is weird, I think): "Finished writing ./history/iceh_ic.1998-01-01-00000.nc APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)".

After it finished writing the initial condition file, the job terminated and the computation never started. No further error message was given.

So I increased the debug level to 20. But apart from some MPI-related messages that I am not very familiar with, I still haven't found any problem that might cause this run-time error.

I have also uploaded these two logs in case you want to see them. The one named "run-output-dubug5" is the log from the run with "I_MPI_DEBUG=5",

and "run-output-dubug20" is the log from the run with "I_MPI_DEBUG=20".

Thanks again.

Have a good day!

xiong__wang · 03-19-2020

Dwadasi, Prasanth (Intel) wrote:
Hi Wang,
Could you please provide the GitHub link of the CICE code you are working with, so that we can reproduce the issue on our side?
Also, please provide the steps you are following and the environment details (such as OS, compiler, hardware) of your system.

Thanks
Prasanth

Hello Prasanth. Thank you for your reply.

1. This is the main page of CICE project on GitHub.

https://github.com/CICE-Consortium/CICE

2. This is the CICE version index page. What I am currently trying to run is CICE Version 6.1.0

https://github.com/CICE-Consortium/CICE/wiki/CICE-Version-Index

3. Icepack version index page. Icepack is one part of the CICE project. You need to put the Icepack code in the CICE code directory before compiling and running. CICE Version 6.1.0 corresponds to Icepack Version 1.2.0

https://github.com/CICE-Consortium/Icepack/wiki/Icepack-Version-Index

4. Some forcing data also needs to be downloaded in order to do test runs.

https://github.com/CICE-Consortium/CICE/wiki/CICE-Input-Data

https://github.com/CICE-Consortium/Icepack/wiki/Icepack-Input-Data

5. CICE documentation for CICE Version 6.1.0

https://cice-consortium-cice.readthedocs.io/en/cice6.1.0/

Here are the steps for CICE compile and run.

1) Once you have downloaded the CICE and Icepack code and the forcing data, uncompress them.

Copy the Icepack code into the icepack directory under the CICE directory.

2) Porting (tells the CICE model about the compiler setup on this server)

this includes:

cd to configuration/scripts/machines/

Copy an existing env and a Macros file to new names for your new machine

Edit your env and Macros files

cd .. to configuration/scripts/

Edit the cice.batch.csh script to add a section for your machine with batch settings

Edit the cice.batch.csh script to add a section for your machine with job launch settings

Change the variable ICE_MACHINE_INPUTDATA in the env file according to where you put the forcing data.

3) Set up the CICE run directory

cd to the CICE main directory

./cice.setup -c ~/mycase1 -g gx3 -m testmachine (gx3 selects the gx3 grid rather than the gx1 grid; "testmachine" should be replaced with the name you chose for your Macros file in step 2)

4) Compile the code

cd mycase1 ("mycase1" is specified by -c option in step 3)

./cice.build

5) Run the model

./cice.submit

The server is installed with CentOS release 6.9 (Final) and kernel version 2.6.32-696.el6.x86_64.

The IMPI version is : Intel(R) MPI Library, Version 5.0 Update 1 Build 20140709

Intel Compiler Version: Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.1.163 Build 20130313

The hardware and node information is the same lscpu and pbsnodes output that I posted in my reply to John above.

Many thanks.

Have a nice day.

Hearns__John · 03-19-2020

Here is another suggestion.

Can you create a simple MPI 'Hello World' program? Just Google for this.

Compile and link the program, then submit it as a batch job. Set I_MPI_DEBUG=5 in the script.

If a 'hello world' runs fine, we can see that the cluster is working OK.
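
For reference, a small shell sketch that writes such a test program to disk. The helper name `write_hello_mpi` and the default file name `hello_mpi.c` are chosen only for this example; the C source it emits uses only standard MPI calls (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Get_processor_name, MPI_Finalize).

```shell
#!/bin/sh
# Write a minimal MPI "Hello World" in C to the given file
# (default: hello_mpi.c, a name chosen only for this example).
write_hello_mpi() {
    cat > "${1:-hello_mpi.c}" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(host, &len);     /* node the rank runs on */
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
EOF
}
```

After calling `write_hello_mpi`, the file can be compiled with the Intel MPI C wrapper (for example `mpiicc hello_mpi.c -o hello_mpi`) and launched with `mpirun -np 2 ./hello_mpi`. Each rank should print one line naming the node it ran on.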

PrasanthD_intel · 03-26-2020

Hi Wang,

As John suggested, please try to run a sample Hello World program.

You can compile using

mpiicc <Foo.c>

and run using

mpirun -np <number of processes> ./<a.out>

Please tell us whether it runs successfully.

Thanks

Prasanth
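
For the batch-job variant, a job script might be generated along the lines below. This is only a sketch under several assumptions: that the scheduler is Torque/PBS (suggested by the pbsnodes output earlier in the thread), that it accepts the `nodes=1:ppn=N` resource syntax, and that the compiled test program is called `hello_mpi`. The function name `write_pbs_job`, the job name, and the file names are invented for the example.

```shell
#!/bin/sh
# Hypothetical generator: write a Torque/PBS job script that runs an MPI
# program with I_MPI_DEBUG=5.
# usage: write_pbs_job SCRIPT NPROCS PROGRAM
write_pbs_job() {
    cat > "$1" <<EOF
#!/bin/sh
#PBS -N mpi_hello
#PBS -l nodes=1:ppn=$2
#PBS -j oe

# Start in the directory the job was submitted from.
cd "\$PBS_O_WORKDIR" || exit 1

export I_MPI_DEBUG=5
mpirun -np $2 $3
EOF
}
```

For example, `write_pbs_job job.pbs 2 ./hello_mpi` followed by `qsub job.pbs` would submit a two-process run on one node. The `-j oe` line merges stdout and stderr into a single output file, which keeps the debug messages next to the program output.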

PrasanthD_intel · 03-31-2020

Hi Wang,

The IMPI version you are using, 5.0, is outdated and no longer supported.

Can you update to the latest IMPI version and check whether the error persists?

Also, if possible, could you try other MPI implementations and see whether you get any errors?

Thanks

Prasanth

PrasanthD_intel · 04-27-2020

Hi Wang,

We are closing this thread on the assumption that your issue has been resolved.

Please raise a new thread for any further questions.

Regards

Prasanth