xiong__wang
Beginner
213 Views

Can't run job with impi

Dear all:

I am currently using the Intel compiler and Intel MPI (IMPI) to run the CICE numerical model, but the run fails every time. According to the run log, the model returns the error message "rank 0 in job 2  node01_44414   caused collective abort of all ranks exit status of rank 0: killed by signal 11". I tried searching for this error message, but unfortunately I do not know much about MPI and have no idea how to debug it. One thing I am fairly sure of is that CICE is a widely used numerical model, so I don't think a major bug in the code is causing this error.

Can anyone provide some insight into this error, or tell me what information I should provide to help locate it?

THANKS 

12 Replies
Hearns__John
Beginner

Hello Wang.  I think the problem is that the CICE code is not available on your compute nodes.

Please do the following:

Log into the cluster login node or head node

Run the command  which CICE    then  ldd `which CICE`

The ldd command lists the libraries which an executable needs. If any libraries are unavailable, we need to investigate.

Now log into node01 or any compute node. Run which CICE and ldd `which CICE` there as well.

Do you have the code available on node01 and are all the libraries available?
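The check described above can be sketched as a small shell snippet to run on both the login node and node01. This is only an illustration: the helper name missing_libs is made up here, and the executable name cice is taken from the reply further down in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print any shared libraries that the dynamic loader
# cannot resolve for the given executable.
# Usage: missing_libs /path/to/executable
missing_libs() {
    # ldd marks unresolved dependencies with "not found"
    ldd "$1" 2>/dev/null | grep "not found"
}

# "cice" is the executable name reported later in this thread;
# command -v resolves it through PATH much like `which`.
exe=$(command -v cice || true)
if [ -n "$exe" ]; then
    if missing_libs "$exe"; then
        echo "some libraries are missing on $(uname -n)"
    else
        echo "all libraries resolved on $(uname -n)"
    fi
else
    echo "cice is not on PATH on $(uname -n)"
fi
```

If the output differs between the login node and the compute node, that difference is the thing to investigate.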

Hearns__John
Beginner

Also please tell us a little about the HPC cluster which you are using.

I think the answer is to install CICE on a shared storage area which you have access to.

Running codes from the /home directory is normally a bad idea on HPC clusters

(This depends of course - if the /home is on a fast parallel filesystem what I said does not apply)

xiong__wang
Beginner

Hello John. Thank you for your reply. 

I logged into node01 and changed to the directory where CICE is located. I typed which CICE and got "no CICE in (...... a lot of path)". Then I typed "ldd 'which CICE'" and got "which : ldd: ./which: No such file or directory, CICE: ldd: ./CICE: No such file or directory". I then added the CICE path to PATH with "export PATH=/home/wangxiong/CICE/mycase8:$PATH" (/home/wangxiong/CICE/mycase8 is where CICE is located). After that, I typed "which cice" and got "~/CICE/mycase8/cice".

I then typed "ldd 'which cice'" and got

"cice:
        linux-vdso.so.1 =>  (0x00007ffc77fe8000)
        libnetcdf.so.15 => /usr/local/netcdf-intel-mpi/netcdf/lib/libnetcdf.so.15 (0x00007f77a18a9000)
        libnetcdff.so.7 => /usr/local/netcdf-intel-mpi/netcdf/lib/libnetcdff.so.7 (0x00007f77a1409000)
        libpnetcdf.so.4 => /usr/local/netcdf-intel-mpi/pnetcdf/lib/libpnetcdf.so.4 (0x00007f77a0a3b000)
        libmpifort.so.12 => /opt/impi/5.0.1.035/intel64/lib/libmpifort.so.12 (0x00007f77a07af000)
        libmpi.so.12 => /opt/impi/5.0.1.035/intel64/lib/libmpi.so.12 (0x00007f779fde3000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000000318ac00000)
        librt.so.1 => /lib64/librt.so.1 (0x000000318b800000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000000318a800000)
        libm.so.6 => /lib64/libm.so.6 (0x000000318b000000)
        libc.so.6 => /lib64/libc.so.6 (0x000000318a400000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000318d800000)
        libhdf5_hl.so.8 => /usr/local/netcdf-intel-mpi/hdf5/lib/libhdf5_hl.so.8 (0x00007f779fb71000)
        libhdf5.so.8 => /usr/local/netcdf-intel-mpi/hdf5/lib/libhdf5.so.8 (0x00007f779f595000)
        libsz.so.2 => /usr/local/netcdf-intel-mpi/szip/lib/libsz.so.2 (0x00007f779f376000)
        libz.so.1 => /lib64/libz.so.1 (0x000000318b400000)
        libifport.so.5 => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libifport.so.5 (0x00007f779f147000)
        libifcore.so.5 => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libifcore.so.5 (0x00007f779ee10000)
        libimf.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00007f779e954000)
        libsvml.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00007f779df8a000)
        libirc.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libirc.so (0x00007f779dd3b000)
        libirng.so => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00007f779db34000)
        libintlc.so.5 => /opt/intel/icc/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00007f779d8e6000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003189c00000)
"

After that, I recompiled and reran the model. Again I got the error message "rank 3 in job 11  node01_44414   caused collective abort of all ranks  exit status of rank 3: killed by signal 11". The error seems to appear at the point where the calculations should begin.
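A side note on the ldd commands quoted in this post: they are written with single quotes, whereas the original suggestion used backticks. Assuming the single quotes were typed literally (the forum may simply have changed the characters), the shell would pass the text on instead of the path. A minimal sketch of the difference, using sh as a stand-in executable:

```shell
# Single quotes keep the text literal: this is just the string "which sh".
literal='which sh'

# Backticks or $(...) run the command and substitute its output (a path).
resolved=$(command -v sh)

echo "literal:  $literal"
echo "resolved: $resolved"

# ldd needs the resolved path, so the intended call has this form.
ldd "$resolved" || echo "ldd could not inspect $resolved"
```

The same pattern applies to the model binary, for example ldd "$(command -v cice)".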

FYI: I don't know if this information is helpful, but I still want to mention it. About two to three years ago, this server was used to run MITgcm (another numerical model) successfully, also with the Intel compiler and IMPI. Since then, the server hasn't been used to run models in parallel. I switched to the MITgcm directory and typed "which mitgcmuv". This time I got "./mitgcmuv". When I typed "ldd 'which mitgcmuv'", I also got results like "

mitgcmuv:
        linux-vdso.so.1 =>  (0x00007ffd1a797000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000000318ac00000)
        libmpi_dbg.so.4 => /opt/impi/5.0.1.035/intel64/lib/libmpi_dbg.so.4 (0x00007fef40ee5000)
        libmpigf.so.4 => /opt/impi/5.0.1.035/intel64/lib/libmpigf.so.4 (0x00007fef40c59000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000000318a800000)
        librt.so.1 => /lib64/librt.so.1 (0x000000318b800000)
        libm.so.6 => /lib64/libm.so.6 (0x000000318b000000)
        libc.so.6 => /lib64/libc.so.6 (0x000000318a400000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000318d800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003189c00000)
"

I even tried rerunning the MITgcm case and it succeeded. But I just can't get CICE to run successfully, which is really frustrating.

 

Hearns__John
Beginner

I would try running the code with 2 processes but only on the login node, i.e. run it in an interactive session with -np 2

Also set I_MPI_DEBUG=5  before you run the code

 

The next step, if that works, is to run between two compute nodes
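As a rough sketch of that interactive test: the helper name run_local is hypothetical, and ./cice is assumed to be the executable in the current case directory. The helper only prints the command when mpirun is not installed, so it is safe to try anywhere.

```shell
# Raise Intel MPI's startup verbosity; the variable is the one named above.
export I_MPI_DEBUG=5

# Hypothetical helper: run an executable on N local ranks, or just print
# the command when mpirun is not available in this environment.
run_local() {
    nranks=$1
    exe=$2
    if command -v mpirun >/dev/null 2>&1; then
        mpirun -np "$nranks" "$exe"
    else
        echo "mpirun -np $nranks $exe"
    fi
}
```

Calling run_local 2 ./cice from the case directory would then correspond to the 2-process run on the login node.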

PrasanthD_intel
Moderator

Hi Wang,

Could you please provide the GitHub link of the CICE code you are working with, so that we can reproduce the issue on our side?

Also, please provide the steps you are following and the environment details (like OS, compiler, hardware) of your system.

 

Thanks

Prasanth

xiong__wang
Beginner

Hey John. 

I am currently using a small server. Here is what I got from 'lscpu':

"

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                112
On-line CPU(s) list:   0-111
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E7-4830 v4 @ 2.00GHz
Stepping:              1
CPU MHz:               2001.000
BogoMIPS:              3999.91
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13,56-69
NUMA node1 CPU(s):     14-27,70-83
NUMA node2 CPU(s):     28-41,84-97
NUMA node3 CPU(s):     42-55,98-111

"

Below is what I got from 'pbsnodes'

"

node01
     state = free
     np = 112
     ntype = cluster
     status = rectime=1584627510,varattr=,jobs=,state=free,netload=1049588601,gres=,loadave=62.00,ncpus=112,physmem=264605084kb,availmem=309521516kb,totmem=331724648kb,idletime=687823,nusers=4,nsessions=8,sessions=11365 11369 42732 42765 42843 59303 90588 101768,uname=Linux node01 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

node02
     state = free
     np = 36
     ntype = cluster
     status = rectime=1584627470,varattr=,jobs=,state=free,netload=465920164,gres=,loadave=22.01,ncpus=36,physmem=132250420kb,availmem=192346396kb,totmem=199369952kb,idletime=1483530,nusers=3,nsessions=4,sessions=5401 5405 25630 30630,uname=Linux node02 2.6.32-573.el6.x86_64 #1 SMP Thu Jul 23 15:44:03 UTC 2015 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

"

xiong__wang
Beginner

Hello John.

I followed your advice and tried to run the job in an interactive session with 2 processes.

First I used the setting "I_MPI_DEBUG=5", but I got almost the same result as in the previous run (which is weird, I think): "Finished writing ./history/iceh_ic.1998-01-01-00000.nc APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)".

After it finished writing the initial condition file, the job terminated and the computation didn't start. No further error message was given.

So I increased the debug level to 20. But apart from some MPI-related messages that I am not very familiar with, I still haven't found any problem that might cause this run-time error.
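One thing that may be worth ruling out, although nothing in this thread confirms it is the cause here: Fortran codes built with the Intel compiler keep automatic arrays and temporaries on the stack by default, so a small stack limit can produce a segmentation fault as soon as the main computation starts. A minimal sketch for checking and raising the limit in the shell that launches the job:

```shell
# Show the current soft stack limit (in kB, or "unlimited").
ulimit -s

# Try to raise it for this shell and its children; report the hard limit
# if the system does not allow the change.
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit (hard limit: $(ulimit -H -s))"

echo "stack limit now: $(ulimit -s)"
```

The limit has to be set in the same shell or batch script that starts mpirun. If the job script is a csh script, the equivalent command is limit stacksize unlimited.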

I also uploaded these two logs in case you want to see them. The one named "run-output-dubug5" is the log from the run with "I_MPI_DEBUG=5", and "run-output-dubug20" is the log from the run with "I_MPI_DEBUG=20".

Thanks again.

Have a good day!

xiong__wang
Beginner

Hello Prasanth. Thank you for your reply.

1. This is the main page of CICE project on GitHub.

https://github.com/CICE-Consortium/CICE

2. This is the CICE version index page. What I am currently trying to run is CICE Version 6.1.0

https://github.com/CICE-Consortium/CICE/wiki/CICE-Version-Index

3. Icepack version index page. Icepack is one part of the CICE project. The Icepack code needs to be placed in the CICE code directory before compiling and running. CICE Version 6.1.0 corresponds to Icepack Version 1.2.0

https://github.com/CICE-Consortium/Icepack/wiki/Icepack-Version-Index

4. Some forcing data also needs to be downloaded in order to do test runs.

https://github.com/CICE-Consortium/CICE/wiki/CICE-Input-Data

https://github.com/CICE-Consortium/Icepack/wiki/Icepack-Input-Data

5. CICE documentation for CICE Version 6.1.0

https://cice-consortium-cice.readthedocs.io/en/cice6.1.0/

 

Here are the steps for CICE compile and run.

1) Once the CICE and Icepack code and the forcing data are downloaded, uncompress them.

Copy the Icepack code into the icepack directory under the CICE directory.

2) Porting (tells the cice model about the compiler information on this server)

this includes:

cd to configuration/scripts/machines/

Copy an existing env and a Macros file to new names for your new machine

Edit your env and Macros files

cd .. to configuration/scripts/

Edit the cice.batch.csh script to add a section for your machine with batch settings

Edit the cice.batch.csh script to add a section for your machine with job launch settings

Change the variable ICE_MACHINE_INPUTDATA in the env file according to where you put the forcing data.

3) setup the CICE run directory

cd to the CICE main directory

./cice.setup -c ~/mycase1 -g gx3 -m testmachine  (gx3 selects the gx3 grid rather than the gx1 grid; replace "testmachine" with the name you chose for your env and Macros files in step 2)

4) Compile the code

cd ~/mycase1 ("~/mycase1" is the case directory specified by the -c option in step 3)

./cice.build

5) Run the model

./cice.submit
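Steps 3) to 5) can be collected into one small script. This is only a sketch under the assumptions stated in the comments: it must be started from the CICE main directory, and the step helper and the DRY_RUN variable are made up for illustration. By default it only prints the commands.

```shell
# Assumptions: run from the CICE main directory; the case directory is
# ~/mycase1 and "testmachine" is the machine name chosen in step 2.
CASE_DIR="$HOME/mycase1"
MACHINE="testmachine"
GRID="gx3"

# Hypothetical helper: print each command, and execute it only when
# DRY_RUN=0 is set in the environment.
step() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

step ./cice.setup -c "$CASE_DIR" -g "$GRID" -m "$MACHINE"   # step 3
step cd "$CASE_DIR"                                          # step 4
step ./cice.build
step ./cice.submit                                           # step 5
```

Running it with DRY_RUN=0 executes the commands in order and stops at the first one that fails.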

 

The server is installed with CentOS release 6.9 (Final) and kernel version 2.6.32-696.el6.x86_64.

The IMPI version is : Intel(R) MPI Library, Version 5.0 Update 1  Build 20140709

Intel Compiler Version:  Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.1.163 Build 20130313

Below is the hardware and node information:

"

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                112
On-line CPU(s) list:   0-111
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E7-4830 v4 @ 2.00GHz
Stepping:              1
CPU MHz:               2001.000
BogoMIPS:              3999.91
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13,56-69
NUMA node1 CPU(s):     14-27,70-83
NUMA node2 CPU(s):     28-41,84-97
NUMA node3 CPU(s):     42-55,98-111

"

——————————————————————————————————————————————————

"

node01
     state = free
     np = 112
     ntype = cluster
     status = rectime=1584627510,varattr=,jobs=,state=free,netload=1049588601,gres=,loadave=62.00,ncpus=112,physmem=264605084kb,availmem=309521516kb,totmem=331724648kb,idletime=687823,nusers=4,nsessions=8,sessions=11365 11369 42732 42765 42843 59303 90588 101768,uname=Linux node01 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

node02
     state = free
     np = 36
     ntype = cluster
     status = rectime=1584627470,varattr=,jobs=,state=free,netload=465920164,gres=,loadave=22.01,ncpus=36,physmem=132250420kb,availmem=192346396kb,totmem=199369952kb,idletime=1483530,nusers=3,nsessions=4,sessions=5401 5405 25630 30630,uname=Linux node02 2.6.32-573.el6.x86_64 #1 SMP Thu Jul 23 15:44:03 UTC 2015 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

"
 

Many thanks.

Have a nice day.

 

Hearns__John
Beginner

Here is another suggestion.

Can you create a simple MPI 'Hello World' program?  Just Google for this.

Compile and link the program, then submit it as a batch job. Set I_MPI_DEBUG=5 in the script.

If a 'hello world' runs fine, we can see that the cluster is working OK

PrasanthD_intel
Moderator

Hi Wang,

As John suggested, please try to run a sample Hello World program.

You can compile using

mpiicc <Foo.c>

and run using

mpirun -np <number of processes> ./<a.out>

Please tell us whether you can run it successfully.
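For reference, a minimal sketch of such a test, written as a shell script that creates the C source in a scratch directory. The file name hello.c and the message text are arbitrary choices made here; the compile and run steps are skipped when the Intel MPI wrappers are not on PATH.

```shell
# Work in a scratch directory so nothing in the case directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a minimal MPI program: each rank prints its rank and the total count.
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Compile and run only if the Intel MPI wrappers are available.
if command -v mpiicc >/dev/null 2>&1 && command -v mpirun >/dev/null 2>&1; then
    mpiicc hello.c -o hello
    I_MPI_DEBUG=5 mpirun -np 2 ./hello
else
    echo "mpiicc/mpirun not found; wrote $workdir/hello.c only"
fi
```

If every rank prints its line, the MPI installation itself is probably working, and the problem is more likely specific to the CICE build or run setup.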

 

Thanks

Prasanth

PrasanthD_intel
Moderator

Hi Wang,

The IMPI version you are using, 5.0, is outdated and currently not supported.

Can you update to the latest IMPI version and check whether the error persists?

Also, if possible, could you try MPI libraries from other vendors and see whether you get any errors?

 

Thanks

Prasanth

 

PrasanthD_intel
Moderator

Hi Wang,

We are closing this thread, assuming your issue has been resolved.

Please raise a new thread for any further questions.

 

Regards

Prasanth
