Intel® MPI Library

Trouble checkpointing with Intel MPI using BLCR

hrscad_d_

I am trying to run checkpointing with BLCR using the Intel MPI Library 4.1.3.049. The MPI source code was compiled with the Intel mpicc compiler wrapper.

At run time I passed -ckpoint on -ckpointlib blcr (plus other options) to mpiexec.hydra. The checkpoints do get written, but the application crashes with a segfault right after the first checkpoint, once a multi-gigabyte checkpoint context file has been written to disk. The applications run to completion when launched without the checkpoint options. Checkpointing also works without problems on a single node with multiple MPI processes.

The command line I use to launch the jobs is:
mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 -ckpoint on -ckpointlib blcr -ckpoint-interval 300 ./MPIJob

What might be going wrong here?

Detailed outputs are given below:

# mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 -ckpoint on -ckpointlib blcr -ckpoint-interval 300 ./lmp_linux -var x 120 -var y 180 -var z 240 -in in.lj
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (201.552 302.327 403.103)
  2 by 3 by 4 processor grid
Created 20736000 atoms
Setting up run ...
Memory usage per processor = 242.749 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0         1.44   -6.7733681            0   -4.6133682   -5.0196693 
[proxy:0:0@node002] requesting checkpoint
[proxy:0:0@node002] checkpoint initiated...
[proxy:0:1@node003] requesting checkpoint
[proxy:0:1@node003] checkpoint initiated...
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

real    5m49.577s
user    60m8.869s
sys    0m7.911s
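
As a rough sanity check on the numbers in this log, here is a minimal Python sketch of the timing and size arithmetic. It rests on three assumptions the log does not confirm: that the 24 ranks are split evenly across the two nodes named in the proxy messages (node002 and node003), that the checkpoint interval timer starts at launch, and that the per-rank memory figure printed by the application roughly approximates what BLCR dumps per process.

```python
# Values copied from the command line and the log above.
ckpoint_interval_s = 300            # -ckpoint-interval 300
wall_s = 5 * 60 + 49.577            # real 5m49.577s
ranks = 24                          # -n 24
mem_per_rank_mb = 242.749           # "Memory usage per processor"
nodes = 2                           # assumption: only node002 and node003 are used

# Time between the first checkpoint request and the end of the job,
# assuming the interval timer starts when the job is launched.
after_ckpt_s = wall_s - ckpoint_interval_s

# Rough estimate of checkpoint volume, assuming each rank dumps
# about as much as its reported memory usage.
total_mb = ranks * mem_per_rank_mb
per_node_mb = total_mb / nodes

print(f"job ended about {after_ckpt_s:.1f} s after the first checkpoint request")
print(f"estimated checkpoint volume: {total_mb:.0f} MB total, {per_node_mb:.0f} MB per node")
```

Under those assumptions the job ended roughly 50 seconds into the first checkpoint cycle, and an estimate of a few gigabytes per node is in line with the multi-gigabyte context file that was written to disk.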

-------------------------------------------------------------------------------------------------------------------------------

The same job runs to completion when launched without the checkpointing options:

# mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 ./lmp_linux -var x 120 -var y 180 -var z 240 -in in.lj
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (201.552 302.327 403.103)
  2 by 3 by 4 processor grid
Created 20736000 atoms
Setting up run ...
Memory usage per processor = 242.749 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0         1.44   -6.7733681            0   -4.6133682   -5.0196693 
    1000   0.70393606   -5.6763694            0   -4.6204654   0.70387303 
Loop time of 1182.03 on 24 procs for 1000 steps with 20736000 atoms

Pair  time (%) = 947.357 (80.1466)
Neigh time (%) = 99.5461 (8.42162)
Comm  time (%) = 54.1397 (4.58023)
Outpt time (%) = 0.0121222 (0.00102554)
Other time (%) = 80.9755 (6.85055)

Nlocal:    864000 ave 864517 max 863234 min
Histogram: 1 1 0 2 3 5 3 4 2 3
Nghost:    152942 ave 153201 max 152593 min
Histogram: 1 2 0 2 4 3 3 4 3 2
Neighs:    3.23875e+07 ave 3.24491e+07 max 3.23351e+07 min
Histogram: 2 2 5 1 4 3 2 3 0 2

Total # of neighbors = 777300486
Ave neighs/atom = 37.4856
Neighbor list builds = 50
Dangerous builds = 0

real    19m46.883s
user    236m56.573s
sys    0m12.398s
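
The `time` figures from the two runs can be compared with a small Python sketch. It assumes that `time` only accounts for CPU used by processes on the node where mpiexec.hydra was launched, so ranks on the remote node are not included in `user`; the helper name `to_seconds` is just for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert the 'XmY.YYYs' fields printed by time into seconds."""
    return minutes * 60 + seconds

# Run without checkpointing (ran to completion).
ok_real = to_seconds(19, 46.883)
ok_user = to_seconds(236, 56.573)

# Run with checkpointing (ended with a segfault).
bad_real = to_seconds(5, 49.577)
bad_user = to_seconds(60, 8.869)

# user/real approximates the average number of busy cores on the launch node.
print(f"busy cores, no checkpoint:   {ok_user / ok_real:.1f}")
print(f"busy cores, with checkpoint: {bad_user / bad_real:.1f}")
```

A ratio close to 12 for the completed run would be consistent with 12 of the 24 ranks running on the launch node. The lower ratio of about 10 for the failed run fits the local ranks not computing for part of that run, for example while the checkpoint was being written, though the logs alone do not prove this.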
