hrscad_d_
Beginner

Trouble with checkpointing with Intel MPI using blcr

I am trying to run checkpointing with BLCR using the Intel MPI 4.1.3.049 library. The MPI source codes were compiled with the Intel mpicc compiler.

The jobs are launched with mpiexec.hydra -ckpoint on -ckpointlib blcr plus the other options shown below. The checkpoints do get written, but the application crashes with a segmentation fault right after the first checkpoint (after having written a multi-gigabyte checkpoint context file to disk). The applications run to completion when launched without the checkpoint options, and checkpointing also works fine when the job runs on a single node with multiple MPI processes; the crash only occurs on multi-node runs.

The command line I use to launch the jobs is:
mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 -ckpoint on -ckpointlib blcr -ckpoint-interval 300 ./MPIJob
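In case it helps with diagnosis, below is a sketch of the checks I can run before the multi-node job. The per-node BLCR module check and the /tmp/ckpt local path are assumptions about a typical BLCR cluster setup; -ckpoint-prefix is the documented Hydra option for redirecting where the checkpoint context files are written.

```shell
# 1. Verify the BLCR kernel modules are loaded on every node listed in ./nodes
#    (a node missing the modules would make cross-node checkpointing fail):
for host in $(cat ./nodes); do
    ssh "$host" 'lsmod | grep -q blcr && echo "$(hostname): blcr ok" \
                                      || echo "$(hostname): blcr MISSING"'
done

# 2. Re-run with the checkpoint files redirected to node-local disk rather
#    than a shared filesystem (/tmp/ckpt is an assumed local scratch path),
#    to rule out the shared-filesystem write as the trigger:
mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 \
    -ckpoint on -ckpointlib blcr -ckpoint-interval 300 \
    -ckpoint-prefix /tmp/ckpt ./MPIJob
```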

What might be going wrong here?

Detailed outputs are given below:

# mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 -ckpoint on -ckpointlib blcr -ckpoint-interval 300 ./lmp_linux -var x 120 -var y 180 -var z 240 -in in.lj

Lattice spacing in x,y,z = 1.6796 1.6796 1.6796

Created orthogonal box = (0 0 0) to (201.552 302.327 403.103)

  2 by 3 by 4 processor grid
Created 20736000 atoms
Setting up run ...
Memory usage per processor = 242.749 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0         1.44   -6.7733681            0   -4.6133682   -5.0196693 
[proxy:0:0@node002] requesting checkpoint
[proxy:0:0@node002] checkpoint initiated...
[proxy:0:1@node003] requesting checkpoint
[proxy:0:1@node003] checkpoint initiated...
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

real    5m49.577s
user    60m8.869s
sys    0m7.911s

-------------------------------------------------------------------------------------------------------------------------------

Running the same job without checkpointing results in the job running to completion: 

# mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 ./lmp_linux -var x 120 -var y 180 -var z 240 -in in.lj
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (201.552 302.327 403.103)
  2 by 3 by 4 processor grid
Created 20736000 atoms
Setting up run ...
Memory usage per processor = 242.749 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0         1.44   -6.7733681            0   -4.6133682   -5.0196693 
    1000   0.70393606   -5.6763694            0   -4.6204654   0.70387303 
Loop time of 1182.03 on 24 procs for 1000 steps with 20736000 atoms

Pair  time (%) = 947.357 (80.1466)
Neigh time (%) = 99.5461 (8.42162)
Comm  time (%) = 54.1397 (4.58023)
Outpt time (%) = 0.0121222 (0.00102554)
Other time (%) = 80.9755 (6.85055)

Nlocal:    864000 ave 864517 max 863234 min
Histogram: 1 1 0 2 3 5 3 4 2 3
Nghost:    152942 ave 153201 max 152593 min
Histogram: 1 2 0 2 4 3 3 4 3 2
Neighs:    3.23875e+07 ave 3.24491e+07 max 3.23351e+07 min
Histogram: 2 2 5 1 4 3 2 3 0 2

Total # of neighbors = 777300486
Ave neighs/atom = 37.4856
Neighbor list builds = 50
Dangerous builds = 0

real    19m46.883s
user    236m56.573s
sys    0m12.398s
