I am trying to run checkpointing with BLCR using the Intel MPI 4.1.3.049 library. I compiled the MPI source codes using the Intel mpicc compiler.
To run them, I used mpiexec.hydra with -ckpoint on -ckpointlib blcr and other options. The checkpoints do get written, but the application crashes with a segfault right after the first checkpoint (after a multi-gigabyte checkpoint context file has been written to disk). The applications run perfectly to completion when I run them without the checkpoint options.
The command-line options I use to launch the jobs are:
mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 -ckpoint on -ckpointlib blcr -ckpoint-interval 300 ./MPIJob
Detailed outputs are given below:
mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 -ckpoint on -ckpointlib blcr -ckpoint-interval 300 ./lmp_linux -var x 120 -var y 180 -var z 240 -in in.lj
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (201.552 302.327 403.103)
2 by 3 by 4 processor grid
Created 20736000 atoms
Setting up run ...
Memory usage per processor = 242.749 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133682 -5.0196693
[proxy:0:0@enode007] requesting checkpoint
[proxy:0:0@enode007] checkpoint initiated...
[proxy:0:1@enode008] requesting checkpoint
[proxy:0:1@enode008] checkpoint initiated...
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
real 5m49.577s
user 60m8.869s
sys 0m7.911s
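As a rough sanity check on the numbers in this run, here is a small Python sketch. It only does arithmetic on values copied from the output above. Two things are assumptions on my part and not confirmed by the logs: that the first checkpoint is triggered 300 s after launch (the -ckpoint-interval value), and that the checkpoint context files contain at least the per-process memory that LAMMPS reports.

```python
# Figures copied from the command line and output above.
ckpt_interval_s = 300            # -ckpoint-interval 300
real_s = 5 * 60 + 49.577         # "real 5m49.577s"
ranks = 24                       # -n 24
mem_per_rank_mb = 242.749        # "Memory usage per processor = 242.749 Mbytes"

# Time between the first checkpoint trigger and the end of the job,
# assuming the trigger fires ckpt_interval_s after launch (assumption).
after_ckpt_s = real_s - ckpt_interval_s

# Sum of the LAMMPS-reported memory over all ranks. Treated here as a rough
# lower bound on the checkpointed state (assumption, not measured).
total_mb = ranks * mem_per_rank_mb

print(f"job ended about {after_ckpt_s:.1f} s after the first checkpoint trigger")
print(f"LAMMPS-reported memory across all ranks: {total_mb:.0f} Mbytes")
```

Under those assumptions the job dies within a minute of the first checkpoint being requested, and the combined per-rank memory is in the multi-gigabyte range, which matches the size of the context files I see on disk.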
-------------------------------------------------------------------------------------------------------------------------------
Running the same job without checkpointing results in the job running to completion:
mpiexec.hydra -genv I_MPI_FABRICS shm:ofa -machinefile ./nodes -n 24 ./lmp_linux -var x 120 -var y 180 -var z 240 -in in.lj
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (201.552 302.327 403.103)
2 by 3 by 4 processor grid
Created 20736000 atoms
Setting up run ...
Memory usage per processor = 242.749 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133682 -5.0196693
1000 0.70393606 -5.6763694 0 -4.6204654 0.70387303
Loop time of 1182.03 on 24 procs for 1000 steps with 20736000 atoms
Pair time (%) = 947.357 (80.1466)
Neigh time (%) = 99.5461 (8.42162)
Comm time (%) = 54.1397 (4.58023)
Outpt time (%) = 0.0121222 (0.00102554)
Other time (%) = 80.9755 (6.85055)
Nlocal: 864000 ave 864517 max 863234 min
Histogram: 1 1 0 2 3 5 3 4 2 3
Nghost: 152942 ave 153201 max 152593 min
Histogram: 1 2 0 2 4 3 3 4 3 2
Neighs: 3.23875e+07 ave 3.24491e+07 max 3.23351e+07 min
Histogram: 2 2 5 1 4 3 2 3 0 2
Total # of neighbors = 777300486
Ave neighs/atom = 37.4856
Neighbor list builds = 50
Dangerous builds = 0
real 19m46.883s
user 236m56.573s
sys 0m12.398s