Dear all,
I am compiling several codes (details at the end) with Intel Cluster Studio 2013 for Linux (C and Fortran compilers, MKL BLACS, and the MKL FFTW3 interface) together with Intel MPI 4.0.3.008. The programs run without problems on a single compute node, but they crash as soon as I use more than one node.
I have gathered as much information as possible about the execution and the MPI calls using these mpirun options: -v -check_mpi -genv I_MPI_DEBUG 5. The resulting output is in the attached files.
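For readers who want to reproduce this kind of diagnostic run, here is a minimal sketch that assembles the command line with the three options mentioned above. The executable path `./vasp`, the process count of 24 (the logs only show that a rank 23 exists), and passing the count with `-n` are assumptions for illustration, not details taken from the attached files.

```python
import shlex


def build_debug_command(executable, nprocs, debug_level=5):
    """Return an mpirun argument list using the diagnostic options from the post."""
    return [
        "mpirun",
        "-v",                                       # verbose launcher output
        "-check_mpi",                               # MPI correctness checking
        "-genv", "I_MPI_DEBUG", str(debug_level),   # Intel MPI debug level
        "-n", str(nprocs),                          # assumption: process count given with -n
        executable,                                 # assumption: placeholder executable path
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without MPI installed.
    print(shlex.join(build_debug_command("./vasp", 24)))
```

Printing the command rather than launching it keeps the sketch usable on a machine without Intel MPI; the list can be handed to `subprocess.run` on a real cluster.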
The relevant information is at the end of each file:
from vasp.log:
[23] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[23] ERROR: Fatal signal 11 (SIGSEGV) raised.
[23] ERROR: Signal was encountered at:
[23] ERROR: hamil_mp_hamiltmu_ (/home/ivasan/programas/VASP/vasp.5.3_test/vasp)
[23] ERROR: After leaving:
[23] ERROR: mpi_allreduce_(*sendbuf=0x7fff5d1ce340, *recvbuf=0x18e19c0, count=1, datatype=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=0xffffffffc4060000 CART_SUB CART_CREATE CART_SUB CART_CREATE COMM_WORLD [18:23], *ierr=0x7fff5d1ce2ac->MPI_SUCCESS)
from abinit.log:
[23] ERROR: LOCAL:MPI:CALL_FAILED: error
[23] ERROR: Null communicator.
[23] ERROR: Error occurred at:
[23] ERROR: mpi_comm_rank_(comm=MPI_COMM_NULL, *rank=0x29319b8, *ierr=0x7fff83fabb74)
[23] ERROR: initmpi_grid_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/51_manage_mpi/initmpi_grid.F90:178)
[23] ERROR: invars1_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/invars1.F90:1015)
[23] ERROR: invars1m_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/invars1m.F90:186)
[23] ERROR: m_ab6_invars_mp_ab6_invars_load_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/m_ab6_invars_f90.F90:548)
[23] ERROR: MAIN__ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/98_main/abinit.F90:260)
[23] ERROR: main (/home/ivasan/programas/abinit/abinit-6.12.3b/bin/abinit)
[23] ERROR: (/lib64/libc-2.5.so)
[23] ERROR: (/home/ivasan/programas/abinit/abinit-6.12.3b/bin/abinit)
So in both cases the problems seem to be related to MPI.
What can I do to solve these errors?
Thanks in advance for your help.
Iván
CODES:
- VASP V5.3.2 (http://www.vasp.at/). I posted this issue at the support forum: http://cms.mpi.univie.ac.at/vasp-forum/forum_viewtopic.php?3.12037
- Abinit V6.12.3 (http://www.abinit.org/). I posted this issue at the support forum: http://forum.abinit.org/viewtopic.php?f=3&t=1851
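Since the useful part of each log is the block of `[rank] ERROR:` lines at the end, a small script can pull out, per rank, the headline error and the first MPI call named in the trace. This is only a sketch: the line format is inferred from the excerpts quoted in this post, and the function name `summarize_errors` is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "[23] ERROR: Null communicator."
LINE_RE = re.compile(r"^\[(\d+)\] ERROR: (.*)$")
# Matches Fortran MPI bindings as printed in the logs, e.g. "mpi_comm_rank_(".
CALL_RE = re.compile(r"\b(mpi_\w+?)_\(")


def summarize_errors(log_text):
    """Group '[rank] ERROR: ...' lines by rank; report the headline and first MPI call."""
    per_rank = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            per_rank[int(match.group(1))].append(match.group(2))

    summary = {}
    for rank, messages in per_rank.items():
        mpi_call = None
        for message in messages:
            found = CALL_RE.search(message)
            if found:
                mpi_call = found.group(1)
                break
        summary[rank] = {"headline": messages[0], "mpi_call": mpi_call}
    return summary


if __name__ == "__main__":
    sample = """\
[23] ERROR: LOCAL:MPI:CALL_FAILED: error
[23] ERROR: Null communicator.
[23] ERROR: Error occurred at:
[23] ERROR: mpi_comm_rank_(comm=MPI_COMM_NULL, *rank=0x29319b8, *ierr=0x7fff83fabb74)
"""
    for rank, info in summarize_errors(sample).items():
        print(rank, info["headline"], info["mpi_call"])
```

Run on the two excerpts above, it would point at `mpi_allreduce` for VASP and `mpi_comm_rank` for Abinit, which narrows down where to look in each code.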
Hi Ivan,
I am having the same problem as you when running the UM model. I compiled the model with Intel® MPI, but at run time I get the following error:
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
libc.so.6 0000003488A30215 Unknown Unknown Unknown
libc.so.6 0000003488A31CC0 Unknown Unknown Unknown
N216L85.exe 000000000040A416 Unknown Unknown Unknown
N216L85.exe 000000000044D172 Unknown Unknown Unknown
libmpi.so.4 00002B8DB34627FA Unknown Unknown Unknown
libmpi.so.4 00002B8DB3386661 Unknown Unknown Unknown
libmpigf.so.4 00002B8DB393D279 Unknown Unknown Unknown
N216L85.exe 000000000175EADD Unknown Unknown Unknown
N216L85.exe 0000000001058602 Unknown Unknown Unknown
N216L85.exe 000000000106DFEA Unknown Unknown Unknown
N216L85.exe 0000000000B91CFD Unknown Unknown Unknown
N216L85.exe 000000000089B3FA Unknown Unknown Unknown
N216L85.exe 00000000004E1F04 Unknown Unknown Unknown
N216L85.exe 000000000048548F Unknown Unknown Unknown
N216L85.exe 000000000040D76B Unknown Unknown Unknown
N216L85.exe 0000000000404C7C Unknown Unknown Unknown
N216L85.exe 0000000000404BAC Unknown Unknown Unknown
libc.so.6 0000003488A1D974 Unknown Unknown Unknown
N216L85.exe 0000000000404AB9 Unknown Unknown Unknown
send desc error
send desc error
[11] Abort: Got completion with error 12, vendor code=81, dest rank=
at line 870 in file ../../ofa_poll.c
[9] Abort: Got completion with error 12, vendor code=81, dest rank=
at line 870 in file ../../ofa_poll.c
I need a solution to this error. Please suggest some ways to resolve it.
Thanks in advance for your help.
Somanath Moharana
Dear Somanath,
According to your attached file, your problem seems to be related to shared libraries that are not found on the compute nodes:
pbs_demux: error while loading shared libraries: libtorque.so.2: cannot open shared object file: No such file or directory
Make sure that the libraries you need are accessible on all the nodes you use (check the LD_LIBRARY_PATH in your profile). In addition, it seems that you are using PBS/Torque, so the problem could be due to poor integration between Intel MPI and PBS/Torque.
Anyway, check this post, where I explained in more detail what I did to solve my problem:
http://software.intel.com/en-us/forums/topic/370967
Regards,
Ivan
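The LD_LIBRARY_PATH check suggested in this reply can be sketched as a small script to run on each compute node. It only scans the directories listed in LD_LIBRARY_PATH; the dynamic loader also consults other locations (for example the system loader cache), so a miss here is a hint rather than proof. The function name `find_library` is hypothetical, and how the script is launched on every node is left open.

```python
import os
import socket


def find_library(name, search_path=None):
    """Return the directories of an LD_LIBRARY_PATH-style string that contain `name`."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing, or doubled separators.
        if directory and os.path.isfile(os.path.join(directory, name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    library = "libtorque.so.2"  # the library named in the error message above
    found = find_library(library)
    host = socket.gethostname()
    if found:
        print(f"{host}: {library} found in {found}")
    else:
        print(f"{host}: {library} NOT found in LD_LIBRARY_PATH")
```

Comparing the output across nodes shows quickly whether the environment differs between the node where the job is submitted and the nodes where it runs.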
Hi Ivan,
Thanks for your suggestion. I tried to run the UM model without PBS/Torque, but I am facing the same problem as before:
N216L85.exe 000000000044D172 Unknown Unknown Unknown
libmpi.so.4 00002B11F5A697FA Unknown Unknown Unknown
libmpi.so.4 00002B11F598D661 Unknown Unknown Unknown
libmpigf.so.4 00002B11F5F44279 Unknown Unknown Unknown
N216L85.exe 000000000175EADD Unknown Unknown Unknown
N216L85.exe 0000000001059A61 Unknown Unknown Unknown
N216L85.exe 000000000106DFEA Unknown Unknown Unknown
N216L85.exe 0000000000B91CFD Unknown Unknown Unknown
N216L85.exe 000000000089B3FA Unknown Unknown Unknown
N216L85.exe 00000000004E1F04 Unknown Unknown Unknown
N216L85.exe 000000000048548F Unknown Unknown Unknown
N216L85.exe 000000000040D76B Unknown Unknown Unknown
N216L85.exe 0000000000404C7C Unknown Unknown Unknown
N216L85.exe 0000000000404BAC Unknown Unknown Unknown
libc.so.6 000000354F81D974 Unknown Unknown Unknown
N216L85.exe 0000000000404AB9 Unknown Unknown Unknown
[7:compute-0-11.local] unexpected disconnect completion event from [2:compute-0-5.local]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 7
[6:compute-0-11.local] unexpected disconnect completion event from [2:compute-0-5.local]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 6
[11:compute-0-17.local] unexpected disconnect completion event from [2:compute-0-5.local]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 11
I think it is the same memory issue that you described before, but I have not found a solution for that error.
Kind Regards,
Somanath Moharana
