There is an issue we have been facing for the past few months.
We previously ran our simulations with a C code. It ran successfully on up to 108 nodes (each node has 16 cores), but we could never get it to run on more than 108 nodes.
I am now using a Fortran 90 code (a direct Fortran version of the C code above, with the same functionality), which runs successfully even on 256 nodes, i.e. 4096 processors, but only up to a point. When I write binary data from each individual processor, errors appear after some number of processors have written their data. The data is written correctly when I use only 108 nodes. If I do not write any output, the Fortran code runs correctly even on 256 nodes.
The errors start only when the data write process begins. The code then aborts after some data is written.
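To make the failing step concrete, here is a minimal sketch of the kind of per-processor binary write described above. It is not our production code: the file name pattern, unit number, array size, and data are hypothetical placeholders, and it assumes one unformatted file per MPI rank. Something this small could also serve as a standalone test of whether the file system and MPI fabric cope with 4096 ranks writing at once.

```fortran
! Minimal sketch (hypothetical names and sizes): each MPI rank writes
! its own unformatted binary file and checks the I/O status.
program per_rank_write
  use mpi
  implicit none
  integer, parameter :: n = 1000            ! hypothetical local array size
  integer, parameter :: unit_no = 20        ! hypothetical unit number
  integer :: ierr, rank, ios
  double precision :: local_data(n)
  character(len=32) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  local_data = dble(rank)                   ! placeholder data

  ! One file per rank, for example out_000272.bin
  write(fname, '(a,i6.6,a)') 'out_', rank, '.bin'

  open(unit=unit_no, file=trim(fname), form='unformatted', &
       status='replace', action='write', iostat=ios)
  if (ios /= 0) then
     write(*,*) 'rank ', rank, ': open failed, iostat = ', ios
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  write(unit_no, iostat=ios) local_data
  if (ios /= 0) then
     write(*,*) 'rank ', rank, ': write failed, iostat = ', ios
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if
  close(unit_no)

  call MPI_Finalize(ierr)
end program per_rank_write
```

Checking `iostat` on both `open` and `write` at least shows whether a failure comes from the Fortran I/O layer on a given rank or from the MPI layer, as the log below suggests.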
The machine has dual Intel Xeon E5-2670 8-core processors at 2.6 GHz per node and runs Linux. The Intel compiler suite is intel-cluster-studio-2013. We use Intel MPI 4.1.0.024, and the MPI Fortran compiler wrapper is mpiifort. In our PBS script we use mpirun -np 4096 ./exename > outpur.txt
Some of the errors are listed below:
[272:cn0719.cmmacs.ernet.in] unexpected disconnect completion event from [400:cn0727.cmmacs.ernet.in]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
Application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
[proxy:0:0@cn0175] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:73): one of the processes terminated badly; aborting
[proxy:0:0@cn0175] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[proxy:0:0@cn0175] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:1476): bootstrap server returned error waiting for completion
[proxy:0:0@cn0175] main (./pm/pmiserv/pmip.c:392): error waiting for event children completion
[mpiexec@cn0175] control_cb (./pm/pmiserv/pmiserv_cb.c:674): assert (!closed) failed
[mpiexec@cn0175] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@cn0175] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@cn0175] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion
Can someone tell me what next steps I should follow?