I'm developing an MPI application that relies heavily on MPI shared memory. Recently, I keep hitting the following error messages:
srun: error: compute-42-013: task 32: Bus error
srun: Terminating job step 324080.0
slurmstepd: error: *** STEP 324080.0 ON compute-42-012 CANCELLED AT 2020-06-14T04:17:51 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
pVelodyne_intel_4 000000000C8E308E Unknown Unknown Unknown
libpthread-2.17.s 00002B370FBBA5D0 Unknown Unknown Unknown
pVelodyne_intel_4 000000000397721B PMPIDI_CH3I_Progr 1040 ch3_progress.c
pVelodyne_intel_4 00000000039FC370 MPIC_Wait 269 helper_fns.c
pVelodyne_intel_4 00000000039FD83A MPIC_Sendrecv 580 helper_fns.c
pVelodyne_intel_4 000000000392F61B MPIR_Allgather_in 257 allgather.c
pVelodyne_intel_4 0000000003931752 MPIR_Allgather 858 allgather.c
pVelodyne_intel_4 0000000003931A77 MPIR_Allgather_im 905 allgather.c
pVelodyne_intel_4 0000000003933226 PMPI_Allgather 1068 allgather.c
pVelodyne_intel_4 000000000392CECE Unknown Unknown Unknown
srun: error: compute-41-006: task 16: Bus error
srun: Terminating job step 324024.0
slurmstepd: error: *** STEP 324024.0 ON compute-41-006 CANCELLED AT 2020-06-13T16:54:13 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
pVelodyne_intel_4 000000000C85058E Unknown Unknown Unknown
libpthread-2.17.s 00002AEBC007A5D0 Unknown Unknown Unknown
pVelodyne_intel_4 00000000038E46DB PMPIDI_CH3I_Progr 1040 ch3_progress.c
pVelodyne_intel_4 0000000003969830 MPIC_Wait 269 helper_fns.c
pVelodyne_intel_4 000000000396ACFA MPIC_Sendrecv 580 helper_fns.c
pVelodyne_intel_4 00000000038BA379 MPIR_Alltoall_int 438 alltoall.c
pVelodyne_intel_4 00000000038BBE3D MPIR_Alltoall 734 alltoall.c
pVelodyne_intel_4 00000000038BC162 MPIR_Alltoall_imp 775 alltoall.c
pVelodyne_intel_4 00000000038BD875 PMPI_Alltoall 958 alltoall.c
It seems the bus error occurs inside the MPI routines. Since I do not have the source code of Intel MPI, I have no idea what went wrong.
The Intel MPI version I'm using is intel_parallel_studio/2018u4/compilers_and_libraries_2018.5.274.
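For reference, the way I use MPI shared memory is roughly equivalent to the C sketch below (my application is actually Fortran, and the names and sizes here are purely illustrative, not my real code):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that live on the same node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* One rank per node allocates the whole dataset; the others attach
       with a zero-byte contribution (the size here is illustrative). */
    MPI_Aint local_bytes = (node_rank == 0) ? (MPI_Aint)1 << 33 : 0;

    double *base = NULL;
    MPI_Win win;
    MPI_Win_allocate_shared(local_bytes, sizeof(double), MPI_INFO_NULL,
                            node_comm, &base, &win);

    /* Every rank gets a pointer to rank 0's segment. */
    MPI_Aint seg_size;
    int disp_unit;
    double *data = NULL;
    MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &data);

    /* ... read/write the shared dataset, mixed with MPI_Allgather /
       MPI_Alltoall calls on MPI_COMM_WORLD ... */

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```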
Any idea how to fix it?
Thanks.
- Tags:
- Cluster Computing
- General Support
- Intel® Cluster Ready
- Message Passing Interface (MPI)
- Parallel Computing
Hi,
Since you got a bus error, can you check the memory allocations of your program, i.e., whether the memory allocated exceeds the system's memory? A minimal sketch of such a check is included at the end of this reply.
Also, there is a limit on the number of communicators in MPI, which is around 32,000. This means the maximum number of windows you can create is about 32,000.
It looks like the program was terminated by srun, but from the given trace we are not sure why.
Can you provide reproducer code so that we can debug it on our side?
Can you check your code using ITAC (for analysis) and Intel Inspector (for memory errors)?
For more info on using these tools, refer to this thread: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/623887
Also, we suggest you upgrade to the latest version (2019u7) and check whether the error persists.
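Here is the kind of memory check we mean, as a minimal sketch that could be run on each node before the windows are allocated (the MemAvailable parsing and the requested_bytes value are illustrative placeholders, not taken from your application):

```c
#include <stdio.h>

/* Return MemAvailable from /proc/meminfo in bytes, or -1 if it cannot be read. */
static long long mem_available_bytes(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;

    char line[256];
    long long kb = -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "MemAvailable: %lld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return (kb < 0) ? -1 : kb * 1024LL;
}

int main(void)
{
    /* Illustrative placeholder: total bytes the ranks of this node will
       put into shared-memory windows. */
    long long requested_bytes = 8LL * 1024 * 1024 * 1024;

    long long avail = mem_available_bytes();
    if (avail >= 0 && requested_bytes > avail)
        fprintf(stderr, "Warning: requesting %lld bytes, only %lld available\n",
                requested_bytes, avail);
    return 0;
}
```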
Regards
Prasanth
In my application, I allocate several (fewer than 10) huge MPI shared-memory segments to hold the datasets. It is possible that memory ran out, since I do not see this kind of bus error when the datasets are smaller.
Here is my next question: why didn't the code fail during the memory allocation, for example with an OOM kill? In the code, I actually check every memory allocation to make sure it succeeded, and I initialize the memory to zero. If memory ran out, I would assume the Linux system would kill the process, right?
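The check I do after each allocation is essentially the following sketch (the function and variable names are illustrative, not my actual code):

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Illustrative: 'err' and 'base' come from the preceding
   MPI_Win_allocate_shared call, 'bytes' is the size requested on this rank. */
static void check_and_zero(int err, void *base, MPI_Aint bytes)
{
    if (err != MPI_SUCCESS || (bytes > 0 && base == NULL)) {
        fprintf(stderr, "shared-memory allocation failed (err=%d)\n", err);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* Zero the whole segment, which also touches every page up front. */
    if (bytes > 0)
        memset(base, 0, (size_t)bytes);
}
```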
Hi,
Could you please share the command line you are using? We would like to know whether you launch with mpirun, srun, or mpiexec.
Please share the details of the interconnect fabric you are using and the size of the MPI windows in your program.
Also, we recommend you upgrade to the latest version of Intel MPI, as RMA window allocation has been optimized in recent versions.
Could you please also share the details of the application that you are using?
Also, we would like to know at what point you get this error: is it immediately after launching the program, or after a delay of, say, an hour or so?
We also request that you provide details of the NIC you are using.
These details will help us debug the issues you are facing.
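For reference, the MPI library version string and the window size your application requests can be printed with a small snippet like the sketch below (the window_bytes value is only a placeholder for whatever your program actually allocates):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Report the exact MPI library build in use. */
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int len;
        MPI_Get_library_version(version, &len);
        printf("MPI library: %s\n", version);

        /* Placeholder: the shared-window size the application requests. */
        long long window_bytes = 8LL * 1024 * 1024 * 1024;
        printf("Requested window size: %lld bytes\n", window_bytes);
    }

    MPI_Finalize();
    return 0;
}
```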
Thanks
Goutham
Hi,
Could you please let us know if your issue is resolved?
If not, do let us know so that we can help you further.
Regards
Prasanth
Hi,
We are assuming this issue has been resolved and will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community-only.
Regards
Prasanth
