Intel® MPI Library

MPI Bus Error

Pan__Hua
Beginner

I'm developing an MPI application that relies heavily on MPI shared memory. Recently, I keep hitting the following error messages:

 

srun: error: compute-42-013: task 32: Bus error
srun: Terminating job step 324080.0
slurmstepd: error: *** STEP 324080.0 ON compute-42-012 CANCELLED AT 2020-06-14T04:17:51 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
pVelodyne_intel_4  000000000C8E308E  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B370FBBA5D0  Unknown               Unknown  Unknown
pVelodyne_intel_4  000000000397721B  PMPIDI_CH3I_Progr        1040  ch3_progress.c
pVelodyne_intel_4  00000000039FC370  MPIC_Wait                 269  helper_fns.c
pVelodyne_intel_4  00000000039FD83A  MPIC_Sendrecv             580  helper_fns.c
pVelodyne_intel_4  000000000392F61B  MPIR_Allgather_in         257  allgather.c
pVelodyne_intel_4  0000000003931752  MPIR_Allgather            858  allgather.c
pVelodyne_intel_4  0000000003931A77  MPIR_Allgather_im         905  allgather.c
pVelodyne_intel_4  0000000003933226  PMPI_Allgather           1068  allgather.c
pVelodyne_intel_4  000000000392CECE  Unknown               Unknown  Unknown


srun: error: compute-41-006: task 16: Bus error
srun: Terminating job step 324024.0
slurmstepd: error: *** STEP 324024.0 ON compute-41-006 CANCELLED AT 2020-06-13T16:54:13 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
pVelodyne_intel_4  000000000C85058E  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AEBC007A5D0  Unknown               Unknown  Unknown
pVelodyne_intel_4  00000000038E46DB  PMPIDI_CH3I_Progr        1040  ch3_progress.c
pVelodyne_intel_4  0000000003969830  MPIC_Wait                 269  helper_fns.c
pVelodyne_intel_4  000000000396ACFA  MPIC_Sendrecv             580  helper_fns.c
pVelodyne_intel_4  00000000038BA379  MPIR_Alltoall_int         438  alltoall.c
pVelodyne_intel_4  00000000038BBE3D  MPIR_Alltoall             734  alltoall.c
pVelodyne_intel_4  00000000038BC162  MPIR_Alltoall_imp         775  alltoall.c
pVelodyne_intel_4  00000000038BD875  PMPI_Alltoall             958  alltoall.c
 

It seems the bus error occurs inside an MPI routine. Since I do not have the source code of Intel MPI, I have no idea what went wrong.

The Intel MPI version I'm using is intel_parallel_studio/2018u4/compilers_and_libraries_2018.5.274.
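For context, the shared-memory allocation pattern I rely on looks roughly like the sketch below (my application is Fortran, so this C sketch with made-up names and sizes is only an illustration, not my actual code):

#include <mpi.h>
#include <stddef.h>

/* Illustrative only: one large shared-memory segment per node, allocated by
   the node-local rank 0, with every other rank on the node obtaining a
   pointer to it. */
void allocate_shared_dataset(MPI_Comm node_comm, MPI_Aint nbytes,
                             double **base, MPI_Win *win)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Rank 0 on the node allocates the full buffer; the others allocate 0 bytes. */
    MPI_Aint local_size = (node_rank == 0) ? nbytes : 0;
    MPI_Win_allocate_shared(local_size, sizeof(double), MPI_INFO_NULL,
                            node_comm, base, win);

    /* Non-zero ranks query the base address of rank 0's segment. */
    if (node_rank != 0) {
        MPI_Aint size;
        int disp_unit;
        MPI_Win_shared_query(*win, 0, &size, &disp_unit, (void *)base);
    }
}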

Any idea how to fix it? 

 

Thanks.

 

 

PrasanthD_intel
Moderator

Hi,

Since you got a bus error, could you check the memory allocations in your program, i.e., whether the allocated memory exceeds the system's available memory? (See the sketch at the end of this reply.)

Also, there is a limit on the number of communicators in MPI, which is around 32,000. Since each window creation internally consumes a communicator context, this also caps the number of windows you can create at about 32,000.

It looks like the program was terminated by srun, but from the given trace we are not sure why.

Could you provide reproducer code so that we can debug it on our side?

Could you also check your code using ITAC (for analysis) and Intel Inspector (for memory errors)?

For more info on using these tools, refer to this thread: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/623887

Also, we suggest you upgrade to the latest version (2019u7) and check whether the error persists.
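As an illustration of the kind of memory check we mean (the function and variable names below are only an example, not taken from your application), you could compare the total size you plan to place in shared memory on a node against the node's physical memory before allocating:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative sketch only: returns 1 if the planned per-node shared
   allocation fits in physical memory, 0 otherwise (Linux/glibc). */
int shared_alloc_fits(MPI_Comm node_comm, long long planned_bytes)
{
    /* Total physical memory on this node. */
    long long phys_bytes =
        (long long)sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGE_SIZE);

    int fits = planned_bytes <= phys_bytes;

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    if (!fits && node_rank == 0)
        fprintf(stderr,
                "Planned shared allocation (%lld bytes) exceeds physical "
                "memory (%lld bytes) on this node\n",
                planned_bytes, phys_bytes);
    return fits;
}

Note that with memory-mapped shared allocations the failure may only appear when the pages are first written, rather than at allocation time.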

 

Regards

Prasanth

Pan__Hua
Beginner

In my application, I allocate several (fewer than 10) huge MPI shared-memory segments to hold the datasets. It is possible that memory ran out, since I do not see this kind of bus error when the datasets are smaller.

Here is my next question: why didn't the code fail during memory allocation, for example with an OOM kill? In the code, I actually check every memory allocation to make sure it succeeded, and I initialize each buffer to zero. If memory had run out, I would assume the Linux system would kill the process, right?
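Concretely, the check-and-initialize pattern is roughly the following (again a simplified C sketch; my real code is Fortran and the names are made up):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch only: make MPI return errors instead of aborting,
   check the allocation call, then zero the buffer.  The memset is the
   first time the shared pages are actually written. */
void checked_shared_alloc(MPI_Comm node_comm, MPI_Aint nbytes,
                          void **base, MPI_Win *win)
{
    MPI_Comm_set_errhandler(node_comm, MPI_ERRORS_RETURN);

    int err = MPI_Win_allocate_shared(nbytes, 1, MPI_INFO_NULL,
                                      node_comm, base, win);
    if (err != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Win_allocate_shared failed with error %d\n", err);
        MPI_Abort(MPI_COMM_WORLD, err);
    }

    memset(*base, 0, (size_t)nbytes);  /* zero-initialize the dataset */
}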

 

  

GouthamK_Intel
Moderator

Hi,

Could you please share the command line you are using? We would like to know whether you are launching with mpirun, srun, or mpiexec.

Please also share the details of the interconnect fabric you are using and the size of the MPI windows in your program.

Also, we recommend upgrading to the latest version of Intel MPI, as RMA window allocation has been optimized in recent versions.

Could you please also share details of the application you are using?


Also, we would like to know at what point you get this error: is it immediately after launching the program, or after a delay of, say, an hour or so?


Also, we request that you provide details of the NIC you are using.


These details would help us debug the issues you are facing.


Thanks

Goutham


PrasanthD_intel
Moderator

Hi,


Could you please let us know if your issue has been resolved?

If not, do let us know so that we can continue to help you with it.

 

Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


We are assuming this issue has been resolved and will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth

