Peil__Oleg
Beginner
481 Views

Intel Fortran 2019 + MPI cause an unexpected Segmentation Fault [Linux]

    Hello,

The following code example compiled with `mpiifort` produces a segfault error:
 

module test_intel_mpi_mod
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Container
      complex(kind=dp), allocatable :: arr(:, :, :)
   end type

contains
   subroutine test_intel_mpi()
      use mpi_f08, only: &
         MPI_Init_thread, &
         MPI_THREAD_SINGLE, &
         MPI_Finalize, &
         MPI_Comm_rank, &
         MPI_COMM_WORLD, &
         MPI_COMPLEX16, &
         MPI_Bcast

      integer :: provided
      integer :: rank
      type(Container) :: cont

      call MPI_Init_thread(MPI_THREAD_SINGLE, provided)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank)

      allocate(cont % arr(1, 1, 1))

      if (rank == 0) then
         cont % arr(1, 1, 1) = (1.0_dp, 2.0_dp)
      endif

! This works fine --->  call MPI_Bcast(cont % arr(1, 1, 1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD)
      call MPI_Bcast(cont % arr(:, :, 1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD)

      print *, rank, " after Bcast: ", cont % arr(1, 1, 1)
      call MPI_Finalize()
   end subroutine test_intel_mpi
end module test_intel_mpi_mod

program test_mpi
   use test_intel_mpi_mod

   call test_intel_mpi()
end program test_mpi

 

The code is compiled simply as follows: `mpiifort -o test_mpi test_mpi.f90`  and executed as `mpirun -np N ./test_mpi` (N = 1, 2, ...).

The output for N=2 is the following (also `-g -traceback` was added in this case):

           0  after Bcast:  (1.00000000000000,2.00000000000000)
           1  after Bcast:  (1.00000000000000,2.00000000000000)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
test_mpi           000000000041475A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AE5C8C0A5D0  Unknown               Unknown  Unknown
test_mpi           000000000040941D  Unknown               Unknown  Unknown
test_mpi           0000000000409D79  Unknown               Unknown  Unknown
test_mpi           00000000004044C0  test_intel_mpi_mo          44  test_mpi.f90
test_mpi           00000000004044E0  MAIN__                     50  test_mpi.f90
test_mpi           0000000000403BA2  Unknown               Unknown  Unknown
libc-2.17.so       00002AE5C913B3D5  __libc_start_main     Unknown  Unknown
test_mpi           0000000000403AA9  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
test_mpi           000000000041475A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AB3DEFB75D0  Unknown               Unknown  Unknown
test_mpi           000000000040941D  Unknown               Unknown  Unknown
test_mpi           0000000000409D79  Unknown               Unknown  Unknown
test_mpi           00000000004044C0  test_intel_mpi_mo          44  test_mpi.f90
test_mpi           00000000004044E0  MAIN__                     50  test_mpi.f90
test_mpi           0000000000403BA2  Unknown               Unknown  Unknown
libc-2.17.so       00002AB3DF4E83D5  __libc_start_main     Unknown  Unknown
test_mpi           0000000000403AA9  Unknown               Unknown  Unknown

 

The program crashes when it tries to exit the subroutine. The problem seems to be related to passing of the array section, `cont % arr(:, :, 1)`, to MPI_Bcast, as opposed to a reference to the first element, `cont % arr(1, 1, 1)` (this version of the call is left commented in the source code provided). At the same time, my understanding of the standard is that array sections, contiguous or not, are explicitly allowed in MPI 3.x (e.g., see https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node409.htm).
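For readers who want to check what their own MPI installation advertises: MPI-3.1 defines the compile-time constant `MPI_SUBARRAYS_SUPPORTED`, which, as I read the standard, indicates whether choice buffers are declared with assumed type and assumed rank so that array sections can be passed through a descriptor. Below is a minimal sketch that prints it together with `MPI_ASYNC_PROTECTS_NONBLOCKING`. It assumes an MPI-3 library that provides the `mpi_f08` module; the program name is made up for illustration.

```fortran
program check_subarrays
   ! Sketch: print the two MPI-3 compile-time constants that describe
   ! how the Fortran bindings treat choice buffers.
   ! Both are named constants, so no MPI_Init/MPI_Finalize is needed here.
   use mpi_f08, only: MPI_SUBARRAYS_SUPPORTED, MPI_ASYNC_PROTECTS_NONBLOCKING
   implicit none

   print '(a, l1)', "MPI_SUBARRAYS_SUPPORTED        = ", MPI_SUBARRAYS_SUPPORTED
   print '(a, l1)', "MPI_ASYNC_PROTECTS_NONBLOCKING = ", MPI_ASYNC_PROTECTS_NONBLOCKING
end program check_subarrays
```

It should build with the same wrapper as the reproducer (for example `mpiifort`). If the first value is reported as true, passing `cont % arr(:, :, 1)` directly is within what the library claims to support.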

Details in the source are important to reproduce the segfault:

  • The crash happens only if MPI_Bcast is called -- commenting it out prevents the error
  • The subroutine must be in a module
  • The array must be at least 3-dimensional, allocatable, and be contained in a derived type object
  • Non-blocking MPI_Ibcast, as well as other collectives implying broadcast (e.g., Allreduce) give the same result

Compiler/library versions:

Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.217 Build 20200306

IntelMPI is from the same build: 2019.7.pre-intel-19.1.0.166-7

Output with I_MPI_DEBUG=6:

[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: psm2
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       278604   l49        {0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23}
[0] MPI startup(): 1       278605   l49        {8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31}
[0] MPI startup(): I_MPI_ROOT=....
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=6

OS: CentOS Linux release 7.6.1810 (Core)

Kernel: 3.10.0-957.10.1.el7.x86_64

12 Replies
PrasanthD_intel
Moderator
481 Views

Hi Oleg,

We tried the code and reproduced the same at our end. 

Can you share a use case as to why you want to send the array to MPI_Bcast like this arr(:,:, 1)?

We will investigate this further at our end and get back to you.

 

Thanks

Prasanth

Peil__Oleg
Beginner
481 Views

A couple of points to clarify the example:

1) The error first appeared in a large project with an internal library (a set of modules) for supporting MPI communication of various data types. Most of the data types are derived types containing a 1D, 2D, or 3D allocatable array. In 1D cases, the lower bound of the arrays could be either 0 or 1. It was, thus, more natural to send a section `obj % arr(:)` with the number of elements set to `size(obj % arr, 1)` to avoid extra checks. The same style was employed for 2D array subsections [although the lower bounds for these arrays are always (1:, 1:)]. One should say that passing 1D sections to collectives has never caused any trouble.

2) In the end, the workaround in our case is rather simple (use `obj % arr(1, 1)` as a buffer reference in collective operations), but the error is rather surprising, and it took some time to figure out that it was not due to a bug in the code itself. By the way, in the real project the error occurred after a call to `MPI_Iallreduce()`, and the first suspicion was that there was something wrong with asynchronous communications (side effects or spurious communication overlap). However, after some investigation the problem was traced back to the strange behavior of `MPI_Ibcast` or `MPI_Bcast`, and it did not matter which one of them was used.

The error was not reproducible with GNU Fortran + OpenMPI.
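The first-element workaround can also be made independent of the lower bound by asking `lbound` for it, which avoids the "0 or 1" check mentioned above. The following is only a sketch: `fake_bcast` is a hypothetical stand-in for an MPI collective (it is not an MPI routine), and `Vec` is an illustrative type, so the example runs without any MPI library.

```fortran
module first_element_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Vec
      real(kind=dp), allocatable :: arr(:)
   end type Vec

contains

   ! Hypothetical stand-in for an MPI collective: it receives the buffer
   ! through its first element (sequence association) plus a count.
   subroutine fake_bcast(buf, n)
      integer, intent(in) :: n
      real(kind=dp), intent(inout) :: buf(n)
      buf = buf + 1.0_dp   ! visible side effect so the caller can check it
   end subroutine fake_bcast

end module first_element_demo

program first_element_test
   use first_element_demo
   implicit none
   type(Vec) :: a, b

   allocate(a % arr(0:3), b % arr(1:4))
   a % arr = 0.0_dp
   b % arr = 0.0_dp

   ! lbound() selects the first element whatever the lower bound is
   call fake_bcast(a % arr(lbound(a % arr, 1)), size(a % arr))
   call fake_bcast(b % arr(lbound(b % arr, 1)), size(b % arr))

   if (all(a % arr == 1.0_dp) .and. all(b % arr == 1.0_dp)) then
      print '(a)', "both buffers fully updated"
   else
      error stop "unexpected buffer contents"
   end if
end program first_element_test
```

In a real call the same expression, `obj % arr(lbound(obj % arr, 1))`, would take the place of the buffer argument, with `size(obj % arr)` as the count.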
jimdempseyatthecove
Black Belt
481 Views

Alternate work around:

module test_intel_mpi_mod
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Container
      complex(kind=dp), allocatable :: arr(:, :, :)
   end type

contains
   subroutine test_intel_mpi()
!      use mpi_f08, only: &
!         MPI_Init_thread, &
!         MPI_THREAD_SINGLE, &
!         MPI_Finalize, &
!         MPI_Comm_rank, &
!         MPI_COMM_WORLD, &
!         MPI_COMPLEX16, &
!         MPI_Bcast
      use mpi

      integer :: provided
      integer :: rank
      type(Container) :: cont
      integer :: ierror
      
      call MPI_Init_thread(MPI_THREAD_SINGLE, provided, ierror)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)

      allocate(cont % arr(1, 1, 1))

      if (rank == 0) then
         cont % arr(1, 1, 1) = (1.0_dp, 2.0_dp)
      endif

! This works fine --->  call MPI_Bcast(cont % arr(1, 1, 1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD)
!*bug* call MPI_Bcast(cont % arr(:, :, 1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD, ierror)
      call my_Bcast(cont % arr(:, :, 1), ierror)
      print *, rank, " after Bcast: ", cont % arr(1, 1, 1)
      call MPI_Finalize(ierror)
   end subroutine test_intel_mpi
   subroutine my_Bcast(arr, ierror)
      use mpi
      complex(kind=dp) :: arr(*)
      integer :: ierror
      call MPI_Bcast(arr(1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD, ierror)
   end subroutine my_Bcast
end module test_intel_mpi_mod
    

I used Intel's `mpi` module in the test. In my_Bcast, arr(1) should correspond to the first element of cont%arr(:,:,1) in the caller.

Jim Dempsey

Peil__Oleg
Beginner
481 Views

jimdempseyatthecove (Blackbelt) wrote:

Alternate work around:

Thanks for the tip. Yes, this could be an option. It is likely that any workaround with a mapping onto an effective 1D array will work.
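One way to obtain such an effective 1D mapping without a wrapper routine is pointer bounds remapping (Fortran 2008), which is allowed because the slab `arr(:, :, 1)` of an allocatable array is contiguous. This is a sketch under assumptions, not something tested against the Intel compiler in this thread: the container must be declared with `target`, and the name `flat` and the array extents are illustrative.

```fortran
program flat_view_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Container
      complex(kind=dp), allocatable :: arr(:, :, :)
   end type Container

   ! TARGET on the object also gives its allocatable component the attribute
   type(Container), target :: cont
   complex(kind=dp), pointer :: flat(:)
   integer :: n

   allocate(cont % arr(2, 3, 4))
   cont % arr = (0.0_dp, 0.0_dp)

   ! Rank-1 view of the first 2x3 slab (no copy is made)
   n = size(cont % arr, 1) * size(cont % arr, 2)
   flat(1:n) => cont % arr(:, :, 1)

   ! Writing through the view must change the slab and nothing else
   flat = (1.0_dp, 2.0_dp)

   if (all(cont % arr(:, :, 1) == (1.0_dp, 2.0_dp)) .and. &
       all(cont % arr(:, :, 2:) == (0.0_dp, 0.0_dp))) then
      print '(a, i0, a)', "flat view of ", n, " elements aliases the first slab only"
   else
      error stop "flat view does not alias the expected elements"
   end if
end program flat_view_demo
```

The rank-1 pointer `flat` could then be handed to the collective with count `n`, in the same way 1D sections were passed before without trouble.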
jimdempseyatthecove
Black Belt
481 Views

After I wrote that, I realized this would be much better:

ASSOCIATE (blob => cont % arr(:, :, 1))
call MPI_Bcast(blob, 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD, ierror)
END ASSOCIATE

Jim Dempsey
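A short sketch of why the `ASSOCIATE` form is usable as a communication buffer: when the selector is a variable such as an array section, the associate name refers to that section itself, so writes through it reach the original array. The routine `fill` below is a hypothetical stand-in for the MPI call, and the extents are illustrative, so the example runs without MPI.

```fortran
program associate_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Container
      real(kind=dp), allocatable :: arr(:, :, :)
   end type Container

   type(Container) :: cont

   allocate(cont % arr(2, 2, 2))
   cont % arr = 0.0_dp

   ! blob is a rank-2 name for the section, not a temporary copy
   associate (blob => cont % arr(:, :, 1))
      call fill(blob)
   end associate

   if (all(cont % arr(:, :, 1) == 42.0_dp) .and. &
       all(cont % arr(:, :, 2) == 0.0_dp)) then
      print '(a)', "section updated through the associate name"
   else
      error stop "associate name did not alias the section"
   end if

contains

   ! Hypothetical stand-in for a routine that writes into its buffer
   subroutine fill(buf)
      real(kind=dp), intent(inout) :: buf(:, :)
      buf = 42.0_dp
   end subroutine fill

end program associate_demo
```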

 

Peil__Oleg
Beginner
481 Views

jimdempseyatthecove (Blackbelt) wrote:

After I wrote that, I realized this would be much better:

ASSOCIATE (blob => cont % arr(:, :, 1))
call MPI_Bcast(blob, 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD, ierror)
END ASSOCIATE

Jim Dempsey

 

This workaround indeed works. And it is definitely more elegant. I wonder what makes the compiler fail with the explicit expression.
Peil__Oleg
Beginner
481 Views

If this helps, below is another minimal example that explicitly demonstrates that the problem is in the Intel compiler itself and has nothing to do with the MPI library. The issue seems to be related to the implementation of assumed-rank objects (introduced in TS 29113 and part of Fortran 2018, previously referred to as Fortran 2015), which the `mpi_f08` interface relies on.
module test_assumed_rank_mod
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Container
      real(kind=dp), allocatable :: arr(:, :, :)
   end type

   interface
      subroutine c_fun(arr, n) bind(C, name='c_fun')
         use iso_c_binding, only: c_ptr, c_int
         type(c_ptr), value :: arr
         integer(c_int), value :: n
      end subroutine
   end interface
contains
   subroutine fun(arr, n)
      use iso_c_binding, only: c_ptr, c_int, c_loc

      !> arr will be treated as an array of real(dp)
      type(*), intent(inout) :: arr(..)
      integer, intent(in) :: n

      type(c_ptr) :: p_arr
      integer(c_int) :: c_n
      p_arr = c_loc(arr)
      c_n = n

      call c_fun(p_arr, c_n)
   end subroutine fun

   subroutine test_assumed_rank_arg()
      type(Container) :: cont

      allocate(cont % arr(1, 1, 1))

      cont % arr(1, 1, 1) = 42.0_dp

      print *, " before `c_fun`: ", cont % arr(1, 1, 1)

!---> works      call fun(cont % arr(:, 1, 1), 1)
      call fun(cont % arr(:, :, 1), 1)

      print *, " after `c_fun`: ", cont % arr(1, 1, 1)
   end subroutine test_assumed_rank_arg
end module test_assumed_rank_mod

program test_poly
   use test_assumed_rank_mod
   call test_assumed_rank_arg()
end program
The C-function is defined as follows (in c_fun.c):
#include <stdio.h>

void c_fun(void *p, int n) {
   // Assume that `p` points to an array of double
   double *p_arr = (double *)p;

   printf("n = %d\n", n);
   printf("`p` contains: %lf\n", *p_arr);
}
Compiled together, this results in the same behavior as in the MPI example. Should I re-post this to the Intel-Compiler forum instead?
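To narrow down what the compiler hands over for each section, one could also print what the assumed-rank dummy reports about its actual argument. The sketch below uses only standard inquiry intrinsics (`rank`, `shape`, `is_contiguous`); the names `inspect` and `inspect_mod` are hypothetical, and it has not been run against the affected Intel version, so it is a diagnostic idea rather than a confirmed reproducer.

```fortran
module inspect_mod
   implicit none
contains
   ! Print what the assumed-type, assumed-rank dummy reports about
   ! the actual argument it was associated with.
   subroutine inspect(arr, label)
      type(*), intent(in) :: arr(..)
      character(len=*), intent(in) :: label

      print '(a, ": rank = ", i0, ", contiguous = ", l1)', &
         label, rank(arr), is_contiguous(arr)
      print '(a, ": shape =", *(1x, i0))', label, shape(arr)
   end subroutine inspect
end module inspect_mod

program inspect_demo
   use inspect_mod
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Container
      real(kind=dp), allocatable :: arr(:, :, :)
   end type Container

   type(Container) :: cont

   allocate(cont % arr(1, 1, 1))
   cont % arr = 42.0_dp

   ! The rank-1 section works in the example above, the rank-2 one crashes
   call inspect(cont % arr(:, 1, 1), "arr(:, 1, 1)")
   call inspect(cont % arr(:, :, 1), "arr(:, :, 1)")
end program inspect_demo
```

For a 1x1x1 array a conforming compiler should report rank 1 and rank 2 respectively, with every extent equal to 1; anything else would point at the descriptor that is built for the call.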
jimdempseyatthecove
Black Belt
481 Views

>>Should I re-post this to the Intel-Compiler forum instead?

Prasanth should be able to take it from here.

If Prasanth does not follow up, then on the forum page where you select among threads (not inside a reply to this thread) there is a toolbar button to report a bug. You can click on that and then post your sample (or a hyperlink to this thread) as a bug report.

Jim Dempsey

jimdempseyatthecove
Black Belt
481 Views

???? The Report Bug button is now missing ???

Make a new posting on the Fortran forum. Include a hyperlink to this thread.

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/852075

Jim Dempsey

Peil__Oleg
Beginner
481 Views

I have made a separate post on the Fortran forum: https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/852299

P.S. Not only am I unable to see a bug report button, but all the HTML markup buttons have disappeared as well. Is it only my browser, or is something really going on with the forum page?
jimdempseyatthecove
Black Belt
481 Views

>>but also all HTML markup buttons have disappeared.

I see that occurring occasionally too.

When that happens, (with browser in focus) I press Ctrl-N to open a new browser window at the same URL. The new window has the buttons back (then close older browser session).

Jim Dempsey

PrasanthD_intel
Moderator
481 Views

Hi Oleg,

Since you have raised a separate thread on the Fortran forum, we are closing this thread here.

If needed you can always start a new thread in this forum.

Regards

Prasanth

 
