Michael_R_2
Beginner

Bug in mpi_f08-module with IMPI-5.1.1.109&Intel-16.0.0 on Linux-cluster

Dear developers of IMPI

 

I observed a bug when using the mpi_f08 module with IMPI-5.1.1.109 and Intel-16.0.0 while running a Fortran program with 2 MPI processes on a Linux cluster.

Data are not transmitted correctly by MPI_GATHER and by MPI_BCAST.

 

 

a) MPI_GATHER: The following bug occurred only with mpi_f08-module, whereas with mpi-module it worked

 

A simplified code snippet looks like this:

 

    integer, parameter :: mxhostlen=128
    character(len=mxhostlen) :: HOST_NAME
    character(len=mxhostlen), allocatable, dimension(:) :: nodename_from_irankWORLD

    ! Note: numprocsWORLD is the number of MPI-procs running
    if(lmaster) then   ! <-- rank 0
       allocate( nodename_from_irankWORLD(0:numprocsWORLD-1) )
    else
       allocate( nodename_from_irankWORLD(0:0) )   ! for saving storage on the slave procs
    endif

    call MPI_GATHER( HOST_NAME , mxhostlen, MPI_CHARACTER &
                    ,nodename_from_irankWORLD, mxhostlen, MPI_CHARACTER &
                    ,0_INT4, MPI_COMM_WORLD, ierr_mpi )

 

Using this for gathering the hostnames from each process on the master process, I get:

 

[0] Fatal error in PMPI_Gather: Message truncated, error stack:
[0] PMPI_Gather(1303).......: MPI_Gather(sbuf=0xa8e160, scount=128, MPI_CHARACTER, rbuf=0x26402d0, rcount=1, MPI_CHARACTER, root=0, MPI_COMM_WORLD) failed
[0] MPIR_Gather_impl(728)...:
[0] MPIR_Gather(682)........:
[0] I_MPIR_Gather_intra(822):
[0] MPIR_Gather_intra(187)..:
[0] MPIR_Localcopy(125).....: Message truncated; 128 bytes received but buffer size is 1

 

As you can see, the reported value of rcount is 1, but it should be 128 (= mxhostlen).

 

If I change the call of MPI_GATHER to this statement:

    call MPI_GATHER( HOST_NAME , mxhostlen, MPI_CHARACTER &
                    ,nodename_from_irankWORLD(0), mxhostlen, MPI_CHARACTER &
                    ,0_INT4, MPI_COMM_WORLD, ierr_mpi )

then it works. Nevertheless, this is still a bug: it must not make any difference whether the receiving choice buffer is passed as the whole array, as its first array element, or as a (sufficiently long) scalar variable, since all of these give the same starting address.
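For anyone who wants to check this on their own installation, the gather case can be sketched as a complete program. This is only a minimal sketch under some assumptions: the snippet above does not show how HOST_NAME is filled, so MPI_Get_processor_name is used here as a stand-in, and the names gather_hostnames, nodenames, and procname are made up for the example. The root is passed as a plain default integer rather than 0_INT4, because the INT4 kind is not defined in the snippet.

```fortran
program gather_hostnames
  use mpi_f08
  implicit none

  integer, parameter :: mxhostlen = 128
  character(len=mxhostlen) :: host_name
  character(len=mxhostlen), allocatable :: nodenames(:)
  character(len=MPI_MAX_PROCESSOR_NAME) :: procname   ! assumption: source of the host name
  integer :: myrank, numprocs, namelen, irank, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, numprocs, ierr)

  ! Fill the fixed-length send buffer; the assignment pads with blanks.
  call MPI_Get_processor_name(procname, namelen, ierr)
  host_name = procname(1:min(namelen, mxhostlen))

  ! Only the root needs the full receive buffer.
  if (myrank == 0) then
     allocate(nodenames(0:numprocs-1))
  else
     allocate(nodenames(0:0))
  end if

  ! Whole-array receive buffer: the variant reported as failing above.
  ! Passing nodenames(0) instead is the variant reported as working.
  call MPI_Gather(host_name, mxhostlen, MPI_CHARACTER, &
                  nodenames, mxhostlen, MPI_CHARACTER, &
                  0, MPI_COMM_WORLD, ierr)

  if (myrank == 0) then
     do irank = 0, numprocs - 1
        print '(a,i0,a,a)', 'rank ', irank, ': ', trim(nodenames(irank))
     end do
  end if

  deallocate(nodenames)
  call MPI_Finalize(ierr)
end program gather_hostnames
```

Assuming the file is saved as gather_hostnames.f90, it can be built with the Intel MPI compiler wrapper (mpiifort gather_hostnames.f90) and started with mpirun -n 2 ./a.out. On an unaffected installation, rank 0 should print one host name per rank; on an affected one, the "Message truncated" abort shown above would be expected instead.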

 

 

 

b) MPI_BCAST: the following bug occurred only with mpi_f08-module, whereas with mpi-module it worked

 

A simplified code snippet looks like this:

 

    integer, parameter :: mxpathlen=512
    character(len=mxpathlen), save :: CONF_DIR
    character(len=mxpathlen), dimension(1) :: cbuffarr
    integer :: nelem, lenelem, lentot, ierr_mpi

    nelem=1
    lenelem=mxpathlen
    lentot= nelem * lenelem   ! total number of characters to be transmitted

    !!! call MPI_BCAST( CONF_DIR , lentot, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr_mpi )   ! <--did work
    cbuffarr(1)= CONF_DIR
    call MPI_BCAST( cbuffarr , lentot, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr_mpi )       ! <--did not work
    !!! call MPI_BCAST( cbuffarr(1), lentot, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr_mpi ) ! <--did work
    CONF_DIR=cbuffarr(1)

 

Using this to transmit a string from the master to all slaves, I get no error message, but the string sent by the master does not arrive on the slaves!
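Because the broadcast case fails silently, a reproducer is more useful if it checks the result itself. The following is a minimal sketch under stated assumptions: the path string, the program name bcast_string, and the variable expected are made up for the example, and the check simply compares what each rank holds after the broadcast with the value set on rank 0.

```fortran
program bcast_string
  use mpi_f08
  implicit none

  integer, parameter :: mxpathlen = 512
  ! Hypothetical test value; any non-blank string would do.
  character(len=*), parameter :: expected = '/hypothetical/config/dir'
  character(len=mxpathlen) :: conf_dir
  character(len=mxpathlen), dimension(1) :: cbuffarr
  integer :: myrank, lentot, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  ! Only rank 0 knows the string before the broadcast.
  if (myrank == 0) then
     conf_dir = expected
  else
     conf_dir = ' '
  end if

  cbuffarr(1) = conf_dir
  lentot = size(cbuffarr) * len(cbuffarr)   ! total number of characters

  ! Whole-array buffer: the variant reported as not working above.
  ! Passing cbuffarr(1) or conf_dir is the variant reported as working.
  call MPI_Bcast(cbuffarr, lentot, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)
  conf_dir = cbuffarr(1)

  if (trim(conf_dir) == expected) then
     print '(a,i0,a)', 'rank ', myrank, ': string received correctly'
  else
     print '(a,i0,a)', 'rank ', myrank, ': string NOT received'
  end if

  call MPI_Finalize(ierr)
end program bcast_string
```

Run with 2 processes, every rank should report that the string was received correctly; a "NOT received" line on a non-root rank would correspond to the behaviour described above.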

 

 

Possibly these bugs are also in the interfaces of other MPI-routines of the mpi_f08 module?

 

Greetings

Michael

2 Replies
Gergana_S_Intel
Employee

Hi Michael,

Thanks for letting us know. We recently received another issue with a similar root cause, so I've added your name and reproducer to our internal bug. The problem seems to stem from the Intel Fortran compiler's newly added support for F08. I'll update again once the Fortran engineers have had a chance to investigate.

Regards,
~Gergana

Gergana_S_Intel
Employee

Hi Michael,

Just a heads-up since we've had a development on this internally. The Intel Fortran Compiler team determined what the issue was and has fixed it in an internal build. This fix for F08 will be included in one of the upcoming releases of the compiler (likely an update to the 16.0 version).

Once the next release is out, you're welcome to download it and test it.

Regards,
~Gergana
