Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI-IO issue on devCloud

Matthew_Grismer
Beginner

The latest Intel MPI on the devCloud appears to have an MPI-IO issue that is somehow related to reading overlapped data with MPI_FILE_READ_ALL. I get the following segmentation fault when running the attached program on 9 or more processes:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image              PC                Routine            Line        Source
a.out              00000000004064EA  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FDCC59923C0  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC4772850  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC45AE6B9  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC481CBEF  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC490D7C9  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC490613D  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC426B023  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC482974A  PMPI_Waitall          Unknown  Unknown
libmpi.so.12.0.0   00007FDCC4115803  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC411443D  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC4B69324  PMPI_File_read_al     Unknown  Unknown
libmpifort.so.12.  00007FDCC626F458  pmpi_file_read_al     Unknown  Unknown
a.out              000000000040535C  Unknown               Unknown  Unknown
a.out              0000000000404922  Unknown               Unknown  Unknown
libc-2.31.so       00007FDCC57B20B3  __libc_start_main     Unknown  Unknown
a.out              000000000040482E  Unknown               Unknown  Unknown

 

The program simply writes a data file containing integers and double-precision reals and then reads it back in. This mimics what the actual code I am trying to use on the devCloud does. During both the write and the read, some of the processes access overlapping regions of the file.

 

Compile with:

ifx test_mpiio3.f90 -lmpifort

Run with:

mpirun -np 9 ./a.out
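
In case the attachment is not to hand, the pattern it exercises looks roughly like the sketch below. This is not the attached test_mpiio3.f90: the file name, block size, overlap width, and the use of double precision only are placeholder assumptions, and the overlap here is only on the read side.

! Minimal sketch (not the attached test_mpiio3.f90): every rank writes a
! contiguous block of doubles, then reads its block back collectively, with
! ranks > 0 also re-reading the tail of the previous rank's block so the
! collective read has overlapping regions. Sizes and file name are placeholders.
program mpiio_overlap_sketch
  use mpi
  implicit none
  integer, parameter :: nloc = 1000, nover = 10
  integer :: ierr, rank, fh, rcount
  integer(kind=MPI_OFFSET_KIND) :: wdisp, rdisp
  double precision :: wbuf(nloc), rbuf(nloc + nover)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  wbuf = dble(rank)

  ! Non-overlapping collective write: rank r owns bytes [r*nloc*8, (r+1)*nloc*8).
  wdisp = int(rank, MPI_OFFSET_KIND) * nloc * 8
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'sketch.dat', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
  call MPI_FILE_SET_VIEW(fh, wdisp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                         'native', MPI_INFO_NULL, ierr)
  call MPI_FILE_WRITE_ALL(fh, wbuf, nloc, MPI_DOUBLE_PRECISION, &
                          MPI_STATUS_IGNORE, ierr)
  call MPI_FILE_CLOSE(fh, ierr)

  ! Overlapped collective read: ranks > 0 start nover values early, so their
  ! reads overlap the previous rank's data.
  if (rank > 0) then
     rdisp  = wdisp - nover * 8
     rcount = nloc + nover
  else
     rdisp  = wdisp
     rcount = nloc
  end if
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'sketch.dat', MPI_MODE_RDONLY, &
                     MPI_INFO_NULL, fh, ierr)
  call MPI_FILE_SET_VIEW(fh, rdisp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                         'native', MPI_INFO_NULL, ierr)
  call MPI_FILE_READ_ALL(fh, rbuf, rcount, MPI_DOUBLE_PRECISION, &
                         MPI_STATUS_IGNORE, ierr)
  call MPI_FILE_CLOSE(fh, ierr)

  call MPI_FINALIZE(ierr)
end program mpiio_overlap_sketch

The sketch should build and run the same way as the attached test.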

 

Matt

ArunJ_Intel
Moderator

Hi Matt,

 

I was able to reproduce the issue on certain devCloud nodes. Could you please try the e-2176g (Coffee Lake) nodes on the devCloud, where I observed the command works without issues? Please find below the command to request a Coffee Lake node.

qsub -l nodes=1:e-2176g:ppn=2 -d . -I

 

Thanks

Arun

 

ArunJ_Intel
Moderator

Hi Matt,


Have you tried running the application on a Coffee Lake machine?


Thanks

Arun


Matthew_Grismer
Beginner

Arun,

Yes, I verified my test program works on Coffee Lake too. Unfortunately, the code I am actually trying to run still does not work on Coffee Lake, so my example apparently does not capture all of the issues. What I can tell you is that I am able to get the main code working by adjusting the MPI_FILE_READ_ALL statements. The original code looks like this:

      CALL MPI_FILE_SET_VIEW ( fid, rstdisp, MPI_RP, reading_type(1), 'NATIVE', MPI_INFO_NULL, ierr )
      CALL MPI_FILE_READ_ALL ( fid,    x,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid,    y,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid,    z,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid, varg, nvar, reading_type(2), MPI_STATUS_IGNORE, ierr )
      IF ( itimeint >= 2 ) CALL MPI_FILE_READ_ALL ( fid,   qold, nvar, reading_type(2), MPI_STATUS_IGNORE, ierr )

The first three READ_ALL statements, where "1" element of the 3-dimensional datatype is read, work fine; the issue occurs in the fourth statement, where "nvar" elements of the 3-dimensional datatype are read into a 4D array. I am able to make it work by replacing the fourth (and fifth) READ_ALL with an individual statement for each 3D entry:

      CALL MPI_FILE_SET_VIEW ( fid, rstdisp, MPI_RP, reading_type(1), 'NATIVE', MPI_INFO_NULL, ierr )
      CALL MPI_FILE_READ_ALL ( fid,    x,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid,    y,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid,    z,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,1), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,2), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,3), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,4), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,5), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      IF ( itimeint >= 2 ) then
        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,1), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,2), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,3), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,4), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,5), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
      END IF

It may be that my example code does not generate a large enough dataset, or does not have enough overlap in the data, to exhibit the error on Coffee Lake. But hopefully it is still useful for tracking down the issue.
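
For reference, the failing form boils down to a count greater than 1 on a 3-D derived datatype, read into a 4-D buffer with a single collective call. A hedged sketch of that pattern follows; the real reading_type(1)/reading_type(2) are not shown in this thread, so the subarray construction, grid sizes, and ghost-cell width below are illustrative assumptions only.

! Hedged sketch of the count > 1 pattern: one collective read of nvar 3-D
! blocks into a 4-D array. Decomposition, grid sizes, and ghost width (ng)
! are assumptions, not the actual reading_type(1)/(2) from the main code.
program count_gt_one_sketch
  use mpi
  implicit none
  integer, parameter :: nvar = 5, ng = 1       ! ng = assumed ghost-cell width
  integer :: ierr, rank, nprocs, fh, ftype, mtype
  integer, dimension(3) :: gsize, lsize, fstart, msize, mstart
  integer(kind=MPI_OFFSET_KIND) :: disp
  double precision, allocatable :: varg(:,:,:,:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Assumed 1-D decomposition of a 36x24x24 grid; gsize(1) must divide evenly
  ! by the number of ranks.
  gsize  = (/ 36, 24, 24 /)
  lsize  = (/ gsize(1) / nprocs, gsize(2), gsize(3) /)
  fstart = (/ rank * lsize(1), 0, 0 /)
  allocate(varg(lsize(1) + 2*ng, lsize(2) + 2*ng, lsize(3) + 2*ng, nvar))
  varg = dble(rank)

  ! File side: this rank's block of one global 3-D field.
  call MPI_TYPE_CREATE_SUBARRAY(3, gsize, lsize, fstart, MPI_ORDER_FORTRAN, &
                                MPI_DOUBLE_PRECISION, ftype, ierr)
  call MPI_TYPE_COMMIT(ftype, ierr)

  ! Memory side: the interior (non-ghost) part of one 3-D slice of varg.
  msize  = lsize + 2*ng
  mstart = (/ ng, ng, ng /)
  call MPI_TYPE_CREATE_SUBARRAY(3, msize, lsize, mstart, MPI_ORDER_FORTRAN, &
                                MPI_DOUBLE_PRECISION, mtype, ierr)
  call MPI_TYPE_COMMIT(mtype, ierr)

  disp = 0_MPI_OFFSET_KIND
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'sketch4d.dat', &
                     MPI_MODE_CREATE + MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr)
  call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, ftype, 'native', &
                         MPI_INFO_NULL, ierr)

  ! Write nvar fields, then read them back, each with ONE collective call
  ! using count = nvar of the 3-D memory type -- the form that segfaults in
  ! the main code when count > 1.
  call MPI_FILE_WRITE_ALL(fh, varg, nvar, mtype, MPI_STATUS_IGNORE, ierr)
  call MPI_FILE_SEEK(fh, 0_MPI_OFFSET_KIND, MPI_SEEK_SET, ierr)
  call MPI_FILE_READ_ALL(fh, varg, nvar, mtype, MPI_STATUS_IGNORE, ierr)

  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_FINALIZE(ierr)
end program count_gt_one_sketch

The workaround in the main code corresponds to replacing the last two calls with one call per 3-D slice, each with count = 1.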

 

Matt

SantoshY_Intel
Moderator

Hi,


We have reported this issue to the development team, and they are looking into it.


Thanks & Regards,

Santosh

