Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI-IO issue on devCloud

Matthew_Grismer
Beginner

The latest Intel MPI on the DevCloud appears to have an issue with MPI-IO, somehow related to reading overlapped data with MPI_FILE_READ_ALL. I get the following segmentation fault when running the attached program on 9 or more processes:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
a.out              00000000004064EA  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FDCC59923C0  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC4772850  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC45AE6B9  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC481CBEF  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC490D7C9  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC490613D  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC426B023  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC482974A  PMPI_Waitall          Unknown  Unknown
libmpi.so.12.0.0   00007FDCC4115803  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC411443D  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FDCC4B69324  PMPI_File_read_al     Unknown  Unknown
libmpifort.so.12.  00007FDCC626F458  pmpi_file_read_al     Unknown  Unknown
a.out              000000000040535C  Unknown               Unknown  Unknown
a.out              0000000000404922  Unknown               Unknown  Unknown
libc-2.31.so       00007FDCC57B20B3  __libc_start_main     Unknown  Unknown
a.out              000000000040482E  Unknown               Unknown  Unknown

 

The program simply writes a data file containing integers and double-precision reals, then reads it back in again. This mimics what the actual code I am trying to use on the DevCloud does. During both the write and the read, some of the processes access overlapping regions of the file.
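The failing pattern can be illustrated with a hypothetical minimal sketch (this is NOT the attached test_mpiio3.f90; the file name, extents, and overlap width are all illustrative). Each rank builds a subarray file view whose region overlaps its neighbours' by two elements, then all ranks read collectively:

```fortran
! Hypothetical sketch of overlapped collective reads -- an assumption of
! the pattern described above, not the attached reproducer.
program mpiio_overlap_sketch
  use mpi
  implicit none
  integer, parameter :: n = 8            ! local extent per rank (assumed)
  integer :: ierr, rank, nprocs, fid, ftype
  integer :: gsizes(1), lsizes(1), starts(1)
  double precision, allocatable :: buf(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Interior ranks start one element early and read up to n+2 values,
  ! so adjacent ranks' file regions overlap.
  gsizes(1) = n * nprocs
  starts(1) = max(rank * n - 1, 0)
  lsizes(1) = min(n + 2, gsizes(1) - starts(1))
  allocate (buf(lsizes(1)))

  call MPI_TYPE_CREATE_SUBARRAY(1, gsizes, lsizes, starts, &
       MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, ftype, ierr)
  call MPI_TYPE_COMMIT(ftype, ierr)

  ! Assumes a file 'data.bin' of n*nprocs doubles was written beforehand.
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'data.bin', MPI_MODE_RDONLY, &
       MPI_INFO_NULL, fid, ierr)
  call MPI_FILE_SET_VIEW(fid, 0_MPI_OFFSET_KIND, MPI_DOUBLE_PRECISION, &
       ftype, 'native', MPI_INFO_NULL, ierr)
  ! Collective read of the overlapped regions -- the call that faults.
  call MPI_FILE_READ_ALL(fid, buf, lsizes(1), MPI_DOUBLE_PRECISION, &
       MPI_STATUS_IGNORE, ierr)
  call MPI_FILE_CLOSE(fid, ierr)

  call MPI_TYPE_FREE(ftype, ierr)
  call MPI_FINALIZE(ierr)
end program mpiio_overlap_sketch
```

Overlapping file views are legal for reads under the MPI standard, which is why a segmentation fault here points at the implementation rather than the program.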

 

Compile with:

 

ifx test_mpiio3.f90 -lmpifort

 

Run with:

 

mpirun -np 9 ./a.out 

 

Matt

ArunJ_Intel
Moderator

Hi Matt,

 

Thanks for reaching out to us.

I was able to reproduce the issue on certain DevCloud nodes. Could you please try the e-2176g (Coffee Lake) nodes, where I observed that the command works without issues? Use the command below to request a Coffee Lake node.

 

qsub -l nodes=1:e-2176g:ppn=2 -d . -I

 

 

Thanks

Arun

 

ArunJ_Intel
Moderator

Hi Matt,


Have you tried running the application on a Coffee Lake machine?


Thanks

Arun


Matthew_Grismer
Beginner

Arun,

Yes, I verified my test program works on Coffee Lake too. Unfortunately, the code I am actually trying to run still does not work on Coffee Lake, so my example apparently does not capture all the issues. What I can tell you is that I am able to get the main code working by adjusting the MPI_FILE_READ_ALL statements. The original code looks like this:

      CALL MPI_FILE_SET_VIEW ( fid, rstdisp, MPI_RP, reading_type(1), 'NATIVE', MPI_INFO_NULL, ierr )

      CALL MPI_FILE_READ_ALL ( fid,    x,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid,    y,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid,    z,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid, varg, nvar, reading_type(2), MPI_STATUS_IGNORE, ierr )

      IF ( itimeint >= 2 ) CALL MPI_FILE_READ_ALL ( fid,   qold, nvar, reading_type(2), MPI_STATUS_IGNORE, ierr )

The first three READ_ALL statements, each of which reads "1" element of the 3-dimensional datatype, work fine; the issue occurs in the fourth statement, where "nvar" elements of the 3-dimensional datatype are read into a 4-D array. I am able to make it work by replacing the fourth (and fifth) READ_ALL with individual statements for each 3-D entry:

      CALL MPI_FILE_SET_VIEW ( fid, rstdisp, MPI_RP, reading_type(1), 'NATIVE', MPI_INFO_NULL, ierr )

      CALL MPI_FILE_READ_ALL ( fid,    x,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid,    y,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid,    z,    1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,1), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,2), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,3), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,4), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,5), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      IF ( itimeint >= 2 ) THEN

        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,1), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,2), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,3), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,4), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

        CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,5), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )

      END IF

It may be that my example code does not generate a large enough dataset to exhibit the error on Coffee Lake, or that it does not have enough overlap in the data. But hopefully it is still useful for tracking down the issue.

 

Matt

SantoshY_Intel
Moderator

Hi,


We have reported this issue to the development team; they are looking into it.


Thanks & Regards,

Santosh


SantoshY_Intel
Moderator

Hi,

 

This issue will be fixed in a future Intel oneAPI release. In the meantime, I_MPI_FABRICS=ofi can be used as a workaround.

 

Use the command below to run your program:

I_MPI_FABRICS=ofi mpirun -np 9 ./a.out 
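Equivalently (a standard shell pattern, not specific to this workaround), the variable can be exported once so it applies to every subsequent launch in the session:

```shell
# Select the OFI fabric for all subsequent mpirun invocations
export I_MPI_FABRICS=ofi
mpirun -np 9 ./a.out
```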

 

If this resolves your issue, please accept it as a solution; this helps others with similar issues. Thank you!

 

Best Regards,

Santosh

 

 

SantoshY_Intel
Moderator

Hi,


I assume that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks & Regards,

Santosh

