The latest Intel MPI on the devCloud appears to have an MPI-IO issue somehow related to reading overlapped data with MPI_FILE_READ_ALL. I'm getting the following segmentation fault when running the attached program on 9 or more processors:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
a.out 00000000004064EA Unknown Unknown Unknown
libpthread-2.31.s 00007FDCC59923C0 Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC4772850 Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC45AE6B9 Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC481CBEF Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC490D7C9 Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC490613D Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC426B023 Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC482974A PMPI_Waitall Unknown Unknown
libmpi.so.12.0.0 00007FDCC4115803 Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC411443D Unknown Unknown Unknown
libmpi.so.12.0.0 00007FDCC4B69324 PMPI_File_read_al Unknown Unknown
libmpifort.so.12. 00007FDCC626F458 pmpi_file_read_al Unknown Unknown
a.out 000000000040535C Unknown Unknown Unknown
a.out 0000000000404922 Unknown Unknown Unknown
libc-2.31.so 00007FDCC57B20B3 __libc_start_main Unknown Unknown
a.out 000000000040482E Unknown Unknown Unknown
The program simply writes a data file with integers and double precision reals, and then reads it back in again. This mimics what the actual code I am trying to use on the devCloud does. During both the write and the read, some of the processors' data regions overlap in the file.
Compile with:
ifx test_mpiio3.f90 -lmpifort
Run with:
mpirun -np 9 ./a.out
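For context, a much-simplified sketch of the structure described above is shown here. This is not the attached test_mpiio3.f90 itself: it is a hypothetical example that only handles double precision values through a plain element view (the real test also writes integers and uses derived datatypes), but it shows the overlapped collective write followed by the overlapped MPI_FILE_READ_ALL.
PROGRAM mpiio_overlap_sketch
   USE mpi
   IMPLICIT NONE
   ! Invented block size and overlap; each rank's file region overlaps its
   ! neighbour's by "nover" double precision values.
   INTEGER, PARAMETER :: nloc = 1000, nover = 10
   INTEGER :: ierr, rank, fh
   INTEGER(KIND=MPI_OFFSET_KIND) :: disp
   DOUBLE PRECISION :: buf(nloc)

   CALL MPI_INIT ( ierr )
   CALL MPI_COMM_RANK ( MPI_COMM_WORLD, rank, ierr )
   buf = DBLE(rank)

   ! Byte displacement: each rank starts nover elements before the previous
   ! rank's region ends, so neighbouring regions overlap in the file.
   disp = INT(rank, MPI_OFFSET_KIND) * (nloc - nover) * 8_MPI_OFFSET_KIND

   CALL MPI_FILE_OPEN ( MPI_COMM_WORLD, 'test.dat', MPI_MODE_CREATE + MPI_MODE_WRONLY, &
                        MPI_INFO_NULL, fh, ierr )
   CALL MPI_FILE_SET_VIEW ( fh, disp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                            'NATIVE', MPI_INFO_NULL, ierr )
   CALL MPI_FILE_WRITE_ALL ( fh, buf, nloc, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr )
   CALL MPI_FILE_CLOSE ( fh, ierr )

   ! Read the same overlapped regions back collectively.
   CALL MPI_FILE_OPEN ( MPI_COMM_WORLD, 'test.dat', MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr )
   CALL MPI_FILE_SET_VIEW ( fh, disp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                            'NATIVE', MPI_INFO_NULL, ierr )
   CALL MPI_FILE_READ_ALL ( fh, buf, nloc, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr )
   CALL MPI_FILE_CLOSE ( fh, ierr )

   CALL MPI_FINALIZE ( ierr )
END PROGRAM mpiio_overlap_sketch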
Matt
Hi Matt,
Thanks for reaching out to us.
I was able to reproduce the issue on certain devCloud nodes. Could you please try the e-2176g (Coffee Lake) nodes, where the command runs without issues for me? Please use the command below to request a Coffee Lake node.
qsub -l nodes=1:e-2176g:ppn=2 -d . -I
Thanks
Arun
Hi Matt,
Have you tried running the application on a Coffee Lake machine?
Thanks
Arun
Arun,
Yes, I verified that my test program works on Coffee Lake too. Unfortunately, the code I am actually trying to run still does not work on Coffee Lake, so my example apparently does not capture all of the issues. What I can tell you is that I am able to get the main code working by adjusting the MPI_FILE_READ_ALL statements. The original code looks like this:
CALL MPI_FILE_SET_VIEW ( fid, rstdisp, MPI_RP, reading_type(1), 'NATIVE', MPI_INFO_NULL, ierr )
CALL MPI_FILE_READ_ALL ( fid, x, 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, y, 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, z, 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, varg, nvar, reading_type(2), MPI_STATUS_IGNORE, ierr )
IF ( itimeint >= 2 ) CALL MPI_FILE_READ_ALL ( fid, qold, nvar, reading_type(2), MPI_STATUS_IGNORE, ierr )
The first three READ_ALL statements work fine, where "1" element of the 3-dimensional datatype is read; the issue occurs in the fourth statement, where "nvar" elements of the 3-dimensional datatype are read into a 4D array. I am able to make it work by replacing the fourth (and fifth) READ_ALL with individual statements for each 3D entry:
CALL MPI_FILE_SET_VIEW ( fid, rstdisp, MPI_RP, reading_type(1), 'NATIVE', MPI_INFO_NULL, ierr )
CALL MPI_FILE_READ_ALL ( fid, x, 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, y, 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, z, 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,1), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,2), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,3), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,4), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,5), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
IF ( itimeint >= 2 ) THEN
CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,1), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,2), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,3), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,4), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,5), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
END IF
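For reference, the per-slice workaround above can also be written as a loop over the fourth dimension. This is just a sketch; "n" is an extra local INTEGER loop index not present in the original code, and nvar is assumed to match the number of 3D slices (five in the code above).
DO n = 1, nvar
   CALL MPI_FILE_READ_ALL ( fid, varg(:,:,:,n), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
END DO
IF ( itimeint >= 2 ) THEN
   DO n = 1, nvar
      CALL MPI_FILE_READ_ALL ( fid, qold(:,:,:,n), 1, reading_type(2), MPI_STATUS_IGNORE, ierr )
   END DO
END IF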
It may be that my example code does not generate a large enough dataset, or enough overlap in the data, to exhibit the error on Coffee Lake. But hopefully it is still useful for tracking down the issue.
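For readers without the full source: reading_type(2) is the committed 3-dimensional derived datatype mentioned above. Its construction is not shown in this thread; one purely hypothetical way such a type could be built, with invented extent names, is:
! Hypothetical illustration only: one possible construction of a 3-D datatype
! like reading_type(2). All extent names below are invented for illustration.
INTEGER :: sizes(3), subsizes(3), starts(3)
sizes    = (/ nx_mem, ny_mem, nz_mem /)   ! full extents of the local array in memory
subsizes = (/ nx_loc, ny_loc, nz_loc /)   ! the sub-block actually read on this rank
starts   = (/ ix0, iy0, iz0 /)            ! zero-based offsets of that sub-block
CALL MPI_TYPE_CREATE_SUBARRAY ( 3, sizes, subsizes, starts, MPI_ORDER_FORTRAN, &
                                MPI_RP, reading_type(2), ierr )
CALL MPI_TYPE_COMMIT ( reading_type(2), ierr )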
Matt
Hi,
We have reported this issue to the development team, and they are looking into it.
Thanks & Regards,
Santosh
Hi,
Your issue will be fixed in a future Intel oneAPI release. For the time being, I_MPI_FABRICS=ofi can be used as a workaround.
Use the command below to run your program:
I_MPI_FABRICS=ofi mpirun -np 9 ./a.out
If this resolves your issue, please accept this reply as the solution; it will help others with similar issues. Thank you!
Best Regards,
Santosh
Hi,
I assume that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.
Thanks & Regards,
Santosh