I have an MPI application in Fortran that uses the HDF5 library. I am currently using version 2019.9.304 of Intel MPI (on RHEL 7.7).
On Lustre filesystems I set the environment variable I_MPI_EXTRA_FILESYSTEM=1 so that MPI enables its Lustre filesystem support and optimizations. This usually works fine.
In one particular case I hit a problem deep inside the HDF5 library, and further down in the MPI library, when calling h5fclose_f after having opened a file, created two groups, and written a few datasets. The application uses the HDF5 library in collective I/O mode.
On one rank (rank 0) I get:
Request pending due to failure, error stack:
PMPI_Waitall(346): MPI_Waitall(count=2, req_array=0x5581dc40, status_array=0x1) failed
PMPI_Waitall(322): The supplied request in array element 0 was invalid (kind=4)
On two other ranks I get:
Request pending due to failure, error stack:
PMPI_Waitall(346): MPI_Waitall(count=1, req_array=0x529fe830, status_array=0x1) failed
PMPI_Waitall(322): The supplied request in array element 0 was invalid (kind=0)
The other ranks are fine, with no errors. Note the different "kind=" values in the error messages.
With I_MPI_EXTRA_FILESYSTEM turned off, everything works fine.
I can create a full backtrace from a custom MPI error handler. The backtrace is the same on all three failing ranks:
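For anyone comparing the error stacks across ranks, here is a minimal Python sketch (not part of the original report) that pulls the request count, the failing array element, and the "kind" value out of the two stacks quoted above. The function name and the regular expressions are my own assumptions based only on the message format shown here; other MPI versions may format these lines differently.

```python
import re

# Patterns written against the two error stacks quoted in this thread.
# They are assumptions about the message format, not an official grammar.
WAITALL_CALL = re.compile(
    r"MPI_Waitall\(count=(\d+), req_array=(0x[0-9a-fA-F]+), "
    r"status_array=(0x[0-9a-fA-F]+)\)"
)
INVALID_REQ = re.compile(
    r"request in array element (\d+) was invalid \(kind=(\d+)\)"
)


def parse_waitall_error(text):
    """Return count, failing element and kind from one error stack, or None."""
    call = WAITALL_CALL.search(text)
    bad = INVALID_REQ.search(text)
    if call is None or bad is None:
        return None
    return {
        "count": int(call.group(1)),
        "element": int(bad.group(1)),
        "kind": int(bad.group(2)),
    }


# Error stacks copied verbatim from the report.
RANK0_STACK = """Request pending due to failure, error stack:
PMPI_Waitall(346): MPI_Waitall(count=2, req_array=0x5581dc40, status_array=0x1) failed
PMPI_Waitall(322): The supplied request in array element 0 was invalid (kind=4)"""

OTHER_STACK = """Request pending due to failure, error stack:
PMPI_Waitall(346): MPI_Waitall(count=1, req_array=0x529fe830, status_array=0x1) failed
PMPI_Waitall(322): The supplied request in array element 0 was invalid (kind=0)"""

if __name__ == "__main__":
    for label, stack in (("rank 0", RANK0_STACK), ("other ranks", OTHER_STACK)):
        print(label, parse_waitall_error(stack))
```

Running this over the per-rank logs makes it easy to see at a glance which ranks fail and whether they differ in request count or kind.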
#2 0x7f86b92d39a9 in MPIR_Err_return_comm at ../../src/mpi/errhan/errutil.c:321
#3 0x7f86b9a3bf3a in PMPI_Waitall at ../../src/mpi/request/waitall.c:351
#4 0x7f86b901ec4a in ADIOI_LUSTRE_W_Exchange_data at ../../../../../src/mpi/romio/adio/ad_lustre/ad_lustre_wrcoll.c:952
#5 0x7f86b901d997 in ADIOI_LUSTRE_Exch_and_write at ../../../../../src/mpi/romio/adio/ad_lustre/ad_lustre_wrcoll.c:642
#6 0x7f86b901c52f in ADIOI_LUSTRE_WriteStridedColl at ../../../../../src/mpi/romio/adio/ad_lustre/ad_lustre_wrcoll.c:322
#7 0x7f86ba1f0bd7 in MPIOI_File_write_all at ../../../../../src/mpi/romio/mpi-io/write_all.c:114
#8 0x7f86ba1f0cbb in PMPI_File_write_at_all at ../../../../../src/mpi/romio/mpi-io/write_atall.c:58
#9 0x7f86bbd224ad in H5FD_mpio_write at /opt/hdf5/1.10.7/source/src/H5FDmpio.c:1636
#10 0x7f86bba7288f in H5FD_write at /opt/hdf5/1.10.7/source/src/H5FDint.c:248
#11 0x7f86bba3c094 in H5F__accum_write at /opt/hdf5/1.10.7/source/src/H5Faccum.c:823
#12 0x7f86bbbd89df in H5PB_write at /opt/hdf5/1.10.7/source/src/H5PB.c:1031
#13 0x7f86bba4ac6d in H5F_block_write at /opt/hdf5/1.10.7/source/src/H5Fio.c:160
#14 0x7f86bbd11781 in H5C__collective_write at /opt/hdf5/1.10.7/source/src/H5Cmpio.c:1109
#15 0x7f86bbd13223 in H5C_apply_candidate_list at /opt/hdf5/1.10.7/source/src/H5Cmpio.c:402
#16 0x7f86bbd0e960 in H5AC__rsp__dist_md_write__flush at /opt/hdf5/1.10.7/source/src/H5ACmpio.c:1707
#17 0x7f86bbd10651 in H5AC__run_sync_point at /opt/hdf5/1.10.7/source/src/H5ACmpio.c:2181
#18 0x7f86bbd10739 in H5AC__flush_entries at /opt/hdf5/1.10.7/source/src/H5ACmpio.c:2324
#19 0x7f86bb93a2e3 in H5AC_flush at /opt/hdf5/1.10.7/source/src/H5AC.c:740
#20 0x7f86bba406fe in H5F__flush_phase2 at /opt/hdf5/1.10.7/source/src/H5Fint.c:1988
#21 0x7f86bba4344c in H5F__dest at /opt/hdf5/1.10.7/source/src/H5Fint.c:1255
#22 0x7f86bba44266 in H5F_try_close at /opt/hdf5/1.10.7/source/src/H5Fint.c:2345
#23 0x7f86bba44727 in H5F__close_cb at /opt/hdf5/1.10.7/source/src/H5Fint.c:2172
#24 0x7f86bbb04868 in H5I_dec_ref at /opt/hdf5/1.10.7/source/src/H5I.c:1261
#25 0x7f86bbb04956 in H5I_dec_app_ref at /opt/hdf5/1.10.7/source/src/H5I.c:1306
#26 0x7f86bba43e9b in H5F__close at /opt/hdf5/1.10.7/source/src/H5Fint.c:2112
#27 0x7f86bba33147 in H5Fclose at /opt/hdf5/1.10.7/source/src/H5F.c:594
#28 0x9bcb00 in h5fclose_c at /opt/hdf5/1.10.7/source/fortran/src/H5Ff.c:476
#29 0x99c42a in __h5f_MOD_h5fclose_f at /opt/hdf5/1.10.7/source/fortran/src/H5Fff.F90:575
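To see where in the software stack the failure sits, the backtrace above can be grouped by library layer. The following is a rough Python sketch under my own assumptions: the frame regex matches the "#N 0xADDR in FUNC at PATH:LINE" layout shown in this thread, and the layer names are just labels I derived from the source paths (for example, anything under "/romio/" is treated as the MPI-IO layer). Only a handful of frames from the report are included as sample input.

```python
import re

# Matches one frame of the form "#N 0xADDR in FUNC at PATH:LINE".
# Whitespace (including line breaks) between fields is tolerated.
FRAME = re.compile(r"#(\d+)\s+(0x[0-9a-f]+)\s+in\s+(\S+)\s+at\s+(\S+):(\d+)")


def layer_of(path):
    """Guess the library layer from the source path (labels are illustrative)."""
    if "/romio/" in path:
        return "ROMIO (MPI-IO)"
    if "/src/mpi/" in path:
        return "MPI core"
    if "/fortran/" in path:
        return "HDF5 Fortran wrapper"
    if "hdf5" in path:
        return "HDF5 C library"
    return "other"


def parse_frames(text):
    """Return a list of dicts, one per frame found in the backtrace text."""
    frames = []
    for num, _addr, func, path, line in FRAME.findall(text):
        frames.append({
            "frame": int(num),
            "function": func,
            "line": int(line),
            "layer": layer_of(path),
        })
    return frames


# A few frames copied verbatim from the backtrace in this thread.
SAMPLE = (
    "#3 0x7f86b9a3bf3a in PMPI_Waitall at ../../src/mpi/request/waitall.c:351 "
    "#4 0x7f86b901ec4a in ADIOI_LUSTRE_W_Exchange_data at "
    "../../../../../src/mpi/romio/adio/ad_lustre/ad_lustre_wrcoll.c:952 "
    "#9 0x7f86bbd224ad in H5FD_mpio_write at "
    "/opt/hdf5/1.10.7/source/src/H5FDmpio.c:1636 "
    "#29 0x99c42a in __h5f_MOD_h5fclose_f at "
    "/opt/hdf5/1.10.7/source/fortran/src/H5Fff.F90:575"
)

if __name__ == "__main__":
    for f in parse_frames(SAMPLE):
        print(f["frame"], f["layer"], f["function"])
```

On the full trace this shows the failing MPI_Waitall being called from the Lustre-specific collective write path in ROMIO (ad_lustre_wrcoll.c), which HDF5 reaches while flushing metadata during file close.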
Do any of you have any ideas on the source of this error? I have a vague feeling that this could be a bug in the MPI implementation, but it is only a feeling. The obvious workaround is to unset I_MPI_EXTRA_FILESYSTEM, but then there is no parallel I/O any more...
Could you please provide a sample reproducer of your program, so that we can reproduce the issue on our Lustre file system and confirm whether it is a bug?
I'll try to, but I cannot guarantee that I'll manage it. The failing code takes a geometry defined by triangles and intersects it with a Cartesian mesh (140 M cells) while the geometry is rotated step by step. Information on the intersections is then stored in the HDF5 file at each step, by writing new datasets into the file. Datasets are never deleted. Each time a step has finished computing, the HDF5 file is opened for writing/appending and then closed completely, so that the user can stop the process without compromising data.
The problem arises not at the first step, but after a while. At this stage the HDF5 file is ~6 GB or so. I can start the process at an arbitrary step, but when I start it at the failing step, or a step or two before it, there is no problem. The problem thus only seems to arise if the existing HDF5 file that is opened already has certain data structures/shape/size.
I'll see if I'm able to reproduce it, but I cannot promise anything.
Thanks for understanding.
It would help us a lot if you could provide a simple sample reproducer that contains at least one of the data structures/shapes and the HDF5 file commands, so that we can understand more about the error.
I have since learned that there is no need to set I_MPI_EXTRA_FILESYSTEM in the latest MPI versions, because they support Lustre file systems natively. As stated in the Intel® MPI Library Release Notes for Linux* OS:
"Parallel file systems (GPFS, Lustre, Panfs) are supported natively, removed bindings libraries (removed I_MPI_EXTRA_FILESYSTEM*, I_MPI_LUSTRE* variables)."
You mentioned that disabling I_MPI_EXTRA_FILESYSTEM disables parallel I/O. As long as you use MPI-IO, there should be no problem. If you still see a performance drop when I_MPI_EXTRA_FILESYSTEM is not enabled, please let us know.
We are closing this thread, assuming your issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread.
Any further interaction in this thread will be considered community only.