I get a deadlock when using Intel MPI on an NFS-mounted share. I am running a Fortran program that uses HDF5. The error stack is quite long, and it might be related to a bug in the HDF5 library. However, since it does not occur with other MPI implementations or on a locally mounted drive, I assume it is directly related to Intel MPI.
The trace follows (the most relevant information, ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error, is at the bottom):
HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 1:
  #000: H5Dio.c line 314 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5VLcallback.c line 2186 in H5VL_dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #002: H5VLcallback.c line 2152 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: H5VLnative_dataset.c line 207 in H5VL__native_dataset_write(): can't write data
    major: Dataset
    minor: Write failed
  #004: H5Dio.c line 781 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #005: H5Dmpio.c line 735 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
    major: Low-level I/O
    minor: Write failed
  #006: H5Dmpio.c line 2081 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #007: H5Dmpio.c line 2125 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #008: H5Dmpio.c line 491 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #009: H5Fio.c line 206 in H5F_shared_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #010: H5PB.c line 1032 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #011: H5Faccum.c line 827 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #012: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #013: H5FDmpio.c line 1467 in H5FD__mpio_write(): MPI_File_write_at_all failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #014: H5FDmpio.c line 1467 in H5FD__mpio_write(): Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
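Since the stack is long, a small script can pull out the individual frames and show the deepest one, which is where the MPI-IO error surfaces. The following is only a sketch: the regular expression is an assumption based on the frame layout visible in the trace above ("#NNN: file line N in function(): message"), and the helper names parse_frames and deepest_frame are made up for illustration.

```python
import re

# Assumed frame layout, taken from the trace above:
#   #NNN: <file> line <N> in <function>(): <message>
FRAME_RE = re.compile(r"#(\d{3}): (\S+) line (\d+) in ([^:]+?)\(\): ")


def parse_frames(stack_text):
    """Return one dict per HDF5-DIAG frame found in stack_text."""
    frames = []
    for match in FRAME_RE.finditer(stack_text):
        frames.append(
            {
                "index": int(match.group(1)),
                "file": match.group(2),
                "line": int(match.group(3)),
                "function": match.group(4),
            }
        )
    return frames


def deepest_frame(stack_text):
    """Return the frame with the highest index, or None if none matched."""
    frames = parse_frames(stack_text)
    if not frames:
        return None
    return max(frames, key=lambda frame: frame["index"])
```

Applied to the trace above, deepest_frame should report frame #014 in H5FD__mpio_write, the last HDF5 call before the ADIOI_NFS_WRITECONTIG error.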
I am using Parallel Studio XE 2020 Update 2 Cluster Edition.
@IntelTeam: Please move this to the appropriate topic if required; I could find neither a Parallel Studio nor an Intel MPI topic.
Thanks for reaching out to us.
Could you please give us more details about your environment, such as which OS you are using and where you have installed PSXE 2020u2?
It seems that you are using HDF5 1.12.0, so we would also like to know whether you built HDF5 with Intel MPI and Fortran.
If you have already installed HDF5, please verify whether you used the --enable-cxx, --enable-fortran, and --enable-parallel features.
We have tried building HDF5 on NFS with Intel MPI and Fortran and did not see any errors during the build. We also ran some Fortran samples that use HDF5, and all of them executed successfully without errors.
So please give us the above details and the procedure that you followed for getting those error logs.
I'm using Ubuntu 20.04 on an Intel(R) Xeon(R) CPU E5-2687W 0.
HDF5 is indeed 1.12.0. Installation was not a problem, but I'm quite sure that I did not run 'make test'. The configure flags are '-enable-parallel --enable-fortran --enable-build-mode=production', -fPIC was also used.
It seems that the problem is related to NFS, and this has been reported before. Everything works fine on local drives (and apparently also with other MPI implementations such as Open MPI and MPICH).
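To double-check that a given output directory really sits on an NFS mount rather than a local drive, something like the following could be used. This is a minimal sketch, assuming a Linux system where the mount table is available in /proc/mounts format (device, mount point, filesystem type, options); the function names filesystem_type and is_nfs are made up for illustration, and mount points containing spaces are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains path.

    mounts_text is text in /proc/mounts format:
        device mount_point fs_type options dump pass
    The longest matching mount point wins; for equal mount points,
    the later entry wins.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so /mnt/share does not match /mnt/shared.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_nfs(path, mounts_text):
    """True if path lives on a mount whose type starts with 'nfs' (nfs, nfs4)."""
    fs_type = filesystem_type(path, mounts_text)
    return fs_type is not None and fs_type.startswith("nfs")
```

On the machine in question, the mount table can be read with open("/proc/mounts").read() and passed as mounts_text together with the directory that holds the HDF5 output file.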
Please try running make test on the MPI-related samples, and also tell us where you have seen this issue reported, so that it becomes easier for us to reproduce it.
Unfortunately, HDF5 was installed long ago, so I can't run 'make test' anymore.
I found the following reports on deadlocks with NFS and/or HDF5