IntelMPI deadlock on NFS

Hi,

I got a deadlock when using Intel MPI on an NFS-mounted share. I am running a Fortran program that uses HDF5. The error stack is quite long, and the failure might be related to a bug in the HDF5 library. However, since it does not occur with other MPI implementations or on a locally mounted drive, I assume it is directly related to Intel MPI.

The trace is shown below (the most relevant information, ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error, is at the bottom):

HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 1:
  #000: H5Dio.c line 314 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5VLcallback.c line 2186 in H5VL_dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #002: H5VLcallback.c line 2152 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: H5VLnative_dataset.c line 207 in H5VL__native_dataset_write(): can't write data
    major: Dataset
    minor: Write failed
  #004: H5Dio.c line 781 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #005: H5Dmpio.c line 735 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
    major: Low-level I/O
    minor: Write failed
  #006: H5Dmpio.c line 2081 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #007: H5Dmpio.c line 2125 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #008: H5Dmpio.c line 491 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #009: H5Fio.c line 206 in H5F_shared_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #010: H5PB.c line 1032 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #011: H5Faccum.c line 827 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #012: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #013: H5FDmpio.c line 1467 in H5FD__mpio_write(): MPI_File_write_at_all failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #014: H5FDmpio.c line 1467 in H5FD__mpio_write(): Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
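
For reference, here is a minimal sketch of the kind of collective write my program performs. The file path, dataset name, and sizes are made up for illustration (this is not my actual code), but it goes through the same MPI_File_write_at_all path as frame #013 above:

! Minimal sketch: collective parallel HDF5 write, one slab per rank into a
! single contiguous 1-D dataset. Path, dataset name, sizes are illustrative.
program h5_nfs_repro
  use mpi
  use hdf5
  implicit none

  integer, parameter :: n_local = 1024            ! elements per rank
  integer :: mpierr, hdferr, rank, nprocs
  integer(HID_T)   :: fapl, dxpl, file_id, filespace, memspace, dset_id
  integer(HSIZE_T) :: dims(1), count(1), start(1)
  integer :: buf(n_local)

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, mpierr)
  call h5open_f(hdferr)

  ! File access property list selecting the MPI-IO driver
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, hdferr)
  call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)

  ! Create the file on the NFS-mounted share (path is made up)
  call h5fcreate_f('/nfs/scratch/repro.h5', H5F_ACC_TRUNC_F, file_id, &
                   hdferr, access_prp=fapl)

  ! One contiguous dataset; each rank selects its own hyperslab
  dims(1) = int(n_local, HSIZE_T) * nprocs
  call h5screate_simple_f(1, dims, filespace, hdferr)
  call h5dcreate_f(file_id, 'data', H5T_NATIVE_INTEGER, filespace, &
                   dset_id, hdferr)

  count(1) = n_local
  start(1) = int(rank, HSIZE_T) * n_local
  call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, count, hdferr)
  call h5screate_simple_f(1, count, memspace, hdferr)

  ! Collective transfer mode -> MPI_File_write_at_all (frame #013)
  call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, hdferr)
  call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, hdferr)

  buf = rank
  call h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, buf, count, hdferr, &
                  mem_space_id=memspace, file_space_id=filespace, &
                  xfer_prp=dxpl)

  call h5dclose_f(dset_id, hdferr)
  call h5sclose_f(memspace, hdferr)
  call h5sclose_f(filespace, hdferr)
  call h5pclose_f(dxpl, hdferr)
  call h5pclose_f(fapl, hdferr)
  call h5fclose_f(file_id, hdferr)
  call h5close_f(hdferr)
  call MPI_Finalize(mpierr)
end program h5_nfs_repro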

I am using Parallel Studio XE 2020 Update 2 Cluster Edition.

@IntelTeam: Please move this to the appropriate topic if required; I could find neither a Parallel Studio nor an Intel MPI topic.

Hi Martin,

Thanks for reaching out to us.

Could you please give us more details about your environment, such as which OS you are using and where you have installed PSXE 2020u2?

It seems that you are using HDF5 1.12.0, so we would also like to know whether you are building HDF5 yourself with Intel MPI and Fortran.

If you have already installed HDF5, please verify whether you used the --enable-cxx, --enable-fortran, and --enable-parallel configure options.

We have tried building HDF5 on NFS with Intel MPI and Fortran and did not see any errors during the build. We also ran some Fortran samples that use HDF5; all of them executed successfully.

Please give us the above details, along with the procedure you followed that produced those error logs.

Warm Regards,

Abhishek

Hi Abhishek,

I'm using Ubuntu 20.04 on an Intel(R) Xeon(R) CPU E5-2687W 0.

HDF5 is indeed 1.12.0. Installation was not a problem, but I'm quite sure that I did not run 'make test'. The configure flags were '--enable-parallel --enable-fortran --enable-build-mode=production'; -fPIC was also used.

It seems that the problem is related to NFS, and it has been reported before. Everything works fine on local drives (and apparently also with other MPI implementations such as Open MPI and MPICH).
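
One thing I have not tried yet is passing ROMIO hints to the driver through the MPI Info object that HDF5 forwards to MPI-IO. A sketch of the idea is below; 'romio_ds_write' is a standard ROMIO hint, but whether disabling data sieving actually avoids the NFS problem here is only a guess on my part, and the file path is made up:

! Sketch: disable ROMIO's data sieving on writes via an MPI Info object.
! Whether this helps on this NFS setup is untested; path is illustrative.
program h5_nfs_hints
  use mpi
  use hdf5
  implicit none

  integer :: info, mpierr, hdferr
  integer(HID_T) :: fapl, file_id

  call MPI_Init(mpierr)
  call h5open_f(hdferr)

  ! ROMIO hint: 'romio_ds_write' = 'disable' turns off data sieving for
  ! writes (a guessed workaround, not a confirmed fix)
  call MPI_Info_create(info, mpierr)
  call MPI_Info_set(info, 'romio_ds_write', 'disable', mpierr)

  ! HDF5 forwards the Info object to the MPI-IO driver via the fapl
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, hdferr)
  call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, info, hdferr)
  call h5fcreate_f('/nfs/scratch/hints.h5', H5F_ACC_TRUNC_F, file_id, &
                   hdferr, access_prp=fapl)

  call h5fclose_f(file_id, hdferr)
  call h5pclose_f(fapl, hdferr)
  call MPI_Info_free(info, mpierr)
  call h5close_f(hdferr)
  call MPI_Finalize(mpierr)
end program h5_nfs_hints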

Best regards,

Martin

Hi Martin,

Please try running 'make test' on the MPI-related samples, and please also give us details on where you have seen this reported before, so that it will be easier for us to reproduce the issue.

Warm Regards,

Abhishek

Dear Abhishek,

Unfortunately, HDF5 was installed long ago, so I can't run 'make test' anymore.

I found the following reports on deadlocks with NFS and/or HDF5:

https://bitbucket.org/fathomteam/moab/issues/92/deadlock-native-hdf5-2-ranks-intel-mpi

https://forum.hdfgroup.org/t/hang-for-mpi-hdf5-in-parallel-on-an-nfs-system/6541

https://forum.hdfgroup.org/t/hdf5-files-on-nfs/3985

Best regards,

Martin