Intel® MPI Library

IntelMPI deadlock on NFS

Diehl__Martin
Novice

Hi,

I got a deadlock when using Intel MPI on an NFS-mounted share. I am running a Fortran program that uses HDF5. The error stack is quite long, and it might be related to a bug in the HDF5 library. However, since it does not occur with other MPI implementations or on a locally mounted drive, I assume it is directly related to Intel MPI.

The trace is below. The most relevant information, "ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error", is at the bottom.

HDF5-DIAG: Error detected in HDF5 (1.12.0) MPI-process 1:
  #000: H5Dio.c line 314 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5VLcallback.c line 2186 in H5VL_dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #002: H5VLcallback.c line 2152 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: H5VLnative_dataset.c line 207 in H5VL__native_dataset_write(): can't write data
    major: Dataset
    minor: Write failed
  #004: H5Dio.c line 781 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #005: H5Dmpio.c line 735 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
    major: Low-level I/O
    minor: Write failed
  #006: H5Dmpio.c line 2081 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #007: H5Dmpio.c line 2125 in H5D__final_collective_io(): optimized write failed
    major: Dataset
    minor: Write failed
  #008: H5Dmpio.c line 491 in H5D__mpio_select_write(): can't finish collective parallel write
    major: Low-level I/O
    minor: Write failed
  #009: H5Fio.c line 206 in H5F_shared_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #010: H5PB.c line 1032 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #011: H5Faccum.c line 827 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #012: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #013: H5FDmpio.c line 1467 in H5FD__mpio_write(): MPI_File_write_at_all failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #014: H5FDmpio.c line 1467 in H5FD__mpio_write(): Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
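
For readers skimming a stack like the one above, the following is a minimal Python sketch that pulls out the deepest HDF5 frame and any ROMIO/ADIO line from an HDF5-DIAG dump. The regular expressions are assumptions derived only from the layout of this particular trace, the helper name `summarize_hdf5_trace` is hypothetical, and `SAMPLE` is an excerpt of the last frames copied from the post.

```python
import re

# Frame lines look like:
#   "  #013: H5FDmpio.c line 1467 in H5FD__mpio_write(): MPI_File_write_at_all failed"
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+):\s+(?P<file>\S+)\s+line\s+(?P<line>\d+)\s+in\s+"
    r"(?P<func>\w+)\(\):\s*(?P<message>.*)$"
)
# MPI-IO layer lines look like:
#   "ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error"
# The number in parentheses is kept as an opaque "location" string.
ADIO_RE = re.compile(
    r"^(?P<routine>ADIOI?_\w+)\((?P<location>\d+)\):\s*(?P<message>.*)$"
)


def summarize_hdf5_trace(text):
    """Return (frames, adio_errors) parsed from an HDF5-DIAG error stack."""
    frames, adio_errors = [], []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw)
        if match:
            frames.append(match.groupdict())
            continue
        match = ADIO_RE.match(raw.strip())
        if match:
            adio_errors.append(match.groupdict())
    return frames, adio_errors


# Excerpt of the trace posted above (last three frames only).
SAMPLE = """\
  #012: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #013: H5FDmpio.c line 1467 in H5FD__mpio_write(): MPI_File_write_at_all failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #014: H5FDmpio.c line 1467 in H5FD__mpio_write(): Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(53): Other I/O error Input/output error
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
"""

if __name__ == "__main__":
    frames, adio_errors = summarize_hdf5_trace(SAMPLE)
    deepest = frames[-1]
    print(f"deepest HDF5 frame: #{deepest['index']} {deepest['func']} "
          f"({deepest['file']}:{deepest['line']})")
    for err in adio_errors:
        print(f"MPI-IO layer: {err['routine']}({err['location']}): {err['message']}")
```

On this trace the deepest HDF5 frame is the MPI-IO file driver, and the only line below it comes from the NFS-specific write path of the MPI-IO layer, which is consistent with the failure appearing only on the NFS share.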

 

I am using Intel Parallel Studio XE 2020 Update 2 Cluster Edition.

 

@IntelTeam: Please move this to the appropriate topic if required; I could find neither a Parallel Studio nor an Intel MPI topic.

 

 

9 Replies
AbhishekD_Intel
Moderator

Hi Martin,


Thanks for reaching out to us.

Could you please give us more details about your environment, such as which OS you are using and where you have installed PSXE 2020u2?

It seems that you are using HDF5 1.12.0, so we would also like to know whether you are building HDF5 yourself with Intel MPI and Fortran.

If you have already installed HDF5, please verify whether it was configured with the --enable-cxx, --enable-fortran, and --enable-parallel options.
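
One way to check this on an existing installation is to read the build summary that HDF5 writes at configure time. The Python sketch below assumes that summary is available as plain "Key: value" text (HDF5 typically installs it as `libhdf5.settings` in its library directory; the exact location on a given system is an assumption). The helper names are hypothetical, and `SAMPLE_SETTINGS` is a hand-written excerpt in the style of that file, not output from the system discussed here.

```python
def parse_hdf5_settings(text):
    """Map 'Key: value' lines of an HDF5 build summary to a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip section headers ("Features:") and separator lines ("-----").
        if sep and key.strip() and value.strip():
            settings[key.strip()] = value.strip()
    return settings


def check_features(settings, wanted=("Parallel HDF5", "Fortran", "C++")):
    """Return {feature: True/False/None}; None means the key was not found."""
    result = {}
    for name in wanted:
        value = settings.get(name)
        result[name] = None if value is None else value.lower().startswith("yes")
    return result


# Hand-written excerpt for illustration only.
SAMPLE_SETTINGS = """\
Languages:
----------
                        Fortran: yes
                            C++: no

Features:
---------
                  Parallel HDF5: yes
"""

if __name__ == "__main__":
    features = check_features(parse_hdf5_settings(SAMPLE_SETTINGS))
    for name, enabled in features.items():
        state = "not reported" if enabled is None else ("yes" if enabled else "no")
        print(f"{name}: {state}")
```

To use it on a real installation, replace `SAMPLE_SETTINGS` with the contents of the settings file from the HDF5 build in question.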


We tried building HDF5 on NFS with Intel MPI and Fortran and did not see any errors during the build. We also ran some Fortran samples that use HDF5; all of them executed successfully without errors.


Please give us the above details and the procedure you followed to get those error logs.



Warm Regards,

Abhishek


Diehl__Martin
Novice

Hi Abhishek,

I'm using Ubuntu 20.04 on an Intel(R) Xeon(R) CPU E5-2687W 0.

HDF5 is indeed 1.12.0. Installation was not a problem, but I'm quite sure that I did not run 'make test'. The configure flags were '--enable-parallel --enable-fortran --enable-build-mode=production'; -fPIC was also used.

It seems that the problem is related to NFS. This has been reported before. Everything works fine on local drives (and apparently also with other MPI implementations like Open MPI and MPICH).
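
Since the behaviour depends on whether the output file lives on NFS or on a local drive, it can help to record the filesystem type and mount options of the output directory alongside a report like this. The Python sketch below does that by parsing `/proc/mounts`-style text on Linux. The helper names are hypothetical, and `SAMPLE_MOUNTS` (including the server name and paths) is made up for illustration.

```python
import posixpath


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fstype, options) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # /proc/mounts encodes spaces in paths as the octal escape \040.
            mount_point = fields[1].replace("\\040", " ")
            mount_point = mount_point.rstrip("/") or "/"
            entries.append((mount_point, fields[2], fields[3].split(",")))
    return entries


def is_under(path, mount_point):
    """True if path equals mount_point or lies below it."""
    if mount_point == "/":
        return True
    return path == mount_point or path.startswith(mount_point + "/")


def find_mount(path, entries):
    """Return the entry with the longest mount point containing the absolute path."""
    path = posixpath.normpath(path)
    best = None
    for entry in entries:
        # ">=" lets a later entry win a tie, since later mounts shadow earlier ones.
        if is_under(path, entry[0]) and (best is None or len(entry[0]) >= len(best[0])):
            best = entry
    return best


def is_nfs(entry):
    """True for filesystem types such as 'nfs' and 'nfs4'."""
    return entry is not None and entry[1].startswith("nfs")


# Made-up mount table for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
server:/export/home /home nfs4 rw,relatime,vers=4.2,hard 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    entries = parse_mounts(SAMPLE_MOUNTS)
    for path in ("/home/user/run/out.h5", "/tmp/out.h5"):
        mount_point, fstype, options = find_mount(path, entries)
        print(f"{path}: {fstype} on {mount_point}, NFS={is_nfs((mount_point, fstype, options))}, "
              f"options={','.join(options)}")
```

On a real system, pass the contents of `/proc/mounts` instead of `SAMPLE_MOUNTS` and an absolute path to the HDF5 output file. The mount options are printed because ROMIO's documentation has, to my knowledge, recommended disabling NFS attribute caching (`noac`) for MPI-IO; whether that applies to the Intel MPI setup in this thread is not established here.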

 

best regards,

Martin

AbhishekD_Intel
Moderator

Hi Martin,


Please try running make test on the MPI-related samples, and tell us where you have seen this issue reported so that it is easier for us to reproduce.



Warm Regards,

Abhishek


Diehl__Martin
Novice
AbhishekD_Intel
Moderator

Thanks for the details.

We are forwarding this issue to our subject matter expert (SME), who will guide you further.



Warm Regards,

Abhishek


Vinutha_SV
Moderator

Hi,

We are unable to reproduce this on our end. Could you provide a reproducer?


Diehl__Martin
Novice

Many thanks for looking into this, but I cannot reproduce it either. However, our system setup has also changed; it has been more than half a year.

AbhishekD_Intel
Moderator

Hi Martin,


Please let us know if we can close this thread, as the issue can no longer be reproduced.

Also, please open a new thread if you encounter a similar issue.



Warm Regards,

Abhishek


AbhishekD_Intel
Moderator

Hi,

 

As we haven't heard back from you for a long time, we will no longer monitor this thread. If you require any additional assistance from Intel, please start a new thread.

 

Warm Regards,

Abhishek

 
