## Summary
Setting `I_MPI_FILESYSTEM_FORCE=ufs` with shared-file MPI-IO (N-to-1 pattern) on NFS filesystems causes `ADIO_OPEN` failures and `MPI_Abort` crashes on Intel MPI 2021.17 (oneAPI 2025.3). Additionally, Intel MPI 2021.11 (oneAPI 2024.0) exhibits shared-file ROMIO failures at multi-node scale regardless of environment variable settings. The upstream ROMIO variable `ROMIO_FSTYPE_FORCE="ufs:"` works correctly on 2021.17 as a workaround, but does not resolve the 2021.11 failures.
## Environment
- **OS:** Rocky Linux 8.9
- **Kernel:** 4.18.0 (x86_64)
- **Compiler:** Intel oneAPI 2022 (icc 2021.6.0 / ifort 2021.6.0)
- **Intel MPI versions tested:**
- 2021.10 (oneAPI 2024.0)
- 2021.11 (oneAPI 2024.0) — `Intel(R) MPI Library for Linux* OS, Version 2021.11`
- 2021.17 (oneAPI 2025.3) — `Intel(R) MPI Library for Linux* OS, Version 2021.17 Build 20251215`
- **Filesystem:** NFS v3 over RDMA, `nconnect=32`
- **Cluster:** Up to 8 nodes, 28 cores/node (up to 224 MPI ranks), InfiniBand interconnect (Mellanox ConnectX-6)
- **Job scheduler:** SLURM 25.11.0
- **Reproducers:**
- IOR 4.0.0 (`-a MPIIO` shared-file mode, without `-F`). IOR was compiled and dynamically linked against each Intel MPI version being tested; `ldd` verified that `libmpi.so.12` resolves to the correct version (a quick check is sketched after this list).
- PNetCDF 1.14.1 fandc test (flush-and-close shared-file write), also compiled per-version.
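For reference, the linkage check mentioned in the reproducer notes amounts to something like this (binary names assumed):
```bash
# Verify the IOR binary resolves libmpi.so.12 from the intended oneAPI install
ldd ./ior | grep libmpi.so.12
# Confirm which runtime the environment picks up
mpirun --version
```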
## Steps to Reproduce
### Minimal reproducer using IOR shared-file mode
The bug can be reproduced with as few as 2 nodes. IOR must be compiled against the same Intel MPI version being tested.
```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2
#SBATCH --ntasks-per-node=1
#SBATCH -p <partition>
module load intel/2022 intelmpi/2021.17
# Confirm version
mpirun --version
# Expected: Intel(R) MPI Library for Linux* OS, Version 2021.17 Build 20251215
# Test 1: Baseline shared-file (PASSES)
mpirun -np 2 \
./ior -a MPIIO -e -g -t 1m -b 64m -w \
-o /path/to/nfs/testfile_baseline
# Test 2: I_MPI_FILESYSTEM_FORCE=ufs on shared-file (CRASHES)
mpirun -np 2 \
-env I_MPI_FILESYSTEM_FORCE ufs \
./ior -a MPIIO -e -g -t 1m -b 64m -w \
-o /path/to/nfs/testfile_ufs
# Test 3: ROMIO_FSTYPE_FORCE workaround on shared-file (PASSES)
mpirun -np 2 \
-env ROMIO_FSTYPE_FORCE "ufs:" \
./ior -a MPIIO -e -g -t 1m -b 64m -w \
-o /path/to/nfs/testfile_romio
```
**Key points for reproduction:**
- IOR must be run **without** `-F` (i.e., shared-file / N-to-1 mode, not file-per-process)
- Variables must be passed via `mpirun -env VAR value` (space-separated, not `=`) to ensure propagation to remote ranks under SLURM; a quick propagation check is sketched after this list
- The bug manifests with as few as 2 ranks on 2 nodes; it scales with node count (28 errors at 8 nodes / 224 ranks)
- IOR 4.0.0 was used; any MPI-IO application that opens a shared file via `MPI_File_open` should reproduce this
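As a sanity check (not part of the original test matrix), one can confirm that `-env` actually reaches every rank by launching a trivial command and grepping each rank's environment:
```bash
# Every rank prints its environment; the variable should appear once per rank
mpirun -np 2 -env I_MPI_FILESYSTEM_FORCE ufs /usr/bin/env | grep I_MPI_FILESYSTEM_FORCE
```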
### Alternate reproducer using PNetCDF fandc
```bash
# Build PNetCDF 1.14.1 against Intel MPI 2021.11 or 2021.17
cd pnetcdf-1.14.1/test/fandc/
# Run with I_MPI_FILESYSTEM_FORCE=ufs on NFS mount
mpirun -np 224 \
-env I_MPI_FILESYSTEM_FORCE ufs \
./exe.intel2022_impi2021.17
```
## Observed Behavior
### Intel MPI 2021.17 (oneAPI 2025.3)
| Test Configuration | Scale | ADIO Errors | Outcome |
|---|---|---|---|
| Shared-file, no tuning (baseline) | 8 nodes / 224 PE | **0** | PASS — 4.1 GB/s write, 5.8 GB/s read |
| Shared-file + `I_MPI_FILESYSTEM_FORCE=ufs` | 2 nodes / 2 PE | **2** | **FAIL — MPI_Abort** |
| Shared-file + `I_MPI_FILESYSTEM_FORCE=ufs` | 8 nodes / 224 PE | **28** | **FAIL — MPI_Abort, total failure** |
| Shared-file + `I_MPI_FILESYSTEM_FORCE=ufs` + `I_MPI_FILESYSTEM_NFS_DIRECT=enable` | 8 nodes / 224 PE | **28** | **FAIL — MPI_Abort, total failure** |
| Shared-file + `ROMIO_FSTYPE_FORCE="ufs:"` | 8 nodes / 224 PE | **0** | PASS — 4.1 GB/s write, 5.7 GB/s read |
| File-per-process (`-F`) + `I_MPI_FILESYSTEM_FORCE=ufs` | 8 nodes / 224 PE | **0** | PASS — works correctly |
**Error output (2021.17, 2-node reproducer):**
```
ERROR: cannot open file: /path/to/nfs/testfile, MPI Other I/O error , error stack:
internal_File_open(3211): MPI_File_open(comm=0x84000002, filename=/path/to/nfs/testfile, amode=37, info=0x9c000000, fh=0x93ef00) failed
ADIO_OPEN(535)..........: open failed on a remote node, (aiori-MPIIO.c:236)
ERROR: cannot open file: /path/to/nfs/testfile, MPI File does not exist, error stack:
internal_File_open(3211): MPI_File_open(comm=0x84000002, filename=/path/to/nfs/testfile, amode=37, info=0x9c000000, fh=0x11f87e0) failed
ADIOI_UFS_OPEN(37)......: File /path/to/nfs/testfile does not exist, (aiori-MPIIO.c:236)
Abort(-1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
```
**Behavior at scale (2021.17, 8 nodes / 224 ranks):**
At larger scale, `I_MPI_FILESYSTEM_FORCE=ufs` causes a hard `MPI_Abort`: the job terminates immediately after 28 ADIO_OPEN errors are emitted across the 8 nodes. The error stack suggests that when the UFS module is selected via this variable, the shared-file open path fails on remote nodes: rank 0 appears to proceed, but the other ranks receive "File does not exist" errors.
Notably, the upstream ROMIO variable `ROMIO_FSTYPE_FORCE="ufs:"` produces zero errors and equivalent performance for the same workload. This suggests the issue may be specific to how `I_MPI_FILESYSTEM_FORCE` routes into the UFS module, rather than in ROMIO's UFS module itself.
### Intel MPI 2021.11 (oneAPI 2024.0) — Shared-File Baseline Affected
| Test Configuration | Tool | Scale | ADIO Errors | Outcome |
|---|---|---|---|---|
| Shared-file, no tuning (baseline) | PNetCDF fandc | 8 nodes / 224 PE | **140** | **FAIL — data corruption (cprnc diff = 1.010)** |
| Shared-file, no tuning (baseline) | IOR | 8 nodes / 224 PE | **28** | **FAIL** |
| Shared-file + `I_MPI_FILESYSTEM_FORCE=ufs` | IOR | 8 nodes / 224 PE | **56** | **FAIL** |
| Shared-file + `ROMIO_FSTYPE_FORCE="ufs:"` | PNetCDF fandc | 8 nodes / 224 PE | **140** | **FAIL — data corruption** |
| Shared-file, no tuning (baseline) | IOR (strace) | 2 nodes | crash | **FAIL — ADIO_OPEN crash** |
On 2021.11, shared-file MPI-IO fails at multi-node scale even without any tuning variables set. The `ADIO_OPEN(522)` error occurs regardless of environment variable settings — `I_MPI_FILESYSTEM_FORCE`, `ROMIO_FSTYPE_FORCE`, and `I_MPI_FILESYSTEM_NFS_DIRECT` all produce the same failure. With PNetCDF fandc at 8 nodes, this manifests as 140 non-fatal ADIO errors with **silent data corruption** (verified via `cprnc` comparison utility showing diff = 1.010, where 0.000 is expected). The issue appears to be in the shared-file ADIO open path introduced in this release.
**Error message (2021.11):**
```
MPI error (MPI_File_open): Unknown error class, error stack:
ADIO_OPEN(522): open failed on a remote node
```
Note the line number changed from 522 (2021.11) to 535 (2021.17), which suggests the relevant code path was modified between releases. The shared-file failure symptom persists under different conditions in each version.
### Intel MPI 2021.10 — Works Correctly (Reference)
| Test Configuration | ADIO Errors | Outcome |
|---|---|---|
| Shared-file, no tuning (baseline) | **0** | PASS |
| Shared-file + `ROMIO_FSTYPE_FORCE="ufs:"` | **0** | PASS |
| PNetCDF fandc (4-node, 112 PE) | **0** | PASS — zero data diff |
| PNetCDF fandc (8-node, 224 PE) | **0** | PASS — zero data diff |
Intel MPI 2021.10 handles shared-file MPI-IO correctly at all tested scales with zero ADIO errors and zero data corruption, confirming this is a regression introduced in 2021.11.
## Cross-Version Summary
| Intel MPI Version | Shared-File Baseline | `I_MPI_FILESYSTEM_FORCE=ufs` | `ROMIO_FSTYPE_FORCE="ufs:"` |
|---|---|---|---|
| **2021.10** | PASS (0 errors) | N/A (uses older `I_MPI_EXTRA_*` syntax) | PASS (0 errors) |
| **2021.11** | ADIO_OPEN errors at multi-node scale + data corruption | ADIO_OPEN errors | ADIO_OPEN errors |
| **2021.17** | PASS (0 errors) | ADIO_OPEN errors (28 at 8 nodes) + MPI_Abort | PASS (0 errors) |
## Key Observations
1. **The issue first appears in Intel MPI 2021.11.** Version 2021.10 works correctly at all tested scales; 2021.11 exhibits shared-file MPI-IO failures on NFS at multi-node scale.
2. **2021.17 resolves the baseline failure but introduces a different issue.** The default NFS module's shared-file open works correctly again on 2021.17. However, setting `I_MPI_FILESYSTEM_FORCE=ufs` triggers the ADIO_OPEN failure on what would otherwise be a working baseline.
3. **The upstream ROMIO variable behaves differently.** `ROMIO_FSTYPE_FORCE="ufs:"` works correctly on 2021.17 for the same shared-file workload where `I_MPI_FILESYSTEM_FORCE=ufs` fails. This suggests the two variables may follow different code paths when selecting the UFS module.
4. **File-per-process mode is unaffected.** `I_MPI_FILESYSTEM_FORCE=ufs` works correctly for N-N (file-per-process) patterns on 2021.17. The issue is specific to shared-file (N-to-1) opens.
5. **Strace-verified.** All findings were confirmed via `strace` on individual MPI ranks, not just performance observation.
## Expected Behavior
`I_MPI_FILESYSTEM_FORCE=ufs` should select the UFS ROMIO module without introducing ADIO_OPEN failures, consistent with how `ROMIO_FSTYPE_FORCE="ufs:"` behaves. The shared-file open path should work identically regardless of which variable is used to select the UFS module.
## Current Workaround
Use `ROMIO_FSTYPE_FORCE="ufs:"` instead of `I_MPI_FILESYSTEM_FORCE=ufs` for all shared-file MPI-IO workloads on NFS:
```bash
mpirun -env ROMIO_FSTYPE_FORCE "ufs:" -np <N> ./application
```
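If the variable needs to be applied cluster-wide from a batch script, Intel MPI's `-genv` flag propagates it to all ranks; a minimal SLURM sketch (partition, module names, and application name are placeholders):
```bash
#!/bin/bash
#SBATCH -N 8
#SBATCH --ntasks-per-node=28
module load intel/2022 intelmpi/2021.17

# Route shared-file MPI-IO through ROMIO's generic UFS driver on every rank
mpirun -genv ROMIO_FSTYPE_FORCE "ufs:" -np $SLURM_NTASKS ./application
```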
For Intel MPI 2021.11 specifically, the only viable path is to downgrade to 2021.10, as the shared-file baseline itself is broken and no environment variable provides a workaround.
## Request
We would appreciate guidance on the following:
1. Could the `I_MPI_FILESYSTEM_FORCE=ufs` code path for shared-file (N-to-1) MPI-IO on NFS be investigated in Intel MPI 2021.17+? The upstream `ROMIO_FSTYPE_FORCE="ufs:"` works for the same workload, suggesting the issue may be in how the Intel-specific variable routes into the UFS module.
2. Could the shared-file ROMIO regression in Intel MPI 2021.11 be investigated, and could Intel confirm whether versions 2021.12 through 2021.16 are also affected?
3. Would it be possible to document `ROMIO_FSTYPE_FORCE="ufs:"` as a recommended workaround for shared-file I/O on NFS in the interim?
We are happy to provide additional logs, strace output, or run further tests if that would help the investigation. Thank you for your time and for the excellent Intel MPI toolkit.
---
*Findings independently confirmed via both IOR (shared-file mode) and PNetCDF fandc test suite across multiple node counts (2, 4, 8 nodes) on a production HPC cluster with NFS v3 over RDMA storage.*
@JamesChenSG
Is this failure specific to NFS v3 over RDMA or can you also reproduce the issue on a standard NFS?
Thank you for looking into this so quickly.
Our environment uses NFS v3 over RDMA — specifically, InfiniBand (ConnectX-6) on the compute side and RoCEv2 on the storage backend networking. Unfortunately, we do not have a standard NFS v3 over TCP environment available to test against at this time, so I cannot confirm whether the issue reproduces there as well.
One observation that may help: on the same NFS mount with the same transport configuration, `ROMIO_FSTYPE_FORCE="ufs:"` completes successfully while `I_MPI_FILESYSTEM_FORCE=ufs` produces the ADIO_OPEN failure. Since both tests use identical NFS transport, this might suggest the issue is related to how the two variables route into the UFS module rather than the underlying transport, but of course you would know the internals far better than I do.
I am happy to run additional tests if it would help narrow things down — for example, with different nconnect values, mount options, or any other configuration you would like me to try. Please let me know.
Thank you so much.
@JamesChenSG, can you please provide the full set of mount/export options for your NFS mount? If we are not able to reproduce the error, it will be hard to fix.
Hi TobiasK,
Thank you for looking into this. Here are the mount details you requested, along with a self-contained reproducer and supporting debug data across three Intel MPI versions.
---
## 1. NFS Mount Configuration
Client-side mount options (from `/proc/mounts`):
```
[nfs-server]:/export /scratch nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,nconnect=32,timeo=600,retrans=2,sec=sys,local_lock=none 0 0
```
Key parameters:
- **Protocol:** NFSv3 over RDMA (InfiniBand, libfabric provider: mlx)
- **nconnect:** 32
- **rsize/wsize:** 1 MiB
- **Kernel:** 4.18.0-348.23.1.el8_5.x86_64 (Rocky Linux 8.5)
The NFS server is a commercial NFS appliance (not a Linux kernel NFS server). However, we believe the bug is in Intel MPI's ROMIO layer, not in the NFS server — evidence below.
---
## 2. Minimal Reproducer
This C program performs a collective `MPI_File_open` + `MPI_File_write_at_all` + `MPI_File_close` on a single shared file. No external dependencies beyond MPI. It exercises the same ROMIO `ADIO_OPEN` code path used by PNetCDF, parallel HDF5, and IOR shared-file mode.
```c
/*
* mpiio_shared_file_repro.c
* Minimal reproducer for ROMIO ADIO_OPEN failure on NFS
*
* Build: mpicc -o repro mpiio_shared_file_repro.c
* Run: mpirun -np N ./repro /path/to/nfs/testfile.dat
*
* Requires >=2 nodes to trigger the bug.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>
#define BLOCK_SIZE (1024 * 1024)
#define PATTERN_BYTE 0xAB
int main(int argc, char **argv) {
int rank, nprocs, rc;
MPI_File fh;
MPI_Status status;
char *buf;
MPI_Offset offset;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
if (argc < 2) {
if (rank == 0) fprintf(stderr, "Usage: %s <output_file_on_nfs>\n", argv[0]);
MPI_Finalize();
return 1;
}
if (rank == 0) {
char hostname[256];
gethostname(hostname, sizeof(hostname));
printf("=== MPI-IO Shared-File ADIO_OPEN Reproducer ===\n");
printf("Ranks: %d | Rank0: %s | File: %s\n", nprocs, hostname, argv[1]);
printf("Block: %d bytes/rank | Total: %d MiB\n", BLOCK_SIZE, nprocs);
printf("================================================\n");
fflush(stdout);
}
buf = (char *)malloc(BLOCK_SIZE);
memset(buf, PATTERN_BYTE, BLOCK_SIZE);
MPI_Barrier(MPI_COMM_WORLD);
/* Phase 1: Collective open - triggers ADIO_OPEN bug on 2021.11+ */
if (rank == 0) { printf("Phase 1: MPI_File_open (collective)\n"); fflush(stdout); }
rc = MPI_File_open(MPI_COMM_WORLD, argv[1],
MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
if (rc != MPI_SUCCESS) {
char errstr[MPI_MAX_ERROR_STRING]; int errlen;
MPI_Error_string(rc, errstr, &errlen);
fprintf(stderr, "Rank %d: MPI_File_open FAILED: %s\n", rank, errstr);
}
MPI_Barrier(MPI_COMM_WORLD);
if (rc != MPI_SUCCESS) {
if (rank == 0) { printf("RESULT: FAIL - MPI_File_open error (ADIO_OPEN bug)\n"); fflush(stdout); }
free(buf); MPI_Finalize(); return 1;
}
/* Phase 2: Collective write */
if (rank == 0) { printf("Phase 2: MPI_File_write_at_all\n"); fflush(stdout); }
offset = (MPI_Offset)rank * BLOCK_SIZE;
rc = MPI_File_write_at_all(fh, offset, buf, BLOCK_SIZE, MPI_BYTE, &status);
if (rc != MPI_SUCCESS) {
char errstr[MPI_MAX_ERROR_STRING]; int errlen;
MPI_Error_string(rc, errstr, &errlen);
fprintf(stderr, "Rank %d: MPI_File_write_at_all FAILED: %s\n", rank, errstr);
}
/* Phase 3: Close */
if (rank == 0) { printf("Phase 3: MPI_File_close\n"); fflush(stdout); }
MPI_File_close(&fh);
/* Phase 4: Rank 0 verifies */
MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0) {
printf("Phase 4: Verification\n");
FILE *fp = fopen(argv[1], "rb");
if (!fp) { printf("RESULT: FAIL - cannot reopen\n"); }
else {
int total = BLOCK_SIZE * nprocs, ok = 1, i;
char *vbuf = (char *)malloc(total);
size_t nread = fread(vbuf, 1, total, fp);
fclose(fp);
if ((int)nread != total) { ok = 0; }
else { for (i=0; i<total; i++) { if ((unsigned char)vbuf[i] != PATTERN_BYTE) { ok=0; break; } } }
printf("RESULT: %s (%d bytes %s)\n", ok ? "PASS" : "FAIL - DATA CORRUPTION",
total, ok ? "verified OK" : "mismatch detected");
free(vbuf);
}
fflush(stdout);
}
free(buf);
MPI_Finalize();
return 0;
}
```
**Build and run:**
```bash
# Build separately against each Intel MPI version
mpicc -o repro mpiio_shared_file_repro.c
# Run on >=2 nodes, shared NFS file path
mpirun -np 112 -ppn 28 ./repro /path/to/nfs/testfile.dat
```
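For clarity, the three configurations compared in the next section map to invocations along these lines (same binary and NFS path as above):
```bash
# Baseline: no I/O tuning variables
mpirun -np 112 -ppn 28 ./repro /path/to/nfs/testfile.dat

# Upstream ROMIO filesystem selection
mpirun -np 112 -ppn 28 -env ROMIO_FSTYPE_FORCE "ufs:" ./repro /path/to/nfs/testfile.dat

# Intel-specific filesystem selection
mpirun -np 112 -ppn 28 -env I_MPI_FILESYSTEM_FORCE ufs ./repro /path/to/nfs/testfile.dat
```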
---
## 3. Test Results — Three Intel MPI Versions
We compiled the reproducer separately against each Intel MPI version and ran on **4 nodes, 28 ranks per node (112 total)**, all writing to the same shared file on NFS. We also confirmed the same behavior with IOR (shared-file mode, `-a MPIIO` without `-F`).
### Minimal Reproducer Results
| Test | Intel MPI Version | Configuration | Failed Ranks | Result |
|------|-------------------|---------------|-------------|--------|
| 1A | **2021.10** Build 20230619 | baseline | 0 / 112 | **PASS** |
| 2A | **2021.11** Build 20231005 | baseline | 112 / 112 | **FAIL** |
| 2B | **2021.11** Build 20231005 | `ROMIO_FSTYPE_FORCE="ufs:"` | 112 / 112 | **FAIL** |
| 2C | **2021.11** Build 20231005 | `I_MPI_FILESYSTEM_FORCE=ufs` | 112 / 112 | **FAIL** |
| 3A | **2021.17** Build 20251215 | baseline | 0 / 112 | **PASS** |
| 3B | **2021.17** Build 20251215 | `ROMIO_FSTYPE_FORCE="ufs:"` | 0 / 112 | **PASS** |
| 3C | **2021.17** Build 20251215 | `I_MPI_FILESYSTEM_FORCE=ufs` | 112 / 112 | **FAIL** |
### IOR Cross-Tool Confirmation
| Intel MPI Version | Configuration | ADIO Errors | Outcome |
|-------------------|---------------|-------------|---------|
| **2021.11** | baseline shared-file | 28 | MPI_Abort |
| **2021.17** | `I_MPI_FILESYSTEM_FORCE=ufs` | 119 | MPI_Abort |
The bug reproduces with both our minimal C program and IOR, confirming the issue is in the ROMIO layer rather than any specific application.
---
## 4. Error Messages
### Intel MPI 2021.11 — all configurations fail
```
Rank 0: MPI_File_open FAILED: Unknown error class, error stack:
ADIO_OPEN(522): open failed on a remote node
ADIOI_UFS_OPEN(37): File <nfs_path>/testfile.dat does not exist
```
All 112 ranks report the same error. No environment variable (`ROMIO_FSTYPE_FORCE`, `I_MPI_FILESYSTEM_FORCE`, or baseline) resolves it.
### Intel MPI 2021.17 with `I_MPI_FILESYSTEM_FORCE=ufs`
Two error classes are reported — 84 remote-node ranks report `File does not exist`, while 28 local-node ranks (on the same node as rank 0) report `Other I/O error`:
```
Rank 56: MPI_File_open FAILED: File does not exist, error stack:
internal_File_open(3211): MPI_File_open(MPI_COMM_WORLD, filename=<redacted>, amode=5, ...) failed
ADIO_OPEN(535): open failed on a remote node
ADIOI_UFS_OPEN(37): File <nfs_path>/testfile.dat does not exist
Rank 26: MPI_File_open FAILED: Other I/O error , error stack:
internal_File_open(3211): MPI_File_open(MPI_COMM_WORLD, filename=<redacted>, amode=5, ...) failed
ADIO_OPEN(535): open failed on a remote node
```
Note: 2021.17 baseline and `ROMIO_FSTYPE_FORCE="ufs:"` both work correctly — only `I_MPI_FILESYSTEM_FORCE=ufs` triggers the failure.
---
## 5. Forensic Observations
### ROMIO source line numbers changed between versions
- **2021.11:** `ADIO_OPEN(522)` → `ADIOI_UFS_OPEN(37)`
- **2021.17:** `ADIO_OPEN(535)` → `ADIOI_UFS_OPEN(37)`
The `ADIO_OPEN` line shifted by 13 lines (522 → 535), indicating the code path was modified between these releases. The inner function `ADIOI_UFS_OPEN(37)` remains at the same line, consistent with the failure occurring in the file-open syscall on remote ranks.
### Error class changed
- **2021.11:** Reports `Unknown error class` — suggesting the error code from `ADIO_OPEN` is not mapped to a recognized MPI error class
- **2021.17:** Reports `File does not exist` (84 remote-node ranks) and `Other I/O error` (28 local-node ranks) — more specific error mapping, but the same root cause
### `internal_File_open(3211)` appears only in 2021.17
The 2021.17 error stack includes an additional frame (`internal_File_open` at line 3211) not present in 2021.11, which may help your team locate the change in the ROMIO source tree.
### strace of rank 0 on 2021.11 (file-related syscalls only)
```
lstat("<nfs_path>/testfile.dat", ...) = -1 ENOENT # file doesn't exist yet
openat("/etc/romio-hints", O_RDONLY) = -1 ENOENT # no hints file (expected)
openat("<nfs_path>/testfile.dat", O_WRONLY|O_CREAT, 0644) = 33 # rank 0 creates file — SUCCESS
close(33) = 0
openat("<nfs_path>/testfile.dat", O_RDWR) = 33 # rank 0 reopens — SUCCESS
close(33) = 0
write(stderr, "Rank 0: MPI_File_open FAILED: Un...") # MPI_File_open still reports failure
```
Rank 0 successfully creates and opens the file at the POSIX level (both `openat` calls return fd=33). The `MPI_File_open` call still returns an error, which suggests the failure occurs during ROMIO's broadcast-to-remote-ranks phase — remote ranks attempt to open the file before NFS has propagated the directory entry, and ROMIO does not retry or wait for convergence.
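For reference, per-rank traces like the one above can be captured by wrapping the binary in a small launcher script; using `PMI_RANK` to name the output files is an assumption about the Hydra launcher environment, not something taken from the original runs:
```bash
#!/bin/bash
# strace_wrap.sh (hypothetical): trace file-related syscalls of one rank,
# write one log per rank, then exec the real binary.
exec strace -ttt -f -e trace=openat,open,lstat,close \
     -o /tmp/strace.rank${PMI_RANK:-unknown} "$@"
```
Launched as `mpirun -np 112 -ppn 28 ./strace_wrap.sh ./repro /path/to/nfs/testfile.dat`.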
### Bug is multi-node specific
In our earlier testing, single-node runs passed on all versions regardless of rank count. The failure requires >=2 nodes, consistent with a race condition in ROMIO's collective open protocol between the creating rank and remote-node ranks.
---
## 6. I_MPI_DEBUG Output
With `I_MPI_DEBUG=30` on the 2021.11 failing run, the error chain visible in the output is:
```
ADIOI_UFS_OPEN(37): File <nfs_path>/testfile.dat does not exist
```
This message repeats for every remote-node rank (84 out of 112 ranks are on nodes 2–4). The MPI startup and fabric initialization complete normally:
```
[0] MPI startup(): Intel(R) MPI Library, Version 2021.11 Build 20231005 (id: 74c4a23)
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
```
This confirms the communication layer is healthy and the failure is localized to the ROMIO `MPI_File_open` path.
---
## 7. Summary
The behavior across versions suggests:
1. **2021.10** handles the NFS collective open correctly at all scales
2. **2021.11** introduced a regression in the `ADIO_OPEN` path that breaks all multi-node shared-file opens on NFS, with no environment variable workaround
3. **2021.17** partially fixed the default path, but `I_MPI_FILESYSTEM_FORCE=ufs` reintroduces the same failure through the UFS module's `ADIOI_UFS_OPEN(37)` code path
Environment summary:
- **OS:** Rocky Linux 8.5, kernel 4.18.0-348.23.1.el8_5
- **Network:** InfiniBand, libfabric provider `mlx`
- **NFS:** NFSv3 over RDMA, nconnect=32, rsize/wsize=1 MiB
- **Compiler:** GCC 8.5.0 (via `mpicc`)
- **Scale:** 4 nodes, 28 ranks per node, 112 total
The reproducer above should allow your team to reproduce this on any multi-node NFS setup. I am happy to provide additional debug data (full `I_MPI_DEBUG=30` logs, complete strace output, IOR logs) or run further tests if that would be helpful.
Best regards,
James
**Follow-up: Deep Debug Results — Root Cause Identified**
Hi TobiasK,
Following up on my previous post with additional debug data. We captured strace from both local (rank 0) and remote (rank 28) ranks on 2021.10 and 2021.11, and found the exact syscall difference that causes the failure.
## Root Cause
The ROMIO open flags changed between 2021.10 and 2021.11: remote-node ranks lost the `O_CREAT` flag.

**2021.10, rank 28 (remote node, SUCCESS):**
```
lstat("testfile.dat")                        = -1 ENOENT
openat("testfile.dat", O_RDWR|O_CREAT, 0644) = 33         <-- SUCCESS
```
**2021.11, rank 28 (remote node, FAILURE):**
```
lstat("testfile.dat")                        = -1 ENOENT
openat("testfile.dat", O_WRONLY)             = -1 ENOENT   <-- FAIL
```
In 2021.10, all ranks (including those on remote nodes) open the shared file with `O_RDWR|O_CREAT`. On NFS, even if rank 0's file creation has not yet propagated to the remote node's directory cache, the remote rank's `O_CREAT` ensures the open succeeds, because NFS handles create-if-not-exists atomically at the server.

In 2021.11, remote ranks open with `O_WRONLY` only, with no `O_CREAT`. If the NFS directory entry has not yet propagated from rank 0's node to the remote node (typically a matter of milliseconds), the open fails with ENOENT. This is the `ADIOI_UFS_OPEN(37)` error we reported.

The change appears to be in the ROMIO `ADIOI_UFS_OPEN` function, which reports the same source line (37) in both versions.
## Timing Evidence
From strace timestamps on the 2021.11 failing run (4 nodes, 112 ranks):
```
Rank 0 creates file:    20:44:34.336xxx  (O_WRONLY|O_CREAT -> fd 33)
Rank 0 reopens file:    20:44:34.339xxx  (O_RDWR -> fd 33)
Rank 28 attempts open:  20:44:34.341xxx  (O_WRONLY -> ENOENT)
Rank 28 writes error:   20:44:34.345xxx
```
The gap between rank 0's file creation and rank 28's open attempt is approximately 5 milliseconds, well within normal NFS directory-cache propagation latency. The 2021.10 code tolerates this with O_CREAT; the 2021.11 code does not.
## Minimum Trigger Scale
We tested Intel MPI 2021.11 at progressively smaller scales:

| Configuration | Total Ranks | Result |
|---|---|---|
| 1 node x 28 ranks | 28 | PASS |
| 2 nodes x 1 rank/node | 2 | FAIL |
| 2 nodes x 2 ranks/node | 4 | FAIL |
| 2 nodes x 14 ranks/node | 28 | FAIL |
| 2 nodes x 28 ranks/node | 56 | FAIL |

The minimum configuration that triggers the bug is 2 nodes with just 1 rank per node. Single-node runs pass at any rank count. This confirms the issue is purely about cross-node file visibility during the collective open, not about rank count or contention.
## Full Strace Comparison (file operations on the shared test file only)
**2021.10 rank 0 (local node, file creator):**
```
lstat("t1.dat")                                  = -1 ENOENT
openat("t1.dat", O_WRONLY|O_CREAT, 0644)         = 33
close(33)                                        = 0
openat("t1.dat", O_RDWR)                         = 33
openat("t1.dat", O_RDWR|O_CREAT|O_DIRECT, 0644)  = 34
[write operations follow normally]
```
**2021.10 rank 28 (remote node):**
```
lstat("t1.dat")                                  = -1 ENOENT
openat("t1.dat", O_RDWR|O_CREAT, 0644)           = 33   <-- O_CREAT present
openat("t1.dat", O_RDWR|O_CREAT|O_DIRECT, 0644)  = 34
[write operations follow normally]
```
**2021.11 rank 0 (local node, file creator):**
```
lstat("t2.dat")                                  = -1 ENOENT
openat("t2.dat", O_WRONLY|O_CREAT, 0644)         = 33
close(33)                                        = 0
openat("t2.dat", O_RDWR)                         = 33
close(33)                                        = 0
[rank 0 reports the MPI error despite local opens succeeding]
```
**2021.11 rank 28 (remote node):**
```
lstat("t2.dat")                                  = -1 ENOENT
openat("t2.dat", O_WRONLY)                       = -1 ENOENT   <-- no O_CREAT!
[immediate failure, no further file operations]
```
## ROMIO_PRINT_HINTS Output
Both versions print identical ROMIO hint keys (`romio_cb_read=automatic`, `romio_cb_write=automatic`, etc.). The hints system is not involved in the failure; the difference is in the open syscall flags only.
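For anyone reproducing this: ROMIO prints its hints from rank 0 of the file's communicator when the `ROMIO_PRINT_HINTS` environment variable is set, e.g.:
```bash
mpirun -np 4 -env ROMIO_PRINT_HINTS 1 ./repro /path/to/nfs/testfile.dat
```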
## Summary
The bug is a one-line change in the ROMIO open flags: remote-node ranks in 2021.11 use `O_WRONLY` instead of `O_RDWR|O_CREAT` when opening the shared file during a collective `MPI_File_open`. On any network filesystem where directory entries may not be immediately visible on remote clients (NFS, and potentially others), this causes ENOENT failures.

The fix should be to restore `O_CREAT` in the open flags for the `ADIOI_UFS_OPEN` path on non-rank-0 processes, matching the 2021.10 behavior.
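Until a fix is available, one application-level mitigation we are considering (for codes we control) is to pre-create the shared file from rank 0 and wait for it to become visible on every node before the collective open. This is only a sketch under the assumption that the ENOENT comes from delayed directory-entry visibility; the helper name is ours:
```c
/* Hypothetical helper: pre-create the shared file and wait for NFS
 * visibility on every node before the collective MPI_File_open.
 * Sketch only; assumes the ENOENT is caused by delayed directory-entry
 * propagation, per the strace evidence above. */
#include <fcntl.h>
#include <unistd.h>
#include <mpi.h>

static int open_shared_file_nfs(MPI_Comm comm, const char *path,
                                int amode, MPI_Info info, MPI_File *fh)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        /* Create the file up front with POSIX O_CREAT (idempotent). */
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd >= 0) close(fd);
    }
    MPI_Barrier(comm);

    /* Every rank polls until the entry is visible locally (bounded retries). */
    for (int i = 0; i < 100 && access(path, F_OK) != 0; i++)
        usleep(10000); /* 10 ms */

    return MPI_File_open(comm, path, amode, info, fh);
}
```
This masks the race for applications we can modify, but the proper fix remains restoring `O_CREAT` on the remote-rank open path in ROMIO.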
All strace data was captured with the same minimal C reproducer attached to my previous post, running on 4 nodes / 112 ranks.

Happy to provide the full strace files or run any additional tests if needed.
Best regards,
James