I launch the following on Fedora 40:
mpiexec -machinefile mfile -configure cfile someprogram
and sporadically get an error like this:
Abort(1614735) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(192)........:
MPID_Init(1665)..............:
MPIDI_OFI_mpi_init_hook(1665):
create_vni_context(2245).....: OFI EP enable failed (ofi_init.c:2245:create_vni_context:Address already in use)
This does not happen every time. When it does happen, relaunching the same command can run fine.
Is there any way to get rid of this problem permanently?
What does this "Address already in use" error mean?
I have already searched through the discussions, and none of them seems to apply directly.
@YaDev
Please provide at least your HW and SW environment and the output of a run with I_MPI_DEBUG=10 set.
If you can reproduce the failure, please also set I_MPI_HYDRA_DEBUG=1 and I_MPI_DEBUG=120.
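For anyone unsure how to collect that output, here is a minimal sketch in POSIX shell. The helper name `log_run` and the log file names are hypothetical; only the environment variables and the mpiexec command line come from this thread. It simply runs a command and saves stdout and stderr to a file so the log can be attached to a reply.

```shell
# log_run LOGFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND and writes its stdout and stderr to LOGFILE.
# The exit status is that of COMMAND.
log_run() {
    logfile=$1
    shift
    "$@" > "$logfile" 2>&1
}

# Possible usage with the command from the original post (not executed here):
#   log_run debug10.log env I_MPI_DEBUG=10 \
#       mpiexec -machinefile mfile -configure cfile someprogram
#   log_run debug120.log env I_MPI_HYDRA_DEBUG=1 I_MPI_DEBUG=120 \
#       mpiexec -machinefile mfile -configure cfile someprogram
```

Passing the variables through `env` keeps them scoped to that single launch, so later runs in the same shell are not affected.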
Setting those flags produces a lot of debug output, which also seems to have stopped the
original OFI EP enable failed (ofi_init.c:2245:create_vni_context:Address already in use) error.
I was running MPI many times in succession from a script. Maybe the "vni context address" was not released fast enough between runs, and all this debug output slowed things down enough for it to be released before the next mpiexec ... call?
Is that possible?
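One way to probe this timing hypothesis without the debug flags would be to pause between launches and retry when a launch fails. The sketch below is only an illustration in POSIX shell: the function name `run_with_retry`, the attempt count, and the delay are assumptions, not Intel MPI features, and it works around the symptom rather than explaining the cause.

```shell
# run_with_retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, and if it exits non-zero, waits
# DELAY_SECONDS and tries again, up to MAX_ATTEMPTS times in total.
# Returns 0 on the first success, otherwise the last exit status.
run_with_retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    status=1
    while [ "$attempt" -le "$max_attempts" ]; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed with exit status $status" >&2
        # Only sleep if another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return "$status"
}

# Possible usage inside the batch script (not executed here):
#   run_with_retry 3 5 mpiexec -machinefile mfile -configure cfile someprogram
```

If the error disappears with a fixed delay between runs but no debug flags, that would support the idea that something is not released quickly enough between launches.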