Hi,
When we compile with the '-pg' option, the following message is received during execution:
hfi_userinit: assign_context command failed: Interrupted system call
hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
rank 0 : Hello, World!
rank 1 : Hello, World!
This causes code performing heavy numerical computations to hang.
The only related information we can find on this issue is from the Intel OPA repo: https://github.com/intel/opa-psm2/issues/28
Here is our system information:
- Linux 3.10.0-1062.el7.x86_64
- Intel 2019 Update 5
- hfi1-firmware-0.9-84
We would appreciate any insight on how to minimize these interrupted system calls.
Regards.
- Tags:
- Cluster Computing
- General Support
- Intel® Cluster Ready
- Message Passing Interface (MPI)
- Parallel Computing
Hi,
Are you getting this message for every execution or is this a random thing?
Can you provide details like which compiler you are using and the version?
Please provide the compilation and execution commands you are using.
Thanks
Prasanth
We have repeated the 'Hello, World!' test 10 times for each of several different numbers of MPI ranks.
The compilation was done with Intel Fortran Compiler and Intel MPI library 2019 update 5, as mentioned in the first post.
mpiifort -pg hello.f90 -o hello.x
The execution was done as follows:
mpirun -np #nranks ./hello.x
(Where nranks = 1, 2, 4, 8, 16)
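For reference, hello.f90 is just a minimal MPI program of the following form (shown here as a sketch; the exact source was not included in this thread and may differ slightly):
! hello.f90 -- minimal MPI 'Hello, World!' reproducer (assumed sketch)
program hello
  use mpi                                          ! Intel MPI Fortran module
  implicit none
  integer :: rank, ierr
  call MPI_Init(ierr)                              ! initialize MPI
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! query this process's rank
  print '(a,i0,a)', 'rank ', rank, ' : Hello, World!'
  call MPI_Finalize(ierr)                          ! shut down MPI
end program hello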
Even with just a single MPI rank, the message appears randomly, on average about 4 out of 10 times for small numbers of ranks.
When the number of ranks exceeds 16, the message always appears at the beginning of the execution.
I hope it can help with the diagnosis.
Thanks.
Hi,
Please provide us with the log info generated with the following flags set:
export I_MPI_DEBUG=5
export FI_LOG_LEVEL=debug
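For example, for a two-rank run it could look like this (the rank count and the log-file name are only illustrative):
export I_MPI_DEBUG=5
export FI_LOG_LEVEL=debug
mpirun -np 2 ./hello.x 2>&1 | tee debug.log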
Thanks
Prasanth
Apologies for the wall of text. The following debug information was generated using 2 MPI ranks.
No 'Interrupted system call': https://justpaste.it/57lp1
With 'Interrupted system call': https://justpaste.it/3t6r2
The aforementioned message occurred between calls to libfabric:psm2:core:psmx2_trx_ctxt_alloc():
libfabric:psm2:core:psmx2_trx_ctxt_alloc():282<info> uuid: 00FF00FF-0000-0000-0000-00FF00FF00FF
libfabric:psm2:core:psmx2_trx_ctxt_alloc():287<info> ep_open_opts: unit=-1 port=0
node8102.17670hfi_userinit: assign_context command failed: Interrupted system call
node8102.17670hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
libfabric:psm2:core:psmx2_trx_ctxt_alloc():320<info> epid: 0000000003d30d02 (tx+rx)
libfabric:psm2:core:psmx2_am_init():116<info> epid 0000000003d30d02
Hi,
We are escalating your issue to the respective team.
Thanks
Prasanth
I have included more data here in case it helps with the diagnosis.
1. 'Hello, World!' random test:
n = 1: 4/10 (i.e. with 1 MPI rank, the "Interrupted system call" message appears 4 out of 10 times)
n = 2: 2/10
n = 4: 5/10
n = 8: 10/10
n = 16: 10/10
With a sufficiently large number of MPI ranks, the message always appears at the beginning of the execution.
2. Single-node test: VASP (5.4.4), QE (6.5), LAMMPS (12Dec18), GROMACS (2019.6)
Output from these widely used codes is attached in the zip file. The tests were conducted on the KNL architecture using 64 MPI ranks.
3. Multi-node test:
With more than 1024 MPI ranks, the calculation may (but does not always) crash with the following error message:
node8054.60631PSM2 can't open hfi unit: -1 (err=23)
node8054.60632PSM2 can't open hfi unit: -1 (err=23)
Abort(1615759) on node 338 (rank 338 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(923)...............:
MPIDI_OFI_mpi_init_hook(1211):
create_endpoint(1892)........: OFI endpoint open failed (ofi_init.c:1892:create_endpoint:Invalid argument)
Abort(1615759) on node 339 (rank 339 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(923)...............:
MPIDI_OFI_mpi_init_hook(1211):
create_endpoint(1892)........: OFI endpoint open failed (ofi_init.c:1892:create_endpoint:Invalid argument)
The testing environment is the same as outlined in our first post. We are open to suggestions for further tests.
Thanks.
Sorry for the delay. I wanted to let you know that I have filed a bug report. I will let you know the status of the issue.
Regards
Looking through some older threads... do you still need assistance here?
Hello,
I would like to follow up on this issue. Could you let me know whether it has been solved or not?
I believe this is the answer to your question.
Please can you provide an updated link or a solution? Thank you.
The link no longer works. Do you know if there is a way of fixing the error without removing the -pg flag?