Viet-Duc
Beginner
506 Views

Interrupted system call from gprof

Hi, 

When we compile with the '-pg' option, the following message is received during execution:

hfi_userinit: assign_context command failed: Interrupted system call
hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
 rank           0 : Hello, World!
 rank           1 : Hello, World!

This causes code performing heavy numerical computations to hang. 

The only related information we can find on this issue is from Intel OPA repo: https://github.com/intel/opa-psm2/issues/28

Here is our system information: 

- Linux 3.10.0-1062.el7.x86_64

- Intel 2019 Update 5 

- hfi1-firmware-0.9-84

We would appreciate your insight on how to minimize these interrupted system calls. 

Regards.   

9 Replies
PrasanthD_intel
Moderator
506 Views

Hi,

Are you getting this message for every execution or is this a random thing?

Can you provide details like which compiler you are using and the version?

Please provide the compilation and execution commands you are using.

 

Thanks

Prasanth

 

Viet-Duc
Beginner
506 Views

We've repeated the 'hello, world' test 10 times for each of several different numbers of MPI ranks.

The compilation was done with Intel Fortran Compiler and Intel MPI library 2019 update 5, as mentioned in the first post.

mpiifort -pg hello.f90 -o hello.x 

The execution was done as follows:

mpirun -np #nranks ./hello.x

(where nranks = 1, 2, 4, 8, or 16)

Even with a single MPI rank, the message appears randomly, on average about 4 out of 10 runs for small numbers of ranks.

When the number of ranks reaches 16, the message always appears at the beginning of execution.
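For what it is worth, the repetition can be scripted along the following lines (a minimal sketch; the 'count_hits' helper name is our own, and the mpirun/./hello.x command is the one shown above):

```shell
# count_hits CMD TRIALS: run CMD TRIALS times and count how many runs
# print the PSM2 "Interrupted system call" message on stdout or stderr.
count_hits() {
    cmd=$1
    trials=$2
    hits=0
    i=1
    while [ "$i" -le "$trials" ]; do
        if sh -c "$cmd" 2>&1 | grep -q "Interrupted system call"; then
            hits=$((hits + 1))
        fi
        i=$((i + 1))
    done
    echo "$hits/$trials"
}

# e.g. count_hits "mpirun -np 2 ./hello.x" 10
```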

I hope this helps with the diagnosis.

Thanks.

PrasanthD_intel
Moderator
506 Views

Hi,

Please provide us the log output with the following flags set:

export I_MPI_DEBUG=5

export FI_LOG_LEVEL=debug
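For example, the flags can be applied and the combined output captured in a single file like this (a minimal sketch; the 'run_with_debug' name and the 'debug.log' path are only illustrative):

```shell
# run_with_debug CMD LOGFILE: run CMD with Intel MPI and libfabric
# debug logging enabled, saving combined stdout/stderr to LOGFILE.
run_with_debug() {
    I_MPI_DEBUG=5 FI_LOG_LEVEL=debug sh -c "$1" > "$2" 2>&1
}

# e.g. run_with_debug "mpirun -np 2 ./hello.x" debug.log
```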

 

Thanks

Prasanth

Viet-Duc
Beginner
506 Views

I apologize for the wall of text. The following debug information was generated using 2 MPI ranks. 

No 'Interrupted system call': https://justpaste.it/57lp1

With 'Interrupted system call': https://justpaste.it/3t6r2

The aforementioned message occurred between calls to libfabric:psm2:core:psmx2_trx_ctxt_alloc():

libfabric:psm2:core:psmx2_trx_ctxt_alloc():282<info> uuid: 00FF00FF-0000-0000-0000-00FF00FF00FF
libfabric:psm2:core:psmx2_trx_ctxt_alloc():287<info> ep_open_opts: unit=-1 port=0
node8102.17670hfi_userinit: assign_context command failed: Interrupted system call
node8102.17670hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
libfabric:psm2:core:psmx2_trx_ctxt_alloc():320<info> epid: 0000000003d30d02 (tx+rx)
libfabric:psm2:core:psmx2_am_init():116<info> epid 0000000003d30d02
PrasanthD_intel
Moderator
506 Views

Hi,

We are escalating your issue to the respective team.

 

Thanks

Prasanth

Viet-Duc
Beginner
506 Views

I have included more data here in case it helps with the diagnosis.

1. 'Hello, World!' random test:

    n = 1: 4/10 (i.e. with 1 MPI rank, the "Interrupted system call" message appears 4 out of 10 times) 

    n = 2: 2/10 

    n = 4: 5/10 

    n = 8: 10/10 

    n = 16: 10/10 

    With a sufficiently large number of MPI ranks, the message always appears at the beginning of execution. 

2. Single-node test: VASP (5.4.4), QE(6.5), LAMMPS(12Dec18), GROMACS(2019.6) 

    Outputs from these widely used codes are attached in the zip file. Tests were conducted on the KNL architecture using 64 MPI ranks. 

3. Multi-node test: 

    For more than 1024 MPI ranks, the calculation may, but does not always, crash with the following error message:    

node8054.60631PSM2 can't open hfi unit: -1 (err=23)
node8054.60632PSM2 can't open hfi unit: -1 (err=23)
Abort(1615759) on node 338 (rank 338 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........: 
MPID_Init(923)...............: 
MPIDI_OFI_mpi_init_hook(1211): 
create_endpoint(1892)........: OFI endpoint open failed (ofi_init.c:1892:create_endpoint:Invalid argument)
Abort(1615759) on node 339 (rank 339 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........: 
MPID_Init(923)...............: 
MPIDI_OFI_mpi_init_hook(1211): 
create_endpoint(1892)........: OFI endpoint open failed (ofi_init.c:1892:create_endpoint:Invalid argument)

The testing environment is the same as outlined in our first post. We are open to your suggestions for further tests. 

Thanks.

Kevin_O_Intel1
Employee
472 Views

Sorry for the delay. I wanted to let you know that I have filed a bug report. I will let you know the status of the issue.

Regards

Kevin_O_Intel1
Employee
178 Views

Looking through some older threads... do you still need assistance here?

cisong1
Beginner
149 Views

Hello, 

I would like to know about this issue. Could you let me know whether it has been solved or not?
