Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Interrupted system call from gprof

Viet-Duc
Novice

Hi, 

When we compile with the '-pg' option, the following message appears during execution:

hfi_userinit: assign_context command failed: Interrupted system call
hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
 rank           0 : Hello, World!
 rank           1 : Hello, World!

This causes code performing heavy numerical computations to hang. 

The only related information we could find on this issue is from the Intel OPA PSM2 repo: https://github.com/intel/opa-psm2/issues/28
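That issue points at the interaction between gprof and PSM2: the '-pg' instrumentation arms an ITIMER_PROF interval timer, and the resulting SIGPROF can land while the assign-context call is blocked in the kernel, surfacing as EINTR if the call is not restarted. For illustration, the usual retry-on-EINTR pattern looks like the sketch below (generic C; the descriptor and request names are hypothetical stand-ins, not the actual PSM2 internals):

/* Generic retry-on-EINTR pattern for a blocking ioctl(2).
 * fd and req are hypothetical stand-ins for whatever PSM2
 * actually uses inside hfi_userinit. */
#include <errno.h>
#include <sys/ioctl.h>

static int ioctl_retry(int fd, unsigned long req, void *arg)
{
    int rc;
    do {
        rc = ioctl(fd, req, arg);           /* may be interrupted by SIGPROF */
    } while (rc == -1 && errno == EINTR);   /* restart instead of failing */
    return rc;
}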

Here is our system information:

- Linux 3.10.0-1062.el7.x86_64

- Intel 2019 Update 5 

- hfi1-firmware-0.9-84

We would appreciate your insight on how to minimize these interrupted system calls.

Regards.   

PrasanthD_intel
Moderator

Hi,

Are you getting this message on every execution, or does it occur randomly?

Can you provide details such as which compiler you are using and its version?

Please provide the compilation and execution commands you are using.

 

Thanks

Prasanth

 

Viet-Duc
Novice

We repeated the 'Hello, World!' test 10 times for each of several MPI rank counts.

The compilation was done with Intel Fortran Compiler and Intel MPI library 2019 update 5, as mentioned in the first post.

mpiifort -pg hello.f90 -o hello.x 

The execution was done as follows:

mpirun -np #nranks ./hello.x

(where nranks = 1, 2, 4, 8, or 16)
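For anyone trying to reproduce this, a minimal C analogue of our hello.f90 is sketched below (we built the Fortran version; mpiicc is the corresponding Intel C wrapper, and the output formatting is approximate):

/* hello.c - minimal reproducer, C analogue of hello.f90.
 * Build:  mpiicc -pg hello.c -o hello.x
 * Run:    mpirun -np #nranks ./hello.x */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);        /* the PSM2 context open happens in here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf(" rank %11d : Hello, World!\n", rank);
    MPI_Finalize();
    return 0;
}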

Even with a single MPI rank, the message appears randomly, on average 4 out of 10 runs for small rank counts.

When the number of ranks exceeds 16, the message always appears at the beginning of execution.

I hope this helps with the diagnosis.

Thanks.

PrasanthD_intel
Moderator

Hi,

Please provide us the log output with the following environment variables set:

export I_MPI_DEBUG=5

export FI_LOG_LEVEL=debug

 

Thanks

Prasanth

Viet-Duc
Novice

Apologies for the wall of text. The following debug information was generated using 2 MPI ranks.

No 'Interrupted system call': https://justpaste.it/57lp1

With 'Interrupted system call': https://justpaste.it/3t6r2

The aforementioned message occurred between calls to libfabric:psm2:core:psmx2_trx_ctxt_alloc():

libfabric:psm2:core:psmx2_trx_ctxt_alloc():282<info> uuid: 00FF00FF-0000-0000-0000-00FF00FF00FF
libfabric:psm2:core:psmx2_trx_ctxt_alloc():287<info> ep_open_opts: unit=-1 port=0
node8102.17670hfi_userinit: assign_context command failed: Interrupted system call
node8102.17670hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
libfabric:psm2:core:psmx2_trx_ctxt_alloc():320<info> epid: 0000000003d30d02 (tx+rx)
libfabric:psm2:core:psmx2_am_init():116<info> epid 0000000003d30d02
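To confirm that the profiling timer is already armed at that point, one could query ITIMER_PROF just before MPI_Init (a hypothetical diagnostic we have not run, assuming gprof's timer is the source of the signal):

/* timer_check.c - print the profiling timer state before MPI_Init. */
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct itimerval it;
    getitimer(ITIMER_PROF, &it);   /* -pg arms this timer via setitimer() */
    fprintf(stderr, "ITIMER_PROF interval: %ld.%06ld s\n",
            (long)it.it_interval.tv_sec, (long)it.it_interval.tv_usec);
    MPI_Init(&argc, &argv);        /* the interrupted ioctl happens in here */
    MPI_Finalize();
    return 0;
}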
PrasanthD_intel
Moderator

Hi,

We are escalating your issue to the respective team.

 

Thanks

Prasanth

Viet-Duc
Novice

I have included more data here in case it helps with the diagnosis.

1. 'Hello, World!' random test:

    n = 1: 4/10 (i.e. 1 MPI rank, 4 of 10 times the "Interrupted system call" appears) 

    n = 2: 2/10 

    n = 4: 5/10 

    n = 8: 10/10 

    n = 16: 10/10 

    With a sufficiently large number of MPI ranks, the message always appears at the beginning of execution.

2. Single-node test: VASP (5.4.4), QE (6.5), LAMMPS (12Dec18), GROMACS (2019.6)

    Output from these widely used codes is attached in the zip file. Tests were conducted on the KNL architecture using 64 MPI ranks.

3. Multi-node test: 

    With more than 1024 MPI ranks, the calculation may (but does not always) crash with the following error message:

node8054.60631PSM2 can't open hfi unit: -1 (err=23)
node8054.60632PSM2 can't open hfi unit: -1 (err=23)
Abort(1615759) on node 338 (rank 338 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........: 
MPID_Init(923)...............: 
MPIDI_OFI_mpi_init_hook(1211): 
create_endpoint(1892)........: OFI endpoint open failed (ofi_init.c:1892:create_endpoint:Invalid argument)
Abort(1615759) on node 339 (rank 339 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........: 
MPID_Init(923)...............: 
MPIDI_OFI_mpi_init_hook(1211): 
create_endpoint(1892)........: OFI endpoint open failed (ofi_init.c:1892:create_endpoint:Invalid argument)

The testing environment is the same as outlined in our first post. We are open to your suggestions for further tests.

Thanks.

Kevin_O_Intel1
Employee

Sorry for the delay. I wanted to let you know that I filed a bug report. I will let you know the status of the issue.

Regards

Kevin_O_Intel1
Employee

Looking through some older threads... do you still need assistance here?

cisong1
Beginner

Hello, 

I would like to know about this issue. Could you let me know whether it has been solved or not?

Kevin_O_Intel1
Employee
Carlospdp
Beginner

Please can you provide an updated link or a solution? Thank you.

Carlospdp
Beginner

The link no longer works. Do you know if there is a way of fixing the error without removing the -pg flag?
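In case it helps others: one avenue that keeps -pg is to disarm gprof's profiling timer around MPI_Init, so SIGPROF cannot interrupt the context-open call. This is only a sketch, assuming that timer is the culprit, and profiling coverage of MPI_Init itself is lost:

/* Disarm ITIMER_PROF across MPI_Init, then restore it. */
#include <mpi.h>
#include <string.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct itimerval saved, off;
    memset(&off, 0, sizeof off);           /* a zero interval disarms the timer */

    getitimer(ITIMER_PROF, &saved);        /* remember gprof's settings */
    setitimer(ITIMER_PROF, &off, NULL);    /* no SIGPROF during startup */

    MPI_Init(&argc, &argv);                /* hfi context open runs unsignalled */

    setitimer(ITIMER_PROF, &saved, NULL);  /* resume profiling */

    /* ... application work ... */
    MPI_Finalize();
    return 0;
}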
