Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

-check_mpi option causes a seg fault

Kevin_McGrattan

I am running a job that uses 10 MPI processes. When all 10 processes are put onto one node of a Linux cluster, the case runs as expected. When the 10 processes are split 5 and 5 over two nodes, the results are different and the job fails. I thought I would compile the Fortran program with -check_mpi. I am using oneAPI 2021.4 for the Fortran compiler and MPI libraries.

 

When I run my case split over two nodes, I see the usual preliminaries in standard error:

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON

but when the program starts, it fails immediately with the messages:

 

srun: error: burn007: tasks 0-2,4: Segmentation fault (core dumped)
srun: Terminating job step 195466.0
slurmstepd: error: *** STEP 195466.0 ON burn007 CANCELLED AT 2021-11-09T11:00:08 ***
srun: error: burn008: tasks 6-7,9: Segmentation fault (core dumped)

 

burn007 and burn008 are just node names. This program runs fine without the -check_mpi option, at least until the failure occurs much later on. Is there something I need to do beyond just adding -check_mpi to my list of compiler/linker options?

 

ShivaniK_Intel
Moderator

 

Hi,

 

Thanks for reaching out to us.

 

Could you please provide us with the results of the Intel Cluster Checker by running the command below?

clck -f <nodefile>

 

For more details regarding the Cluster Checker, you can refer to the link below.

 

https://www.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/getting...

 

Could you also please confirm whether you face a similar issue with a sample MPI "Hello World" program?

mpiifort sample_mpi.f90 -o sample
mpirun -n <no. of processes> -ppn <processes per node> -f nodefile ./sample

 

For a sample MPI Hello World program, you can refer to the attachment.
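The attachment itself is not reproduced in this thread, but a minimal MPI Hello World in Fortran would look roughly like the sketch below (names and formatting are illustrative; it matches the kind of output shown later in the thread):

```fortran
! Minimal MPI "Hello World" sketch -- each rank reports itself and the
! node it is running on. Compile with: mpiifort sample_mpi.f90 -o sample
program sample_mpi
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, namelen
  character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_GET_PROCESSOR_NAME(hostname, namelen, ierr)
  write(*,*) 'Hello world: rank ', rank, ' of ', nprocs, &
             ' running on ', hostname(1:namelen)
  call MPI_FINALIZE(ierr)
end program sample_mpi
```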

 

Thanks & Regards

Shivani

 

Kevin_McGrattan

The "Hello World" case runs successfully with the -check_mpi option:

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

 Hello world: rank            0  of            4  running on
 burn003

 Hello world: rank            1  of            4  running on
 burn003

 Hello world: rank            2  of            4  running on
 burn004

 Hello world: rank            3  of            4  running on
 burn004


[0] INFO: Error checking completed without finding any problems.

 

Kevin_McGrattan
SUMMARY
  Command-line:   clck -f nodefile
  Tests Run:      health_base
  **WARNING**:    2 tests failed to run. Information may be incomplete. See clck_execution_warnings.log for more information.
  Overall Result: 5 issues found - FUNCTIONALITY (2), HARDWARE UNIFORMITY (2), PERFORMANCE (1)
------------------------------------------------------------------------------------------------------------------------------------
36 nodes tested:         burn[001-009], burn[010-036]
17 nodes with no issues: burn[003-006,009], burn[010-013,016-023]
19 nodes with issues:    burn[001-002,007-008], burn[014-015,024-036]
------------------------------------------------------------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
  1. Port '1' of InfiniBand HCA 'mlx4_0' is in the 'Polling' physical state, not the 'LinkUp' physical state.
       12 nodes: burn[025-036]
  2. Port '1' of InfiniBand HCA 'mlx4_0' is in the 'Down' state, not the 'Active' state.
       12 nodes: burn[025-036]

HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
  1. The InfiniBand PCI revision for device 'IBA7322 QDR InfiniBand HCA' in slot 0000:04:00.0, '1', is not uniform. 3% of nodes in
     the same grouping have the same revision.
       1 node: burn035
  2. The 'qib0' InfiniBand HCA hardware version, '1', is not uniform. 3% of nodes in the same grouping have the same hardware
     version.
       1 node: burn035

PERFORMANCE
The following performance issues were detected:
  1. Processes using high CPU.
       10 nodes: burn[001-002,007-008], burn[014-015,024-025,035-036]

SOFTWARE UNIFORMITY
No issues detected.

See the following files for more information: clck_results.log, clck_execution_warnings.log
Kevin_McGrattan

The option -check_mpi works with the simple "Hello World" program. My issue is that my program, which is far more complicated, fails when I introduce this option; cases run fine without it. I wanted to use the option because I had a case that produced slightly different results when running its 10 MPI processes on a single node versus multiple nodes, and it also produced different results when using the tcp fabric vs. ofi. By adding print statements to the code, I found where one of the MPI processes hangs when I use -check_mpi: it gets stuck in a call to MPI_IRECV. But I do not know why, and -check_mpi produces no line numbers or explanation for the seg faults. As I said before, the cases run without the -check_mpi option, but produce slightly different results depending on the configuration.

 

So my issue is that -check_mpi provides no information about the problem of the differing results; in fact, it causes the job to hang and ultimately fail. My options are either to wait for the next version and try again (I'm using the latest), or to figure out whether there is some set of options to use along with -check_mpi to look for bad MPI calls in my code.
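For what it's worth, the limits the checking library prints in its INFO banner (CHECK-MAX-ERRORS, DEADLOCK-TIMEOUT, and so on) are tunable. A hedged sketch of one way to relax them, assuming the Intel Trace Collector convention of a configuration file pointed to by the VT_CONFIG environment variable; the file name is made up here, and the directive spellings are taken from the banner, so treat the details as assumptions and check the Trace Collector documentation:

```shell
# Config-fragment sketch: report every error instead of aborting after
# the first one, and give deadlock detection more time.
cat > vt.conf <<'EOF'
CHECK-MAX-ERRORS 0
DEADLOCK-TIMEOUT 120s
EOF
export VT_CONFIG=$PWD/vt.conf
mpirun -n 10 -ppn 5 -f nodefile ./my_app   # ./my_app is a placeholder
```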

ShivaniK_Intel
Moderator

Hi,


Thanks for providing the details.


We have observed from the Cluster Checker logs you provided that the nodes on which you initially saw the failure (burn007 & burn008) are among those flagged with performance issues.


We have also observed that you were able to run the Hello World program successfully on two nodes (burn003 & burn004), which reported no issues.


Could you please rerun on the 17 nodes with no issues (burn[003-006,009], burn[010-013,016-023]) and let us know whether your issue persists?


Thanks & Regards

Shivani


Kevin_McGrattan

I do not believe that this is related to particular nodes. This error with -check_mpi happens on two different clusters with different fabrics and communicators.

 

I took our code and ran a 32-process MPI test case, and everything ran fine. Then I simply added -check_mpi to the compiler arguments. When I ran the case again, it failed with seg faults:

srun: error: burn038: tasks 11-13: Segmentation fault (core dumped)
srun: Terminating job step 195732.0
slurmstepd: error: *** STEP 195732.0 ON burn037 CANCELLED AT 2021-11-12T11:14:07 ***
srun: error: burn038: tasks 16-17,19-20: Segmentation fault
srun: error: burn037: tasks 0-3: Segmentation fault (core dumped)
srun: error: burn037: tasks 4-9: Segmentation fault
srun: error: burn039: tasks 23-25,28-29,31: Segmentation fault
srun: error: burn039: task 27: Segmentation fault (core dumped)
srun: error: burn038: tasks 14-15,18: Segmentation fault (core dumped)

 

I started adding print statements to figure out where the seg faults were occurring. I traced one MPI process to an MPI_IRECV call:

 

CALL MPI_IRECV(M2%IIO_S(1),M2%NIC_S,MPI_INTEGER,PROCESS(NM),NM,MPI_COMM_WORLD,REQ(N_REQ+1),IERR)

 

I checked all the arguments and they are fine. I cannot figure out what is wrong because the process is killed before I can evaluate IERR.
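One way to get at IERR in the non-fatal cases is to install MPI_ERRORS_RETURN as the error handler, so a failing call returns an error code instead of aborting, and that code can be decoded with MPI_ERROR_STRING. This is a hedged sketch, not the actual FDS code: BUF, NUM, SOURCE, and TAG are placeholder names, and a SIGSEGV inside the checking layer would still kill the process before IERR is ever set.

```fortran
! Sketch: make MPI calls return error codes rather than aborting,
! then decode the code from a failing MPI_IRECV.
subroutine post_recv(buf, num, source, tag, req)
  use mpi
  implicit none
  integer, intent(inout) :: buf(*)
  integer, intent(in)    :: num, source, tag
  integer, intent(out)   :: req
  integer :: ierr, ierr2, err_len
  character(len=MPI_MAX_ERROR_STRING) :: err_string

  ! Replace the default MPI_ERRORS_ARE_FATAL handler on the communicator.
  call MPI_COMM_SET_ERRHANDLER(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

  call MPI_IRECV(buf, num, MPI_INTEGER, source, tag, MPI_COMM_WORLD, req, ierr)
  if (ierr /= MPI_SUCCESS) then
     call MPI_ERROR_STRING(ierr, err_string, err_len, ierr2)
     write(*,*) 'MPI_IRECV failed: ', err_string(1:err_len)
  end if
end subroutine post_recv
```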

 

Other MPI processes die in different places, but there is some rationale to which processes fail and when. Of course, none of this happens when I remove the -check_mpi option. 

 

So my question is this: what can I do to diagnose the problem (if there really is a problem) when the tool I am using to diagnose it causes fatal errors? If, as you say, there is something wrong with our cluster or network or whatever, shouldn't this tool tell me? Also, why would the case run fine without this diagnostic turned on?

 

You may not be able to answer these questions, and I may just have to wait for the next update. That is usually how these things get fixed, because it is impossible for me to pinpoint the problem any further than I have. But I would like to know whether there is some additional compiler option that might help me figure out whether there is something improper about the MPI_IRECV call. That is, I assume, the point of the -check_mpi option.

 

ShivaniK_Intel
Moderator

Hi,

 

Could you please provide us with a sample reproducer so we can investigate your issue further?

 

Could you also provide the debug log of your program using I_MPI_DEBUG=20?

 

For example:

I_MPI_DEBUG=20 mpirun -n <no. of processes> -ppn <processes per node> -f nodefile ./a.out

 

Thanks & Regards

Shivani

 

Kevin_McGrattan
$ I_MPI_DEBUG=20
$ mpirun -n 2 -ppn 2 -f nodefile ~/firemodels/fds/Build/impi_intel_linux_64_db/fds_impi_intel_linux_64_db device_test.fds
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20


 Starting FDS ...

 MPI Process      0 started on blaze.el.nist.gov
 MPI Process      1 started on blaze.el.nist.gov

 Reading FDS input file ...


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 7749 RUNNING AT blaze
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 7750 RUNNING AT blaze
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
Kevin_McGrattan

In my previous post, I pasted the results of running a test case with the -check_mpi option. I have tracked down where the seg fault originates: an MPI_IRECV call. But if I remove the -check_mpi option, everything runs normally.

There are two options to fix this -- (1) suggest some option I can add that might help me better understand what is wrong with the MPI call, or at least produce a useful error message, or (2) you could check out the program from our GitHub account, compile it, and run a test case yourself. I'd like to say that this would be easy for you to do, but you know how there are always a million glitches that can prevent you from compiling, at least in a short amount of time. In the best of all possible worlds, you would just clone the repo, type make, and everything would work. I'd be happy to explain what to do if you want to go down that road.

ShivaniK_Intel
Moderator

Hi,


We are working on it and will get back to you soon.


Thanks & Regards

Shivani



ShivaniK_Intel
Moderator

Hi,


We have reported this issue to the development team, they are looking into this issue.


Thanks & Regards

Shivani

