Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Crash using impi

L__D__Marks
New Contributor II
6,652 Views

I would welcome suggestions as to the source of an error within the impi code; I cannot determine the cause myself, as I do not have access to the source.

I have a crash "integer divide by zero" using impi which gives me an error message (first part of the trace):

 

0 0x0000000000b696ed next_random() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_types.h:1809
1 0x0000000000b696ed impi_bcast_intra_huge() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:667
2 0x0000000000b6630d impi_bcast_intra_heap() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:798
3 0x000000000018ef6d MPIDI_POSIX_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/intel/posix_coll.h:124
4 0x000000000017335e MPIDI_SHM_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_coll.h:39
5 0x000000000017335e MPIDI_Bcast_intra_composition_alpha() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:303
6 0x000000000017335e MPID_Bcast_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
7 0x000000000017335e MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
8 0x0000000000153bee MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
9 0x000000000021c02d MPID_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
10 0x00000000001386e9 PMPI_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/bcast/bcast.c:416
11 0x00000000000e8924 pmpi_bcast_() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/binding/fortran/mpif_h/bcastf.c:270
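
For reference, the bottom two frames of the trace (pmpi_bcast_ and PMPI_Bcast) are simply the Fortran binding and the C entry point of MPI_Bcast. Below is a minimal sketch of the kind of call that takes this path; the message size, datatype, and root are hypothetical and are not taken from the actual application.

program bcast_sketch
  use mpi
  implicit none
  integer :: ierr, myrank
  integer, parameter :: n = 100000            ! hypothetical message size
  double precision, allocatable :: buf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  allocate(buf(n))
  if (myrank == 0) buf = 1.0d0

  ! This call enters pmpi_bcast_ (the Fortran binding, bcastf.c) and then
  ! PMPI_Bcast -- frames 11 and 10 of the trace above.  The shared-memory
  ! bcast path (MPIDI_POSIX_mpi_bcast and below) is selected inside the library.
  call MPI_Bcast(buf, n, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program bcast_sketch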

 

36 Replies
ShivaniK_Intel
Moderator
2,409 Views

Hi,


Could you please let us know the network you have been using?


Could you also please let us know whether you are able to run simple benchmarks such as IMB on your cluster?


>>>"I have a crash "integer divide by zero" using impi which gives me an error message (first part of the trace)"


Could you please elaborate on the line where you mention the crash "integer divide by zero"? We do not see "integer divide by zero" anywhere in the report.


Thanks & Regards

Shivani



L__D__Marks
New Contributor II
2,401 Views

In answer to your questions:

1. Somewhat minimal technical specifications of the cluster are at https://www.it.northwestern.edu/departments/it-services-support/research/computing/quest/specs.html I am using "Quest11" for this issue. It is an InfiniBand network. I have asked for more details.

2. Since this is a large cluster, network tests have of course been run -- but not by me, as I am not a sysadmin. The network is fast.

3. I have attached the full log of the crash; please see the first two lines and the last. It occurs at line 1809 of intel_transport_types.h.

ShivaniK_Intel
Moderator
2,372 Views

Hi,


Could you please run the IMB benchmark supplied with Intel MPI on your cluster? You can run it without being a system admin. Just PingPong (MPI-1).


IMB benchmarks User Guide –

https://www.intel.com/content/www/us/en/develop/documentation/imb-user-guide/top.html


Thanks & Regards

Shivani


L__D__Marks
New Contributor II
2,360 Views
Why?

It is a cluster with, currently, 6240 Slurm jobs running on about 700 nodes.

Please convince me how this is relevant to impi crashing. Latency is not relevant.
ShivaniK_Intel
Moderator
2,342 Views

Hi,


Could you please let us know whether this error happens only with the specific application you have provided?


Thanks & Regards

Shivani


L__D__Marks
New Contributor II
2,333 Views

The error occurs with the specific number of MPI processes, as stated. You previously told me that you reproduced it.

L__D__Marks
New Contributor II
2,321 Views
Your last question is not relevant. Please respond to what I asked three months ago: what leads to the divide by zero in the Intel code? I have provided full information, and you stated on Jan 17th that you reproduced it.
Mark_L_Intel
Moderator
2,285 Views

Hello,

 

Sorry if you have already answered these questions, but please confirm:

 

The issue has been reported with a very old version of IMPI: 2021.1. Is the problem reproducible with a recent release as well? I think you stated that you saw hangs with the most recent version, 2021.8?

The reproducer attached is a third-party application. Could we have a simple reproducer that we can use to reproduce the problem?

 

We currently have the IMPI engineering team ready to look at your case, but it would help to speed things up if you could answer these questions.

 
L__D__Marks
New Contributor II
2,277 Views

I have answered all of these before:

 

The issue has been reported with a very old version of IMPI: 2021.1.

That is because more recent versions are both slower and die more often.

 

- Is the problem reproducible with a recent release as well? I think you stated that you saw hangs with the most recent version, 2021.8?

More recent versions are worse.

 

The reproducer attached is a third-party application.

The reproducer is where it occurs. I have written parts of this code.

 

- Could we have a simple reproducer that we can use to reproduce the problem?

No

 

To repeat the original question: what is the source of the divide by zero? Please focus on this and the application I provided. Trivial codes rarely have major issues; big, major codes used by thousands represent a different class of problem. I gave you the relevant part of a major code, www.wien2k.at.

 

Earlier I was informed that you had reproduced the issue. Then I started getting not-so-clever questions, for instance asking whether a cluster of >700 nodes has a working MPI.

 

 

 

 

Mark_L_Intel
Moderator
2,227 Views

After internal discussion -- unfortunately, it is not possible for the engineering team to investigate the crash log without setting up the dev systems with the actual release it was reported on.

2021.1 is not a supported release (https://www.intel.com/content/www/us/en/developer/articles/release-notes/intel-parallel-studio-xe-supported-and-unsupported-product-versions.html).

Looking at your previous comments -- that in general the later versions behave even worse -- we hope for your understanding that the developers would need to reproduce the specific problem as reported (division by zero) with the latest release.

 

Regarding your comment about the question "whether a cluster of >700 nodes has a working MPI": the actual question was whether microbenchmarks can be run, which was meant to establish whether we are dealing with an application-specific problem or not. This is one of our standard questions, and most of the time people are willing to answer it. Your answer is that the error is application-specific, that is, it is observed only with your application. Please note that, in this case, it is possible that the error (division by zero) comes from the application layer (at least in theory), correct? So far, our developers have not been able to connect the error in your log with a problem within Intel MPI (even though some IMPI lines are clearly listed in your log).

L__D__Marks
New Contributor II
2,214 Views

Inlining:

 

| Looking at your previous comments -- that in general the later versions behave even worse -- we hope for your understanding that the developers would need to reproduce the specific problem as reported (division by zero) with the latest release.

Please go ahead and use a later release -- you will reproduce the problem or it will be worse.

 

| Your answer is that the error is application-specific, that is, it is observed only with your application.

I never said that. I said that the important case is in this application. To ask me to find an Intel impi issue within 10,000 lines of code is unreasonable. All I asked from the start was to know why there was a divide by zero in your code.

 

| Please note that, in this case, it is possible that the error (division by zero) comes from the application layer (at least in theory), correct?

No. The divide by zero is in the impi code; the traceback says "intel_transport_bcast.h:667".

 

| So far, our developers have not been able to connect the error in your log with a problem within Intel MPI (even though some IMPI lines are clearly listed in your log).

That is not the same as saying that there is no problem. You still have not answered my first question!

 

Not to be rude, but I need to be blunt. It is not obvious to me what you are trying to achieve. My first email was 11-13-2022. It took Shivani until 01-17-2023 before it appeared that the issue was being reproduced, to quote:

"We are able to reproduce the issue with M1 and are able to run without any problem with M1b. We are working on it and will get back to you."

 

Following this, from 01-24-2023 new questions started being asked that have nothing to do with reproducing the issue. I have the strong impression from your emails that you are, at best, deflecting the problem.

 

  

Mark_L_Intel
Moderator
2,194 Views

Could you please take a look at the attached output ptF.output1up_1.log (from your application compiled with IMPI 2021.8.0; I added the *.log extension to be able to attach the output file)? Does it look like the expected output from a good run? Otherwise, please share an expected good output that we can use for comparison. We are continuing the investigation with different versions of IMPI.

L__D__Marks
New Contributor II
2,184 Views

That is a run that hung and/or crashed. If it crashed, then there would be some information in standard output, PtF.dayfile, or the job output if it was submitted. If it hung, then there will be a "run out of time" message in the job output, or the user stopped it.

 

After the log you gave, there should be something similar to the attached -- the timings will be different, about 1.7 times those in the attached, with the Mflops about 1.7 times smaller.

Mark_L_Intel
Moderator
2,163 Views

We are able to build the application with different oneAPI versions starting from 2021.1.1 (even though this version is not supported, we were hoping to reproduce the "divide by zero" -- please see the next paragraph). We are getting only hangs with any oneAPI version. The hangs seem to occur around " allocate legendre  175.5 MB ..."

 

For completeness -- we are running on one node with InfiniBand in interactive mode allocated by Slurm. Simple MPI jobs run fine on these node(s) -- I always check before running your app.

 

I consulted with our IMPI developers regarding intel_transport_types.h:1809 (IMPI 2021.1) -- their response: "function (at this location) cannot generate division by zero. The function is simple enough." Another engineer and I also looked at the code, and so far we cannot come up with an explanation of how the divide by zero was generated there (nor can we reproduce this error yet with 2021.1).

 

We might try a debug version of the IMPI library for more clues. We also tried to compile with "-g -backtrace" and run under gdb with one rank. But it looks like it may take us some time to find out why the application hangs.

Meanwhile,  may I ask these questions:

- Would it be possible to get a smaller workload that would run in just a few minutes? The workload you gave is supposed to run ~70 min. I'm still not sure whether we are using your scripts correctly -- and a smaller workload would help to catch that. BTW, have you been able to run with smaller workloads?

- Is there an earlier version of IMPI that you were able to run with? From any older Parallel Studio packages, maybe?

- Would it be possible for us to compile and run your app with a different MPI? We have OpenMPI, MVAPICH, and MPICH installations on our cluster.

 

 

 

 

 

 

L__D__Marks
New Contributor II
5,317 Views

Inlined with >

We are able to build the application with different oneAPI versions starting from 2021.1.1 (even though this version is not supported, we were hoping to reproduce the "divide by zero" -- please see the next paragraph). We are getting only hangs with any oneAPI version. The hangs seem to occur around " allocate legendre  175.5 MB ..."

 

> The I/O is buffered, so that is not the actual location; it is the last released write.
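
> As a minimal sketch of what I mean (the unit number and messages below are placeholders, not the actual WIEN2k code): an explicit FLUSH after each progress write makes the last printed line reflect the last completed step, which helps when localizing a hang.

program flush_sketch
  implicit none
  integer :: istep
  do istep = 1, 3
     write (6, '(a,i0)') 'starting step ', istep
     flush (6)        ! Fortran 2003 FLUSH: push the line out before the work starts
     call heavy_work(istep)
  end do
contains
  subroutine heavy_work(i)
    integer, intent(in) :: i
    ! stand-in for the real work; if this hangs, the flushed line above really is
    ! the last completed write, not just the last line the runtime happened to release
  end subroutine heavy_work
end program flush_sketch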

 

For completeness -- we are running on one node with InfiniBand in interactive mode allocated by Slurm. Simple MPI jobs run fine on these node(s) -- I always check before running your app.

 

I consulted with our IMPI developers regarding intel_transport_types.h:1809 (IMPI 2021.1) -- their response: "function (at this location) cannot generate division by zero. The function is simple enough." Another engineer and I also looked at the code, and so far we cannot come up with an explanation of how the divide by zero was generated there (nor can we reproduce this error yet with 2021.1).

> I expect that the code was overwritten, and/or it was a line or so before or after; I do not know what compilation options were used in impi, so I can only speculate.
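
> Purely to illustrate that hypothesis (this is my own sketch, not the IMPI source, which I do not have): a function that on paper is "simple enough" to never divide by zero will still trap with an integer divide by zero if a field it uses as a divisor or modulus has been clobbered to zero by a stray write elsewhere.

program clobber_sketch
  implicit none
  call pick_bucket(12345, 8)   ! the intended call: perfectly safe
  call pick_bucket(12345, 0)   ! what a clobbered divisor looks like: integer divide by zero
contains
  subroutine pick_bucket(seed, nbuckets)
    integer, intent(in) :: seed, nbuckets
    ! "simple enough" on paper, yet it traps the instant nbuckets arrives as zero
    print *, 'bucket =', mod(seed, nbuckets)
  end subroutine pick_bucket
end program clobber_sketch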

 

We might try a debug version of the IMPI library for more clues. We also tried to compile with "-g -backtrace" and run under gdb with one rank. But it looks like it may take us some time to find out why the application hangs.

> That will take forever. I think you mean -traceback, which I already use in my code. Valgrind might be better.

 

Meanwhile,  may I ask these questions:

- Would it be possible to get a smaller workload that would run in just a few minutes?

> No, because that will not have the issue!

The workload you gave is supposed to run ~70 min. I'm still not sure whether we are using your scripts correctly -- and a smaller workload would help to catch that. BTW, have you been able to run with smaller workloads?

> I routinely run with 120 cores, and also smaller versions (which have no issues), as do my postdocs. Around the world there are probably 100 people running smaller workloads of the same program at any given time.

> You can run the 63-core case; that will complete -- I thought this was done before.

 

- Is there an earlier version of IMPI that you were able to run with? From any older Parallel Studio packages, maybe?

> I did not have access to a 64-core machine before, so I never tested.

- Would it be possible for us to compile and run your app with a different MPI? We have OpenMPI, MVAPICH, and MPICH installations on our cluster.

 

> The code works, except for this case. Since your developers are telling you that the Intel ifort output is wrong, it does not look like this is a good approach. I can think of several things I can do when I have time. Perhaps it is best if I solve this, as I have some coding and debugging experience -- I just don't have the source code.

ShivaniK_Intel
Moderator
2,081 Views

Hi,


Going ahead and closing this thread. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks & Regards

Shivani

