Solved: Re:Crash using impi

L__D__Marks · ‎11-13-2022

I would welcome suggestions as to the source of an error within the impi code. I cannot determine the cause myself, as I do not have access to the source.

I have a crash "integer divide by zero" using impi which gives me an error message (first part of the trace):

0 0x0000000000b696ed next_random() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_types.h:1809
1 0x0000000000b696ed impi_bcast_intra_huge() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:667
2 0x0000000000b6630d impi_bcast_intra_heap() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:798
3 0x000000000018ef6d MPIDI_POSIX_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/intel/posix_coll.h:124
4 0x000000000017335e MPIDI_SHM_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_coll.h:39
5 0x000000000017335e MPIDI_Bcast_intra_composition_alpha() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:303
6 0x000000000017335e MPID_Bcast_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
7 0x000000000017335e MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
8 0x0000000000153bee MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
9 0x000000000021c02d MPID_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
10 0x00000000001386e9 PMPI_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/bcast/bcast.c:416
11 0x00000000000e8924 pmpi_bcast_() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/binding/fortran/mpif_h/bcastf.c:270
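
The frames are easier to scan once the long build-tree prefix is stripped. The following is a minimal sketch, assuming the trace has been saved to a text file; the helper name `shorten_frames` and the temporary file are illustrative only and are not part of any Intel tooling. Only frames 0 and 1 from the trace above are reproduced here.

```shell
# Save two frames of the trace to a throwaway file (copied from the post above).
trace=$(mktemp)
cat > "$trace" <<'EOF'
0 0x0000000000b696ed next_random() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_types.h:1809
1 0x0000000000b696ed impi_bcast_intra_huge() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:667
EOF

# Hypothetical helper: reduce each frame to "<frame> <function> <file>:<line>"
# by dropping the address and everything up to the last "/" of the path.
shorten_frames() {
    sed -E 's|^([0-9]+) 0x[0-9a-f]+ ([^ ]+) .*/([^/]+:[0-9]+)$|\1 \2 \3|' "$1"
}

shorten_frames "$trace"
# prints:
# 0 next_random() intel_transport_types.h:1809
# 1 impi_bcast_intra_huge() intel_transport_bcast.h:667
```

Frames 0 and 1 share the same address, which is consistent with next_random() having been inlined into impi_bcast_intra_huge().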

L__D__Marks · ‎02-19-2023

Inlined with >

We are able to build the application with different oneAPI versions starting from 2021.1.1 (even though this version is not supported, we were hoping to reproduce the "divide by zero" error -- please see the next paragraph). We get only hangs with every oneAPI version. The hangs seem to occur around " allocate legendre 175.5 MB ..."

> The I/O is buffered, so that is not the actual location, it is the last released write.

For completeness -- we are running on one node with InfiniBand in interactive mode allocated by Slurm. Simple MPI jobs run fine on these node(s) -- I always check before running your app.

I consulted with our IMPI developers regarding intel_transport_types.h:1809 (IMPI 2021.1) -- their response: "function (at this location) cannot generate division by zero. The function is simple enough." Another engineer and I also looked at the code, and so far we cannot come up with an explanation of how a divide by zero was generated there (nor can we reproduce this error yet with 2021.1).

> I expect that the code was overwritten, and/or it was a line or so before or after; I do not know what compilation options were used in impi so I can only speculate.

> We might try a debug version of IMPI library for more clues. We also tried to compile with "-g -backtrace" and run under gdb with one rank. But it looks like it may take us some time to find why the application hangs.

> That will take forever. I think you mean -traceback, which I already did in my code. Valgrind might be better.

Meanwhile, may I ask these questions:

- Would it be possible to get a smaller workload that would run for just a few minutes?

> No, because that will not have the issue!

The workload you gave is supposed to run ~70 min. I'm still not sure if we are using your scripts correctly -- and a smaller workload would help to catch that. BTW, have you been able to run with smaller workloads?

> I am routinely running with 120 cores, also smaller versions (which have no issues) as are my postdocs. Around the world there are probably 100 people running smaller workloads of the same program at any given time.

You can run the 63 core case, that will complete -- I thought this was done before.

- Is there a last version of IMPI that you were able to run with? From any older Parallel Studio packages maybe?

> I did not have access to a 64 core machine before, so I never tested.

- Would it be possible for us to compile and run (your app) with different MPI? We have OpenMPI, MVAPICH, MPICH installation on our cluster.

> The code works, except for this case. Since your developers are telling you that the Intel ifort output is wrong, it does not look like this is a good approach. I can think of several things I can do when I have time. Perhaps it is best if I solve this, as I have some coding and debugging experience -- I just don't have the source code.

ShivaniK_Intel · ‎11-14-2022

Hi,

Thanks for posting in the Intel forums.

Could you please provide us with the command line and steps to reproduce the issue on our end?

Could you also provide us with the OS details, mpi version?

Thanks & Regards

Shivani

L__D__Marks · ‎11-14-2022

The crash occurred within a program of approximately 50,000 lines of Fortran (Density Functional Theory code Wien2k, www.Wien2k.at) when I was doing a 64 core calculation using Linux on a Gold 6338 which takes ~30 minutes. If you (Intel) are willing to assign someone with sufficient core access to investigate, they should contact me outside this list so we can work out how to proceed.

Or you can provide me with information about the code that crashed.

Or both.

What is not possible is some simple command to reproduce this.

L__D__Marks · ‎11-18-2022

The attached should help you reproduce it. Expand the package (tar -xj, i.e. bzip2), then look at the README inside. Please contact me if you need clarification. This may well be tricky.

ShivaniK_Intel · ‎11-22-2022

Hi,

Could you also provide us with the OS details, mpi version?

Thanks & Regards

Shivani

L__D__Marks · ‎11-22-2022

The Intel version is 2021.1.1. Later versions have worse problems, hanging with no information.

$ uname -a
Linux qnode1058 3.10.0-1160.71.1.el7.x86_64 #1 SMP Wed Jun 15 08:55:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

[lma712@qnode1058 ~]$ head -20 /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
stepping : 6
microcode : 0xd000363
cpu MHz : 2000.000
cache size : 49152 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
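
For anyone asked for the same details, the OS, CPU and MPI information can be gathered in one pass. This is only a sketch under stated assumptions: the report file name is arbitrary, `/proc/cpuinfo` is assumed to exist only on Linux, and `mpirun --version` is assumed to be accepted by whichever MPI launcher is on the PATH.

```shell
# Collect basic system details into a throwaway report file.
report=$(mktemp)
{
    echo "== kernel =="
    uname -a

    echo "== cpu =="
    if [ -r /proc/cpuinfo ]; then
        # First "model name" line only; tolerate CPUs that do not report one.
        grep -m 1 'model name' /proc/cpuinfo || true
    else
        echo "(no /proc/cpuinfo on this system)"
    fi

    echo "== mpi =="
    if command -v mpirun >/dev/null 2>&1; then
        # Assumption: the launcher understands --version.
        mpirun --version 2>&1 | head -n 3
    else
        echo "mpirun not found on PATH"
    fi
} > "$report"

cat "$report"
```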

ShivaniK_Intel · ‎12-01-2022

Hi,

Thanks for providing the reproducer and the steps to reproduce it.

As described in the README file, I followed Step 1; the results are below.

While trying to follow Step 2, I got the results below.

Could you please help us with how to proceed further?

Thanks & Regards

Shivani

L__D__Marks · ‎12-01-2022

Apologies, my errors when extracting part of a large code into a small reproducer.

For the first approach, do first "export SCRATCH=./" before running it.

For the second, edit the Makefile so it has

"ELPAROOT = ../elpa22/"

I forgot the "/" at the end; it is needed.
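
The two corrections above can be sketched as follows. The Makefile edit is demonstrated on a throwaway copy, and the original line `ELPAROOT = ../elpa22` is an assumption about what the reproducer's Makefile contained; adapt the path to the real Makefile before using it.

```shell
# First approach: point SCRATCH at the current directory before running.
export SCRATCH=./

# Second approach: make sure ELPAROOT ends with "/".
# Demonstrated on a temporary stand-in for the real Makefile.
makefile_copy=$(mktemp)
printf 'ELPAROOT = ../elpa22\n' > "$makefile_copy"   # assumed original line

# Rewrite the ELPAROOT line so that it carries the trailing slash.
sed 's|^ELPAROOT *=.*|ELPAROOT = ../elpa22/|' "$makefile_copy" > "$makefile_copy.new"
mv "$makefile_copy.new" "$makefile_copy"

grep '^ELPAROOT' "$makefile_copy"
# prints: ELPAROOT = ../elpa22/
```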

ShivaniK_Intel · ‎12-08-2022

Hi,

I tried following the steps provided in the README, and I get an error when trying to run this step:

x lapw1 -p -orb -up

I have sourced the oneAPI setvars.sh script, but I still get the mpirun command not found error.

Thanks & Regards

Shivani

L__D__Marks · ‎12-08-2022

I suspect that the problem is that eim511 either does not have mpirun or has it in a different location from where you are running it -- the directory is not NFS mounted. Try editing lapw1para so that it has "set mpiremote = 0", so that mpirun is run locally.

You may also need to do "ssh eim511 which mpirun" & "ssh eim512 ldd lapw1_mpi".
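
A quick local check along the same lines is sketched below. The helper name `have_cmd` is hypothetical, and the remote checks are left commented out because they only make sense on the cluster in question (the host names are the ones mentioned above).

```shell
# Hypothetical helper: succeed if a command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Local check: is mpirun visible after sourcing setvars.sh?
if have_cmd mpirun; then
    echo "mpirun found at: $(command -v mpirun)"
else
    echo "mpirun not found on PATH; re-check that setvars.sh was sourced in this shell"
fi

# Remote checks (uncomment on the cluster): a non-interactive ssh session
# may not inherit the environment set up by setvars.sh.
# ssh eim511 which mpirun
# ssh eim512 ldd lapw1_mpi
```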

ShivaniK_Intel · ‎12-15-2022

Hi,

Could you please let us know how much time the below command takes to execute?

x lapw1 -p -orb -up

I have been waiting for more than 1.5 hours but did not get any output.

Thanks & Regards

Shivani

L__D__Marks · ‎12-15-2022

It should take about 70 mins, or forever. Sometimes it crashes with the error code I mentioned before.

You can do tail PtF.outputup_1, which should have something.

N.B., this is with a .machines file with 1:node01:64
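
As a sketch of that setup, the snippet below writes a .machines file containing only the line quoted above and shows a small helper for peeking at the output file while the job runs. It works in a temporary directory so that nothing real is overwritten; the helper name `show_progress` is hypothetical, and the reading of the line as weight 1, host node01, 64 MPI processes is my understanding of the WIEN2k convention rather than something stated in this thread.

```shell
# Work in a throwaway directory; in practice this is the case directory.
workdir=$(mktemp -d)

# Single line as quoted in the post: weight 1, host node01, 64 MPI processes
# (interpretation assumed, see the note above).
printf '1:node01:64\n' > "$workdir/.machines"
cat "$workdir/.machines"

# Hypothetical helper: show the last lines of an output file if it exists yet.
show_progress() {
    if [ -f "$1" ]; then
        tail -n 20 "$1"
    else
        echo "$1 not created yet"
    fi
}

# File name as given in the post; it appears only once the run has started.
show_progress "$workdir/PtF.outputup_1"
```

Because the I/O is buffered, the last lines shown are the last released write, not necessarily the point where the program currently is.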

ShivaniK_Intel · ‎12-25-2022

Hi,

>>>"You can do tail PtF.outputup_1, which should have something".

Could you please let us know how we can get the file PtF.outputup_1?

Thanks & Regards

Shivani

L__D__Marks · ‎12-25-2022

It is created when you run "x lapw1 -up -p -orb"

Are you following the instructions I gave? Your question suggests that you are not.

ShivaniK_Intel · ‎12-26-2022

Hi,

I am following the instructions given by you.

As mentioned in my previous post, I am unable to proceed after the command "x lapw1 -p -orb -up". It got stuck, so I could not get the file PtF.outputup_1 after running the command.

Could you please let us know how to proceed further?

Thanks & Regards

Shivani

L__D__Marks · ‎12-26-2022

The file PtF.output1_up is created when you run the command. You are not following the instructions properly, or have not installed mpi correctly. I suggest that you do a simple test such as an mpi "hello world". Maybe ask someone experienced with high-level parallel computing to help you.
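
A minimal "hello world" test of the kind suggested could look like the sketch below. It assumes the MPI compiler wrapper is called `mpicc` and the launcher `mpirun` (with Intel MPI the wrapper may instead be `mpiicc`); the file and directory names are arbitrary. If either tool is missing, the script only writes the source file and says so.

```shell
# Write a tiny MPI program into a throwaway directory.
hello_dir=$(mktemp -d)
cat > "$hello_dir/hello_mpi.c" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello world from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Compile and run on two ranks if the tools are available (names assumed).
if command -v mpicc >/dev/null 2>&1 && command -v mpirun >/dev/null 2>&1; then
    if mpicc "$hello_dir/hello_mpi.c" -o "$hello_dir/hello_mpi" \
        && mpirun -n 2 "$hello_dir/hello_mpi"; then
        echo "MPI hello world ran"
    else
        echo "MPI hello world failed; check the MPI installation"
    fi
else
    echo "mpicc or mpirun not found; source written to $hello_dir/hello_mpi.c"
fi
```

If this runs, each rank should print one line; it confirms that the launcher and library work, but says nothing about the larger case.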

ShivaniK_Intel · ‎01-01-2023

Hi,

Could you please let me know if I'm on the right track in replicating the issue that you are observing?

Please refer to the document below for the output of the x lapw1 -p -orb -up command and of tail PtF.output1up_1 (5th step in the README).

>>>"I suggest that you do a simple test such as a mpi "hello world"".

I am able to run the simple mpi hello world test program. Please refer to the screenshot below for the results.

Thanks & Regards

Shivani

L__D__Marks · ‎01-02-2023

You are reproducing the problem.

Depending upon which version of oneAPI you are using (as I mentioned before), it may hang forever -- what you are seeing -- or give the error code I reported in my first message. Since I do not have access to the impi source code I do not know what the error is due to, which is why I posted in the first place.

For reference, if instead of M1 you use M1b or M1c (attached), the program should run through without the problem -- but that is not a real cure. I also suggest replacing PtF.klist with the attached version, as it will be faster. (They are in the attached tgz, i.e. tar.gz, archive.)

ShivaniK_Intel · ‎01-10-2023

Hi,

Apologies for the delay.

>>>"For reference, if instead of M1 you use M1b or M1c attached the program should run through without the problem"

Even though I have replaced M1 with M1b I am facing the same error. Could you please let us know if there is anything else that should be changed?

Thanks & Regards

Shivani

L__D__Marks · ‎01-11-2023

I am not sure that you are doing anything wrong. Please provide PtF.output1up_1 for me to check.

How long are you waiting for the program to run?

ShivaniK_Intel · ‎01-17-2023

Hi,

Thank you for your patience.

We are able to reproduce the issue with M1 and are able to run without any problem with M1b. We are working on it and will get back to you.

Thanks & Regards

Shivani