Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Crash using impi

L__D__Marks
New Contributor I

I would welcome suggestions as to the source of an error inside the impi code; I cannot investigate it myself because I do not have access to the source.

I am getting an "integer divide by zero" crash inside impi. The first part of the backtrace:

 

0 0x0000000000b696ed next_random() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_types.h:1809
1 0x0000000000b696ed impi_bcast_intra_huge() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:667
2 0x0000000000b6630d impi_bcast_intra_heap() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:798
3 0x000000000018ef6d MPIDI_POSIX_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/intel/posix_coll.h:124
4 0x000000000017335e MPIDI_SHM_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_coll.h:39
5 0x000000000017335e MPIDI_Bcast_intra_composition_alpha() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:303
6 0x000000000017335e MPID_Bcast_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
7 0x000000000017335e MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
8 0x0000000000153bee MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
9 0x000000000021c02d MPID_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
10 0x00000000001386e9 PMPI_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/bcast/bcast.c:416
11 0x00000000000e8924 pmpi_bcast_() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/binding/fortran/mpif_h/bcastf.c:270

 

ShivaniK_Intel
Moderator

Hi,


Thanks for posting in the Intel forums.


Could you please provide us with the command line and steps to reproduce the issue on our end?


Could you also provide us with the OS details and the MPI version?


Thanks & Regards

Shivani


L__D__Marks
New Contributor I

The crash occurred within a program of approximately 50,000 lines of Fortran (the Density Functional Theory code Wien2k, www.Wien2k.at) while I was running a 64-core calculation on Linux on a Gold 6338; the run takes ~30 minutes. If Intel is willing to assign someone with access to enough cores to investigate, they should contact me outside this list so we can work out how to proceed.

 

Or you can provide me with information about the code that crashed.

 

Or both.

 

What is not possible is some simple command to reproduce this.

L__D__Marks
New Contributor I

The attached should help you reproduce it. Expand the package (tar -xj, i.e. bzip2), then look at the README inside. Please contact me if you need clarification. This may well be tricky.

ShivaniK_Intel
Moderator

Hi,

 

Could you also provide us with the OS details and the MPI version?

 

Thanks & Regards

Shivani

 

L__D__Marks
New Contributor I

 

The Intel version is 2021.1.1. Later versions have worse problems: they hang with no information.

 

$ uname -a

Linux qnode1058 3.10.0-1160.71.1.el7.x86_64 #1 SMP Wed Jun 15 08:55:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

 

[lma712@qnode1058 ~]$ head -20 /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
stepping : 6
microcode : 0xd000363
cpu MHz : 2000.000
cache size : 49152 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes

 

 

ShivaniK_Intel
Moderator

Hi,

 

Thanks for providing the reproducer and the steps to reproduce it.

 

As described in the README file, I followed Step 1; the results are below.

 

(screenshot attached: Elpa1.png)

While trying to follow Step 2, I got the results below.

 

(screenshot attached: elpa2.png)

 

Could you please help us with how to proceed further?

 

Thanks & Regards

Shivani

 

L__D__Marks
New Contributor I

Apologies; those are my mistakes from extracting part of a large code into a small reproducer.

 

For the first approach, run "export SCRATCH=./" before starting it.

For the second, edit the Makefile so that it has

"ELPAROOT = ../elpa22/"

I forgot the trailing "/"; it is needed.
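The two fixes can be sketched in shell as follows (the paths come from the reproducer package and may differ in your layout; the stand-in file name Makefile.demo is mine, used only to demonstrate the edit):

```shell
# First approach: point SCRATCH at the current directory before running.
export SCRATCH=./

# Second approach: ELPAROOT in the Makefile needs the trailing slash.
# Demonstrated on a stand-in file here; run the sed on the real Makefile.
printf 'ELPAROOT = ../elpa22\n' > Makefile.demo
sed -i 's|^ELPAROOT *=.*|ELPAROOT = ../elpa22/|' Makefile.demo
grep ELPAROOT Makefile.demo    # now shows the trailing slash
```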

ShivaniK_Intel
Moderator

Hi,

 

I tried following the steps provided in the README, and this is the error I get when running this step:

x lapw1 -p -orb -up

(screenshot attached: Screenshot1.png)

 

I have sourced the oneAPI setvars.sh script, but I am still getting a "mpirun: command not found" error.

 

Thanks & Regards

Shivani

 

L__D__Marks
New Contributor I
I suspect the problem is that eim511 either does not have mpirun or has it in a different location from where you are running; the directory is not NFS-mounted. Try editing lapw1para to "set mpiremote = 0" so the mpirun is local.

You may also need to run "ssh eim511 which mpirun" and "ssh eim512 ldd lapw1_mpi".
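Those checks can be scripted along these lines (a sketch; the hostnames are the ones from this thread, and passwordless ssh to them is assumed):

```shell
# Where does mpirun resolve locally? "command -v" prints the full path,
# or nothing when the command is missing.
echo "local mpirun: $(command -v mpirun || echo '<not found>')"

# The equivalent remote checks (uncomment on the actual cluster):
# ssh eim511 'command -v mpirun || echo "<not found>"'
# ssh eim512 'ldd lapw1_mpi'   # confirm the MPI libraries resolve there
```

If the local and remote paths differ, or the remote lookup comes back empty, that matches the "mpirun command not found" symptom above.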
ShivaniK_Intel
Moderator

Hi,


Could you please let us know how much time the below command takes to execute?


x lapw1 -p -orb -up


I have been waiting for more than 1.5 hours but did not get any output.


Thanks & Regards

Shivani


L__D__Marks
New Contributor I
It should take about 70 mins, or forever. Sometimes it crashes with the error code I mentioned before.

You can run "tail PtF.outputup_1", which should show something.

N.B., this is with a .machines file containing "1:node01:64".
ShivaniK_Intel
Moderator

Hi,


>>>"You can do tail PtF.outputup_1, which should have something".


Could you please let us know how we can get the file PtF.outputup_1?


Thanks & Regards

Shivani


L__D__Marks
New Contributor I
It is created when you run "x lapw1 -up -p -orb".

Are you following the instructions I gave? Your question suggests that you are not.
ShivaniK_Intel
Moderator

Hi,

 

I am following the instructions given by you.

 

As mentioned in my previous post, I am unable to proceed further after the command "x lapw1 -p -orb -up". It got stuck, so I could not obtain the file PtF.outputup_1 after running the command.

 

Could you please let us know how to proceed further?

 

(screenshot attached: ec2083b1-25ce-42f6-ad1a-fe331c418d95.png)

 

Thanks & Regards

Shivani

 

L__D__Marks
New Contributor I
The file PtF.output1_up is created when you run the command. Either you are not following the instructions properly, or MPI is not installed correctly. I suggest you run a simple test such as an MPI "hello world". You might also ask someone experienced with large-scale parallel computing to help you.
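A minimal MPI "hello world" could look like the sketch below, wrapped in shell so it can be built with the Intel wrappers (the filename hello_mpi.c and the rank count of 4 are arbitrary choices of mine; mpiicc and mpirun come from sourcing setvars.sh):

```shell
# Write a minimal MPI test program using only standard MPI calls.
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
# Build and run on the cluster; if this hangs or mpirun is not found,
# the MPI installation, not the application, is the problem.
# mpiicc hello_mpi.c -o hello_mpi
# mpirun -n 4 ./hello_mpi
```

If the wrappers are still not on PATH after sourcing setvars.sh, that points to the same environment problem as the earlier "mpirun: command not found" error.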
ShivaniK_Intel
Moderator

Hi,

 

Could you please let me know if I'm on the right track in replicating the issue that you are observing?

 

Please refer to the below document for the output of x lapw1 -p -orb -up command and tail PtF.output1up_1 (5th step in Readme).

 

>>>"I suggest that you do a simple test such as a mpi "hello world"".

 

I am able to run the simple test case mpi hello world program. Please refer below screenshot for the results.

(screenshot attached: sample_mpi.png)

 

Thanks & Regards

Shivani

 

L__D__Marks
New Contributor I

You are reproducing the problem.

Depending upon which version of oneAPI you are using (as I mentioned before), it may hang forever, which is what you are seeing, or give the error code I reported in my first message. Since I do not have access to the impi source code, I do not know what causes the error, which is why I posted in the first place.

For reference, if instead of M1 you use the attached M1b or M1c, the program should run through without the problem, but that is not a real cure. I also suggest replacing PtF.klist with the attached version, as it will be faster. (They are in an attached tgz, i.e. tar.gz.)

ShivaniK_Intel
Moderator


Hi,


Apologies for the delay.


>>>"For reference, if instead of M1 you use M1b or M1c attached the program should run through without the problem"


Even though I have replaced M1 with M1b, I am facing the same error. Could you please let us know if there is anything else that should be changed?


Thanks & Regards

Shivani


L__D__Marks
New Contributor I

I am not sure that you are doing anything wrong. Please provide PtF.output1up_1 so I can check.

How long are you waiting for the program to run?

ShivaniK_Intel
Moderator

Hi,


Thank you for your patience.


We are able to reproduce the issue with M1 and are able to run without any problem with M1b. We are working on it and will get back to you.


Thanks & Regards

Shivani

