Hi,
I am trying an MPMD run of an application with Intel PSXE 2020u4 (with Intel MPI version 2019u12).
export OMP_NUM_THREADS=1
export UCX_TLS=self,sm,dc
export I_MPI_DEBUG=30
export FI_LOG_LEVEL=debug
time mpirun -v -np 4716 $PWD/app1 : -n 54 $PWD/app2
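(For reference, the same MPMD launch can also be driven from a config file instead of the inline colon syntax; a minimal sketch, assuming Intel MPI's hydra -configfile option, with mpmd.conf a hypothetical file name:

# mpmd.conf: one launch specification per line
-n 4716 ./app1
-n 54 ./app2

time mpirun -v -configfile mpmd.conf
)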
I get the following issue after ~20 minutes of runtime:
....
==== backtrace (tid: 441318) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x00000000004716c1 MPIDIG_context_id_to_comm() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_impl.h:67
2 0x0000000000472036 MPIDI_POSIX_mpi_iprobe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_probe.h:46
3 0x0000000000472036 MPIDI_SHM_mpi_iprobe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:407
4 0x0000000000472036 MPIDI_iprobe_unsafe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:94
5 0x0000000000472036 MPIDI_iprobe_safe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:223
6 0x0000000000472036 MPID_Iprobe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_probe.h:373
7 0x0000000000472036 PMPI_Iprobe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/pt2pt/iprobe.c:110
8 0x00000000006a2226 xios::CContextServer::eventLoop() ???:0
9 0x00000000006893ea xios::CContext::checkBuffersAndListen() ???:0
10 0x0000000000ab3857 xios::CServer::eventLoop() ???:0
11 0x00000000006a71a2 xios::CXios::initServerSide() ???:0
12 0x000000000040ee50 MAIN__() ???:0
13 0x0000000000c539a6 main() ???:0
14 0x00000000000237b3 __libc_start_main() ???:0
15 0x000000000040ed6e _start() ???:0
....
The log size is ~5 GB. I plan to share the logs via Google Drive and will update this thread with the link soon.
Hi,
Thanks for reaching out to us.
Could you please let us know which application you are using so that we can try reproducing your issue at our end?
If providing the log will take time, could you please provide the following details in the meantime so we can investigate your issue further?
1. The FI_PROVIDER (mlx/psm2/psm3) you are using; the fi_info command reports these details (see the sketch after this list).
2. A sample reproducer code along with the steps to run it, so that we can investigate further.
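A hedged sketch of how fi_info can be used (fi_info is the standard libfabric utility; exact flags may vary by libfabric version):

fi_info -l        # list the provider names available on this node
fi_info -p mlx    # print details for one provider, e.g. mlx

If fi_info is not available, the I_MPI_DEBUG output you already enabled also prints the selected libfabric provider at startup.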
Thanks & Regards,
Varsha
Hi,
The code is NEMO 3.6 compiled with GCOM 6.1, XIOS 2.0, and OASIS3-MCT 4.0.
The fi_info command is unavailable to normal users on our clusters.
I am in the process of uploading the logs.
Meanwhile, I recompiled the code with Open MPI (using icc), and (as expected) the code works fine!
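Since the backtrace above dies in the shared-memory iprobe path (MPIDI_POSIX_mpi_iprobe), one experiment I may also try is forcing Intel MPI off the shm transport; a hedged sketch using the documented I_MPI_FABRICS control, not a confirmed fix:

# route all traffic through OFI instead of the default shm:ofi
export I_MPI_FABRICS=ofi
time mpirun -v -np 4716 $PWD/app1 : -n 54 $PWD/app2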
Hi,
Could you please let us know whether you are able to run the application using Intel MPI without any issues? If not, could you please provide the complete system configuration along with the steps to reproduce the issue?
Also, could you please let us know whether the application runs without issues under other versions of Intel MPI? (A quick way to confirm which build is actually in use is sketched below.)
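A hedged sketch for confirming which Intel MPI build is being picked up (install paths and flags vary by site and version):

which mpirun        # confirm which launcher is on PATH
mpirun -V           # print the Intel MPI library version
echo $I_MPI_ROOT    # root of the Intel MPI installation in use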
Thanks & Regards,
Varsha
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks & Regards,
Varsha