Hello,
I was running an MPI job on multiple nodes with Intel MPI 2021.1.1, and the job aborted with the following error:
[1690463626.483072] [n148:434957:0] cma_ep.c:62 UCX ERROR process_vm_readv(pid=434958 length=42432) returned -1: No such process
[1690463626.521152] [n148:434969:0] cma_ep.c:62 UCX ERROR process_vm_readv(pid=434968 length=42432) returned -1: No such process
[1690463626.522181] [n148:434970:0] cma_ep.c:62 UCX ERROR process_vm_readv(pid=434969 length=42432) returned -1: No such process
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 434942 RUNNING AT n148
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 434942 RUNNING AT n148
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
...
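For triage, the UCX lines above can be summarised by reader and target PID. This is only a minimal sketch in Python: it assumes the second bracketed field is `[host:pid:thread]`, and the names `UCX_CMA_ERROR` and `parse_cma_errors` are hypothetical helpers, not part of UCX or Intel MPI.

```python
import re

# Matches UCX CMA error lines in the form shown in the log above.
# Assumption: the second bracket holds host, PID of the reporting process, thread id.
UCX_CMA_ERROR = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"cma_ep\.c:\d+\s+UCX\s+ERROR\s+"
    r"process_vm_readv\(pid=(?P<target>\d+) length=(?P<length>\d+)\)\s+"
    r"returned\s+-?\d+:\s+(?P<reason>.+)"
)


def parse_cma_errors(log_text):
    """Return one dict per UCX process_vm_readv error line found in log_text."""
    records = []
    for line in log_text.splitlines():
        match = UCX_CMA_ERROR.search(line)
        if match is None:
            continue  # skip lines that are not CMA errors
        records.append({
            "host": match["host"],
            "reader_pid": int(match["pid"]),     # process that attempted the read
            "target_pid": int(match["target"]),  # process it tried to read from
            "length": int(match["length"]),
            "reason": match["reason"].strip(),
        })
    return records
```

Each record pairs the process that called `process_vm_readv` with the PID it tried to read from; "No such process" means that target PID no longer existed at the time of the call.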
Under what conditions does this error occur?
It is difficult to provide detailed information such as the execution script, but I hope to obtain some clues for resolving this error.
The MPI job was running on 16 nodes, and the same job was also running on other nodes at the same time.
The OS, kernel, OFED, and UCX versions are listed below:
OS: CentOS 8.4
kernel: 4.18.0-305.25.1.el8_4.x86_64
OFED: MLNX_OFED_LINUX-4.9-4.0.8.0
UCX: 1.8.0
Thanks,
1kan
Hi,
Thanks for posting in the Intel forums.
Could you please let us know whether you see the same issue with the latest version, Intel oneAPI 2023.2?
Could you also please try with a supported OS version? For more details, please refer to the link below.
Please provide the complete debug log obtained by setting I_MPI_DEBUG=10, along with the command line you have been using.
Thanks & Regards
Shivani
Hi Shivani,
Thank you for your reply.
Intel oneAPI 2023.2 is not installed on the system I am using.
I just want to run MPI jobs using Intel oneAPI 2021.1.1.
If you know anything about what causes this error, please let me know.
Thanks,
1kan
Hi,
Please provide the complete debug log obtained by setting I_MPI_DEBUG=10, along with the command line you have been using.
Could you please provide a sample reproducer and the steps to reproduce the issue at our end?
Could you also let us know whether you are able to run your application on a single node, and the Intel MPI Benchmarks on multiple nodes? This will help us investigate the issue at our end.
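The debug log requested above could be captured with a small wrapper such as the one below. This is a minimal sketch in Python, not an official procedure: `run_with_mpi_debug`, the log file name, and the launch line in the comment are hypothetical, and only the `I_MPI_DEBUG` variable comes from the request itself.

```python
import os
import subprocess
import sys


def run_with_mpi_debug(command, log_path, debug_level=10):
    """Run `command` with I_MPI_DEBUG set and write stdout and stderr to log_path.

    Returns the exit code of the command.
    """
    # Copy the current environment and add the debug setting for this run only.
    env = dict(os.environ, I_MPI_DEBUG=str(debug_level))
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            command,
            env=env,
            stdout=log_file,
            stderr=subprocess.STDOUT,  # keep error output in the same log
        )
    return result.returncode


# Hypothetical usage; replace the launch line with the one actually used:
# run_with_mpi_debug(["mpirun", "-n", "2", "./my_app"], "impi_debug.log")
```

Keeping stdout and stderr in one file preserves the order of the debug output relative to any UCX error lines, which makes the log easier to read when attached to a post.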
Thanks & Regards
Shivani
Hi Shivani,
Sorry for the late reply.
As you suggested, I will consider using Intel oneAPI 2023.2 to run MPI jobs.
Thanks,
1kan
