Intel® DevCloud

MPI_Init hangs on s012-n004

Robert_C_Intel
Employee

I cannot run a simple MPI program on s012-n004. It hangs in MPI_Init.

 

u50659@s012-n004:mpitest$ cat hello.cpp
#include <iostream>

#include "mpi.h"

int main(int argc, char * argv[]) {
  std::cout << "Hello" << std::endl;
  MPI_Init(&argc, &argv);
  int size, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::cout << "Hello from " << rank << " of " << size << "\n";
  MPI_Finalize();
}
u50659@s012-n004:mpitest$ which mpicxx
/glob/development-tools/versions/oneapi/2022.1.2/oneapi/mpi/2021.5.1//bin/mpicxx
u50659@s012-n004:mpitest$ mpicxx hello.cpp
u50659@s012-n004:mpitest$ ldd a.out
linux-vdso.so.1 (0x00007fff3dd54000)
libmpi.so.12 => /glob/development-tools/versions/oneapi/2022.1.2/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12 (0x00007f53aa95c000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f53aa76b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f53aa579000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f53aa56e000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f53aa568000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f53aa417000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f53aa3fc000)
/lib64/ld-linux-x86-64.so.2 (0x00007f53ac19a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f53aa3d9000)
u50659@s012-n004:mpitest$ export I_MPI_DEBUG=3
u50659@s012-n004:mpitest$ mpirun -n 2 ./a.out
Hello
Hello
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
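
For reference, two environment variables that can give more detail on where MPI_Init stalls; this is only a debugging sketch using standard Intel MPI and libfabric settings, not something specific to this node:

export I_MPI_DEBUG=10       # more verbose startup output from Intel MPI
export FI_LOG_LEVEL=debug   # libfabric provider-selection details
mpirun -n 2 ./a.out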

 

AlekhyaV_Intel
Moderator

Hi Robert,

 

Thank you for posting in Intel Communities. We couldn't access that particular node, "s012-n004", as it is currently down, so we tried to reproduce your error on a different node, "s001-n013", and it ran without any errors or hanging. We have attached a screenshot below.

We suggest you try a different node and let us know whether the issue persists.

Hope this helps!

 

Regards,

Alekhya

 

 

 

Robert_C_Intel
Employee

It works fine for me on other nodes. When I request quad_gpu, it gives me that node. I will try to get a quad_gpu on a different node.

 

Is there a page that describes the GPU configurations? This is what I think I see:

 

gpu: integrated GPU

dual_gpu: discrete GPU, no integrated GPU available

quad_gpu: 3 discrete GPUs, no integrated GPU available

 

This morning, I am unable to get any nodes. It doesn't seem likely that none are available, so I suspect I have some hung jobs that are blocking me. Is there something I need to do to clear the old jobs?

 

u50659@login-2:~$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1840347.v-qsvr-1          STDIN            u50659          00:00:04 R batch
1841484.v-qsvr-1          STDIN            u50659          00:00:05 R batch
u50659@login-2:~$ qselect
1840347.v-qsvr-1.aidevcloud
1841484.v-qsvr-1.aidevcloud
u50659@login-2:~$ qselect | xargs qdel
qdel: Server could not connect to MOM 1840347.v-qsvr-1.aidevcloud
qdel: Server could not connect to MOM 1841484.v-qsvr-1.aidevcloud
u50659@login-2:~$ qsub -I -l nodes=1:quad_gpu:ppn=2 -d .
qsub: waiting for job 1841923.v-qsvr-1.aidevcloud to start
^CDo you wish to terminate the job and exit (y|[n])? y
Job 1841923.v-qsvr-1.aidevcloud is being deleted
u50659@login-2:~$ qsub -I -l nodes=1:dual_gpu:ppn=2 -d .
qsub: waiting for job 1841925.v-qsvr-1.aidevcloud to start
^CDo you wish to terminate the job and exit (y|[n])? y
Job 1841925.v-qsvr-1.aidevcloud is being deleted
u50659@login-2:~$ qsub -I -l nodes=1:ppn=2 -d .
qsub: waiting for job 1841926.v-qsvr-1.aidevcloud to start
^CDo you wish to terminate the job and exit (y|[n])? y
Job 1841926.v-qsvr-1.aidevcloud is being deleted
u50659@login-2:~$

 

AlekhyaV_Intel
Moderator

Hi @Robert_C_Intel ,

 

  • Regarding the GPUs available on DevCloud, we currently don't have any documentation. Could you please elaborate on what you would like to know about the GPUs so that we can work on it internally?

 

  • Regarding the quad_gpu compute nodes, you can check the list of compute nodes with the command below:

 

pbsnodes

 

 

  The above command lists all the compute nodes present in DevCloud along with their properties and state.

 

If you would like to list all the nodes with a particular property, use the command below:

 

pbsnodes | grep "<property>" -A 6 -B 4

 

 

Example:

For quad_gpu & dual_gpu nodes, 

 

pbsnodes | grep "quad_gpu" -A 6 -B 4
pbsnodes | grep "dual_gpu" -A 6 -B 4

 

 

The output looks like this:

 

s012-n003

   state = free

   power_state = Running

   np = 2

   properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu

   ntype = cluster

   status = rectime=1643698603,macaddr=d4:5d:64:08:fb:3c,cpuclock=Fixed,varattr=,jobs=,state=free,netload=42477541384,gres=,loadave=9.00,ncpus=24,physmem=32562336kb,availmem=33649520kb,totmem=34659484kb,idletime=127019,nusers=3,nsessions=4,sessions=126670 126685 1166623 1172551,uname=Linux s012-n003 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64,opsys=linux

   mom_service_port = 15002

   mom_manager_port = 15003

 

 

You can check for free nodes and request one with the command below:

 

qsub -I -l nodes=<node name>:ppn=2
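
For example, to request a node from the listing above by name (a hypothetical usage example; it assumes s012-n003 is still free at submission time):

qsub -I -l nodes=s012-n003:ppn=2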

 

 

  • To delete a job, please try the commands below:

 

qdel <job-ID>

 

   or

 

qdel all
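
For example, using one of the job IDs from the qstat output earlier in this thread:

qdel 1840347.v-qsvr-1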

 

 

If the jobs are still not deleted after trying the above commands, please send us the job IDs and we will try deleting them from our end.

 

Regards,

Alekhya

 

 

AlekhyaV_Intel
Moderator

Hi Robert,


We could see that you've shared the job IDs in this thread (https://community.intel.com/t5/Intel-DevCloud/Cannot-get-interactive-node/m-p/1356356/emcs_t/S2h8ZW1haWx8Ym9hcmRfc3Vic2NyaXB0aW9ufEtaNDZUQjUwTUtQMVNPfDEzNTYzNTZ8U1VCU0NSSVBUSU9OU3xoSw#M4240). Could you please confirm whether the job IDs mentioned in that thread are the ones you wanted to delete?

Also, please specify what you would like to know about the GPU configurations on DevCloud so that we can work on it internally.


Regards,

Alekhya


Robert_C_Intel
Employee

The jobs have been deleted. Now that I can submit jobs again, I can look at the GPU info you provided.

 

 

Robert_C_Intel
Employee

Thanks for the tip on getting GPU info. It is sufficient.

 

I still can't run MPI on quad_gpu systems. It hangs in MPI_Init. This time I am trying on s012-n002. See above for the sample program and how to compile it. Here is what happens; I cannot kill it.

u50659@s012-n002:mpitest$ ldd a.out
linux-vdso.so.1 (0x00007ffd73ff8000)
libmpi.so.12 => /glob/development-tools/versions/oneapi/2022.1.2/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12 (0x00007f73becc3000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f73bead2000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f73be8e0000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f73be8d5000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f73be8cf000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f73be77e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f73be763000)
/lib64/ld-linux-x86-64.so.2 (0x00007f73c0501000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f73be740000)
u50659@s012-n002:mpitest$ ls
a.out hello.cpp hello.cpp~
u50659@s012-n002:mpitest$ I_MPI_DEBUG=3 ./a.out
Hello
MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case.
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
^C^C^C^C


^C^C^C^C^C^Z

 

AlekhyaV_Intel
Moderator

Hi Robert,


We apologize for the inconvenience. We have observed that MPI_Init hangs on all the quad_gpu nodes. We are working on this issue internally and will get back to you soon with an update.


Regards,

Alekhya


AlekhyaV_Intel
Moderator

Hey Robert,


We got an update from the admin team that the issue is resolved. MPI samples are working fine on the following quad_gpu nodes: s012-n004, s012-n003, and s012-n002. Could you please check and confirm?


Regards,

Alekhya


Robert_C_Intel
Employee

I still cannot run an MPI hello world program. Everything hangs.

 

u50659@s012-n002:mpitest$ sycl-ls
[opencl:0] ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.13.11.0.23_160000]
[opencl:0] CPU : Intel(R) OpenCL 3.0 [2021.13.11.0.23_160000]
[opencl:0] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[opencl:1] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[opencl:2] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[level_zero:0] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[level_zero:1] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[level_zero:2] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[host:0] HOST: SYCL host platform 1.2 [1.2]
u50659@s012-n002:mpitest$ mpirun -n 3 ./a.out
Hello
Hello
Hello
^C[mpiexec@s012-n002] Sending Ctrl-C to processes as requested
[mpiexec@s012-n002] Press Ctrl-C again to force abort

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 17626 RUNNING AT s012-n002
= KILLED BY SIGNAL: 2 (Interrupt)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 17627 RUNNING AT s012-n002
= KILLED BY SIGNAL: 2 (Interrupt)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 17628 RUNNING AT s012-n002
= KILLED BY SIGNAL: 2 (Interrupt)
===================================================================================

u50659@s012-n002:mpitest$ ^C
u50659@s012-n002:mpitest$ sycl-ls
^C^C^C^C^C

 

Hemanth_K_Intel
Employee

Due to minimal availability of ATS-P cards, we have reconfigured a few systems as dual-card systems for the time being while we await the arrival of additional cards.

Here are the node names.

s013-n008

s013-n009

s013-n010

s013-n011

Please ignore the quad-labeled machines for now until we add more cards.

Also, the oneAPI release has been installed on both the public and NDA sections. We will try MPI on the above systems.

AlekhyaV_Intel
Moderator

Hi Robert,


We apologize for the delay. There is an issue at the backend, and we are trying to fix the nodes. We will resolve this soon.


Regards,

Alekhya


AlekhyaV_Intel
Moderator

Hi Robert,


As per your response via private mail, we are closing this thread. If you need any further assistance, please post a new question, as this thread will no longer be monitored by Intel.


Regards,

Alekhya

