I cannot run a simple MPI program on s012-n004. It hangs in MPI_Init.
u50659@s012-n004:mpitest$ cat hello.cpp
#include <iostream>
#include "mpi.h"
int main(int argc, char * argv[]) {
    std::cout << "Hello" << std::endl;
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::cout << "Hello from " << rank << " of " << size << "\n";
    MPI_Finalize();
}
u50659@s012-n004:mpitest$ which mpicxx
/glob/development-tools/versions/oneapi/2022.1.2/oneapi/mpi/2021.5.1//bin/mpicxx
u50659@s012-n004:mpitest$ mpicxx hello.cpp
u50659@s012-n004:mpitest$ ldd a.out
linux-vdso.so.1 (0x00007fff3dd54000)
libmpi.so.12 => /glob/development-tools/versions/oneapi/2022.1.2/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12 (0x00007f53aa95c000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f53aa76b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f53aa579000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f53aa56e000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f53aa568000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f53aa417000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f53aa3fc000)
/lib64/ld-linux-x86-64.so.2 (0x00007f53ac19a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f53aa3d9000)
u50659@s012-n004:mpitest$ export I_MPI_DEBUG=3
u50659@s012-n004:mpitest$ mpirun -n 2 ./a.out
Hello
Hello
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
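(Aside: when comparing behavior across nodes, a variant of the hello world that also reports the host can help. This is a minimal sketch using the standard MPI_Get_processor_name call, not part of the failing reproducer above:)
// Minimal sketch: hello-world variant that also reports the host name,
// useful for telling which node each rank landed on. Standard MPI calls only.
#include <iostream>
#include "mpi.h"
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int size, rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);  // fills host with the node's name
    std::cout << "Hello from " << rank << " of " << size << " on " << host << "\n";
    MPI_Finalize();
}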
Hi Robert,
Thank you for posting in Intel Communities. We couldn't access that particular node, "s012-n004", as it is currently down, so we tried to reproduce your error on a different node, "s001-n013", and it worked without any errors or hanging. We have attached a screenshot below.
We suggest you try a different node and let us know whether the issue persists.
Hope this helps!
Regards,
Alekhya
It works fine for me on other nodes. When I requested quad_gpu, it kept giving me that node. I will try to get a quad_gpu on a different node.
Is there a page that describes the GPU configurations? This is what I think I see (a sketch for verifying it follows the list):
gpu: integrated gpu
dual_gpu: discrete gpu, no integrated gpu available
quad_gpu: 3 discrete gpus, no integrated gpu available
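In the absence of documentation, one way to verify what a node actually exposes is to enumerate the SYCL platforms and devices directly. This is a minimal sketch, assuming the oneAPI DPC++ compiler available on DevCloud (compile with dpcpp); the header name may differ in newer releases:
// Minimal sketch: list the platforms and devices a node exposes.
// Assumes a oneAPI DPC++ compiler; newer releases use <sycl/sycl.hpp>.
#include <CL/sycl.hpp>
#include <iostream>
int main() {
    for (const auto &platform : sycl::platform::get_platforms()) {
        std::cout << platform.get_info<sycl::info::platform::name>() << "\n";
        for (const auto &device : platform.get_devices()) {
            std::cout << "  " << device.get_info<sycl::info::device::name>()
                      << (device.is_gpu() ? " [GPU]" : "") << "\n";
        }
    }
}
Its output should line up with what sycl-ls reports on the same node.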
This morning, I am unable to get any nodes. It doesn't seem likely that none are available, so I suspect I have some hung jobs that are blocking me. Is there something I need to do to clear the old jobs?
u50659@login-2:~$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1840347.v-qsvr-1          STDIN            u50659          00:00:04 R batch
1841484.v-qsvr-1          STDIN            u50659          00:00:05 R batch
u50659@login-2:~$ qselect
1840347.v-qsvr-1.aidevcloud
1841484.v-qsvr-1.aidevcloud
u50659@login-2:~$ qselect | xargs qdel
qdel: Server could not connect to MOM 1840347.v-qsvr-1.aidevcloud
qdel: Server could not connect to MOM 1841484.v-qsvr-1.aidevcloud
u50659@login-2:~$ qsub -I -l nodes=1:quad_gpu:ppn=2 -d .
qsub: waiting for job 1841923.v-qsvr-1.aidevcloud to start
^CDo you wish to terminate the job and exit (y|[n])? y
Job 1841923.v-qsvr-1.aidevcloud is being deleted
u50659@login-2:~$ qsub -I -l nodes=1:dual_gpu:ppn=2 -d .
qsub: waiting for job 1841925.v-qsvr-1.aidevcloud to start
^CDo you wish to terminate the job and exit (y|[n])? y
Job 1841925.v-qsvr-1.aidevcloud is being deleted
u50659@login-2:~$ qsub -I -l nodes=1:ppn=2 -d .
qsub: waiting for job 1841926.v-qsvr-1.aidevcloud to start
^CDo you wish to terminate the job and exit (y|[n])? y
Job 1841926.v-qsvr-1.aidevcloud is being deleted
u50659@login-2:~$
Hi @Robert_C_Intel ,
- Regarding the GPUs available on DevCloud, we currently don't have any documentation. Could you please elaborate on what you would like to know about the GPUs so that we can work on it internally?
- As for the quad_gpu compute nodes, you can check the list of compute nodes with the command below:
pbsnodes
This command lists all the compute nodes present in DevCloud along with their properties and state.
If you would like to list all the nodes with a particular property, you can filter the output:
pbsnodes | grep "<property>" -A 6 -B 4
Example:
For quad_gpu & dual_gpu nodes,
pbsnodes | grep "quad_gpu" -A 6 -B 4
pbsnodes | grep "dual_gpu" -A 6 -B 4
The output looks like below:
s012-n003
     state = free
     power_state = Running
     np = 2
     properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
     ntype = cluster
     status = rectime=1643698603,macaddr=d4:5d:64:08:fb:3c,cpuclock=Fixed,varattr=,jobs=,state=free,netload=42477541384,gres=,loadave=9.00,ncpus=24,physmem=32562336kb,availmem=33649520kb,totmem=34659484kb,idletime=127019,nusers=3,nsessions=4,sessions=126670 126685 1166623 1172551,uname=Linux s012-n003 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
You could check for free nodes and request a specific one using the command below:
qsub -I -l nodes=<node name>:ppn=2
- To delete a job, please try the commands below:
qdel <job-ID>
or
qdel all
If the jobs are not deleted even after trying the above commands, please send us the job IDs and we will try deleting them from our end.
Regards,
Alekhya
Hi Robert,
We could see that you've shared the job IDs in this thread (https://community.intel.com/t5/Intel-DevCloud/Cannot-get-interactive-node/m-p/1356356/emcs_t/S2h8ZW1haWx8Ym9hcmRfc3Vic2NyaXB0aW9ufEtaNDZUQjUwTUtQMVNPfDEzNTYzNTZ8U1VCU0NSSVBUSU9OU3xoSw#M4240). Could you please confirm whether the job IDs mentioned in that thread are the ones you wanted to delete?
Also, please specify what you want to know about GPU configurations on DevCloud so that we can work on it internally.
Regards,
Alekhya
The jobs have been deleted. Now that I can submit jobs again, I can look at the GPU info you provided.
Thanks for the tip on getting GPU info. It is sufficient.
I still can't run MPI on quad_gpu systems. It hangs in MPI_Init. This time I am trying on s012-n002. See above for the sample program and how to compile it. Here is what happens; I cannot kill it.
u50659@s012-n002:mpitest$ ldd a.out
linux-vdso.so.1 (0x00007ffd73ff8000)
libmpi.so.12 => /glob/development-tools/versions/oneapi/2022.1.2/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12 (0x00007f73becc3000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f73bead2000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f73be8e0000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f73be8d5000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f73be8cf000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f73be77e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f73be763000)
/lib64/ld-linux-x86-64.so.2 (0x00007f73c0501000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f73be740000)
u50659@s012-n002:mpitest$ ls
a.out hello.cpp hello.cpp~
u50659@s012-n002:mpitest$ I_MPI_DEBUG=3 ./a.out
Hello
MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case.
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
^C^C^C^C
^C^C^C^C^C^Z
Hi Robert,
We apologize for the inconvenience caused. We can confirm that MPI_Init hangs on all the quad_gpu nodes. We are working on this issue internally and will get back to you soon with an update.
Regards,
Alekhya
Hey Robert,
We got an update from the admin team that the issue is resolved. MPI samples are working fine on the following quad_gpu nodes: s012-n004, s012-n003, and s012-n002. Could you please check and confirm?
Regards,
Alekhya
I still cannot run an MPI hello world program. Everything hangs.
u50659@s012-n002:mpitest$ sycl-ls
[opencl:0] ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.13.11.0.23_160000]
[opencl:0] CPU : Intel(R) OpenCL 3.0 [2021.13.11.0.23_160000]
[opencl:0] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[opencl:1] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[opencl:2] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[level_zero:0] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[level_zero:1] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[level_zero:2] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[host:0] HOST: SYCL host platform 1.2 [1.2]
u50659@s012-n002:mpitest$ mpirun -n 3 ./a.out
Hello
Hello
Hello
^C[mpiexec@s012-n002] Sending Ctrl-C to processes as requested
[mpiexec@s012-n002] Press Ctrl-C again to force abort
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 17626 RUNNING AT s012-n002
= KILLED BY SIGNAL: 2 (Interrupt)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 17627 RUNNING AT s012-n002
= KILLED BY SIGNAL: 2 (Interrupt)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 17628 RUNNING AT s012-n002
= KILLED BY SIGNAL: 2 (Interrupt)
===================================================================================
u50659@s012-n002:mpitest$ ^C
u50659@s012-n002:mpitest$ sycl-ls
^C^C^C^C^C
Due to minimal availability of ATS-P cards, we have reconfigured a few systems as dual-card systems for the time being while we await the arrival of additional cards.
Here are the node names:
s013-n008
s013-n009
s013-n010
s013-n011
Please ignore the quad-labeled machines for now until we add more cards.
Also, the oneAPI release has been installed on both the public and NDA sections. Please try MPI on the above systems.
Hi Robert,
We apologize for the delay. There is an issue on the backend, and we are trying to fix it on the affected nodes. We will resolve this soon.
Regards,
Alekhya
Hi Robert,
As per your response through private mail, we are closing this thread. If you need any further assistance, please post a new question, as this thread will no longer be monitored by Intel.
Regards,
Alekhya
