Intel® DevCloud

How can I allocate a dual or a quad GPU compute node?

brucechen057
Beginner

How can I allocate a dual or a quad Intel® Iris® Xe MAX Graphics compute node?

The command in the guide (https://devcloud.intel.com/oneapi/documentation/job-submission/) suggests something like:

qsub -l nodes=1:iris_xe_max:dual_gpu:ppn=2 job_script.sh 

However, this no longer seems to be supported:

$ qsub -I -l nodes=1:iris_xe_max:dual_gpu:ppn=2
qsub: submit error (Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty, all systems are busy, or no nodes have the requested feature))

And I can't find that kind of property anymore:

$ pbsnodes | sort | grep properties | grep iris_xe_max | uniq
properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,gpu
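
The pipeline above only shows the distinct property strings. A minimal Python sketch that instead maps each node to its attributes, so you can list which nodes (if any) advertise a feature such as dual_gpu. It assumes the Torque-style pbsnodes layout (node name on an unindented line, followed by indented "key = value" lines); parse_pbsnodes and nodes_with are hypothetical helper names, not part of any DevCloud tool.

```python
from typing import Dict, List


def parse_pbsnodes(text: str) -> Dict[str, Dict[str, str]]:
    """Map node name -> {attribute: value} from Torque-style `pbsnodes` output.

    Assumes each node block starts with an unindented node name, followed by
    indented "key = value" lines.
    """
    nodes: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines separate node blocks
        if not raw[0].isspace():
            current = raw.strip()
            nodes[current] = {}
        elif current is not None and "=" in raw:
            # Split on the first "=" only; some values contain more of them.
            key, _, value = raw.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def nodes_with(nodes: Dict[str, Dict[str, str]], *features: str) -> List[str]:
    """Sorted names of nodes whose `properties` list contains every feature."""
    wanted = set(features)
    return sorted(
        name
        for name, attrs in nodes.items()
        if wanted <= set(attrs.get("properties", "").split(","))
    )
```

On a login node the text could come from subprocess.run(["pbsnodes"], capture_output=True, text=True).stdout; an empty result from nodes_with(nodes, "iris_xe_max", "dual_gpu") would be consistent with the qsub error above.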

AlekhyaV_Intel
Moderator

Hi,

Thank you for posting in Intel Communities. Due to some internal issues, the quad_gpu and dual_gpu nodes are not available to the public. This is a known issue and the admin team is addressing it; it might take some time for those nodes to become publicly available. For now, you can connect to the compute nodes that host a single GPU.

Regards,

Alekhya

brucechen057
Beginner

Thanks for your response.
Another question regarding multiple GPUs: is there a tool, like nvidia-smi for NVIDIA GPUs, that shows how many GPUs a node has? And how can I check GPU usage to make sure my kernels are actually running on the GPUs?

AlekhyaV_Intel
Moderator

Hi,

You can view the details of the GPUs, CPUs, and other devices on a specific node with the following command:

sycl-ls
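
sycl-ls prints one line per device per backend, so the same physical GPU can appear under both OpenCL and Level Zero (compare the listings later in this thread). A minimal Python sketch for counting GPUs from that output, assuming the two line formats seen in this thread ("[level_zero:1] GPU : ..." from the older release and "[ext_oneapi_level_zero:gpu:0] ..." from the newer one) and assuming each backend lists each GPU exactly once. It is a heuristic, not an official tool; count_gpus and physical_gpus are hypothetical names.

```python
import re
from typing import Dict

# Newer sycl-ls format, e.g. "[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, ..."
NEW_STYLE = re.compile(r"^\[(?P<backend>\w+):gpu:(?P<idx>\d+)\]")
# Older sycl-ls format, e.g. "[level_zero:1] GPU : Intel(R) Level-Zero 1.2 ..."
OLD_STYLE = re.compile(r"^\[(?P<backend>\w+):(?P<idx>\d+)\]\s+GPU\s*:")


def count_gpus(output: str) -> Dict[str, int]:
    """Number of distinct GPU entries that sycl-ls reports per backend."""
    per_backend: Dict[str, set] = {}
    for line in output.splitlines():
        line = line.strip()
        match = NEW_STYLE.match(line) or OLD_STYLE.match(line)
        if match:
            per_backend.setdefault(match["backend"], set()).add(match["idx"])
    return {backend: len(ids) for backend, ids in per_backend.items()}


def physical_gpus(output: str) -> int:
    """Heuristic: every backend sees the same GPUs, so take the largest count."""
    return max(count_gpus(output).values(), default=0)
```

Applied to the s011-n003 listing below this sketch reports two GPUs per backend, and to the later s012-n004 listing it reports one.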

Regards,

Alekhya

DanDaMan
New Contributor I

s012-n004 seems to be broken:

s012-n004:~$ sycl-ls --verbose
Abort was called at 400 line in file:
/opt/src/opencl/shared/source/os_interface/linux/drm_neo.cpp
Aborted

vs

s011-n003:~$ sycl-ls
[opencl:0] ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.13.11.0.23_160000]
[opencl:0] CPU : Intel(R) OpenCL 3.0 [2021.13.11.0.23_160000]
[opencl:0] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[opencl:1] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.49.21786]
[level_zero:0] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[level_zero:1] GPU : Intel(R) Level-Zero 1.2 [1.2.21786]
[host:0] HOST: SYCL host platform 1.2 [1.2]

I tried posting a message about this, but it got marked as spam (Update: now fixed)

AlekhyaV_Intel
Moderator

Hi @DanDaMan,

Thank you for sharing your observations. We were able to reproduce your issue; there may be a problem with that node. We will contact the dev team and get it fixed.

Regards,

Alekhya

AlekhyaV_Intel
Moderator

Hi @DanDaMan,

We noticed that you opened a new thread in the community about the issue with the s012-n004 node in DevCloud, so we will continue monitoring your issue in that thread (https://community.intel.com/t5/Intel-DevCloud/s012-n004-Broken-OneAPI-VPL-GPU-iris-xe-max-not-workin...).

Regards,

Alekhya

AlekhyaV_Intel
Moderator

Hi @brucechen057,

Can we stop monitoring this thread? Please give us an update.

Regards,

Alekhya

DanDaMan
New Contributor I

Unfortunately, the other thread has been closed, and s012-n004 is still broken yet still being allocated as if it were working:

########################################################################
#      Date:           Tue 17 May 2022 08:02:39 AM PDT
#    Job ID:           1909065.v-qsvr-1.aidevcloud
#      User:           uXXXXXX
# Resources:           neednodes=s012-n004:iris_xe_max:ppn=2,nodes=s012-n004:iris_xe_max:ppn=2,walltime=24:00:00
########################################################################

uXXXXXX@s012-n004:~$ sycl-ls --verbose
Abort was called at 456 line in file:
/opt/src/opencl/shared/source/os_interface/linux/drm_neo.cpp
Aborted

While there is a workaround (manually searching for and allocating specific nodes), is there a timeline for when this node might be fixed, or at least removed from the pool of iris_xe_max/gpu nodes, so that jobs relying on those features don't fail at random?
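
The manual workaround amounts to pinning the job to a named node, in the same nodes=<node>:iris_xe_max:ppn=2 form shown in the Resources line of the job header above. A minimal Python sketch that builds such a qsub command while skipping nodes you have seen fail; qsub_argv_for and the known_bad set are hypothetical, and the candidate list is assumed to come from your own pbsnodes search.

```python
from typing import Iterable, List, Optional, Set


def qsub_argv_for(
    candidates: Iterable[str],
    known_bad: Set[str],
    feature: str = "iris_xe_max",
    ppn: int = 2,
    script: Optional[str] = None,
) -> Optional[List[str]]:
    """Build a qsub command pinned to the first candidate node not in known_bad.

    With no script, request an interactive session (qsub -I), as in the
    commands quoted in this thread. Returns None if every candidate is bad.
    """
    for node in candidates:
        if node in known_bad:
            continue  # e.g. a node whose sycl-ls aborts or hangs
        argv = ["qsub"]
        if script is None:
            argv.append("-I")
        argv += ["-l", f"nodes={node}:{feature}:ppn={ppn}"]
        if script is not None:
            argv.append(script)
        return argv
    return None
```

The returned list can be handed to subprocess.run on a login node; keeping the bad-node set explicit makes it easy to drop an entry once the admins report the node fixed.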

AlekhyaV_Intel
Moderator

Hi @DanDaMan,

We apologize for the inconvenience. We have contacted the admin team about this issue, and they are working on fixing that particular node. It might take some time. For now, you can continue your work on other nodes that are working.

Regards,

Alekhya

AlekhyaV_Intel
Moderator

Hi @DanDaMan,

We ran the command below on the s012-n004 node, and the node now seems to be working fine:

sycl-ls --verbose

Could you please check whether the issue still persists and let us know?

Regards,

Alekhya

DanDaMan
New Contributor I

It's different, but still broken. Now it just hangs... 

uXXXXXX@s012-n004:~$ sycl-ls
    [...hang]

vs

uXXXXXXX@s001-n012:~$ sycl-ls --verbose
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.13.3.0.16_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz 3.0 [2022.13.3.0.16_160000]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]

Platforms: 3
Platform [#1]:
    Version  : OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
    Name     : Intel(R) FPGA Emulation Platform for OpenCL(TM)
    Vendor   : Intel(R) Corporation
   [...etc]

DanDaMan
New Contributor I

s012-n004 seems to be working now, although a number of other Xe nodes seem to be in various broken states (see https://community.intel.com/t5/Intel-DevCloud/Video-support-removed-from-iris-xe-max-nodes/m-p/13965...).

s012-n004:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.13.3.0.16_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz 3.0 [2022.13.3.0.16_160000]
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Iris(R) Xe MAX Graphics [0x4905] 3.0 [22.10.22597]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe MAX Graphics [0x4905] 1.3 [1.3.22597]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]

DanDaMan
New Contributor I

Sorry, I just noticed that I ran sycl-ls but forgot --verbose.

I can confirm that s012-n004 now works with --verbose but still hangs without it, whereas s001-n012 works fine without --verbose.

s001-n012:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.13.3.0.16_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz 3.0 [2022.13.3.0.16_160000]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]
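
Because the broken node failed in two different ways in this thread (sycl-ls aborting in the driver, then hanging), a job script could probe the node before doing real work. A minimal Python sketch under these assumptions: a hang is anything exceeding a chosen timeout (30 s is arbitrary), an abort is a SIGABRT exit or the "Abort was called" message quoted earlier, and probe and check_node are hypothetical names. It checks both sycl-ls and sycl-ls --verbose, since they behaved differently on s012-n004.

```python
import signal
import subprocess
from typing import Dict, List


def probe(cmd: List[str], timeout: float = 30.0) -> str:
    """Run cmd and classify it as 'ok', 'aborted', 'hung', 'failed' or 'missing'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return "missing"  # command not on PATH (e.g. oneAPI environment not loaded)
    except subprocess.TimeoutExpired:
        return "hung"  # subprocess.run kills the child before raising
    if proc.returncode == 0:
        return "ok"
    # On POSIX a child killed by a signal reports -signum, so SIGABRT gives -6.
    if proc.returncode == -signal.SIGABRT or "Abort was called" in proc.stdout + proc.stderr:
        return "aborted"
    return "failed"


def check_node(timeout: float = 30.0) -> Dict[str, str]:
    """Probe plain and verbose sycl-ls, which behaved differently on s012-n004."""
    return {
        "sycl-ls": probe(["sycl-ls"], timeout),
        "sycl-ls --verbose": probe(["sycl-ls", "--verbose"], timeout),
    }
```

A job script could exit early whenever either entry of check_node() is not "ok", rather than failing partway through the real workload.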

AlekhyaV_Intel
Moderator

Hi @DanDaMan,

The issue regarding the various broken Iris Xe Max nodes is being handled in this thread (https://community.intel.com/t5/Intel-DevCloud/Video-support-removed-from-iris-xe-max-nodes/m-p/13965...).

Since you've confirmed that the s012-n004 node is working fine, we are closing this thread now. If you need any further assistance, please post a new question, as this thread will no longer be monitored by Intel.


Regards,

Alekhya

