Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

Interactive jobs fail to start

User01
New Contributor II

As of the time of this post (Tuesday, November 14, 2023, 9:09:57 AM ET), Intel DevCloud is not starting interactive jobs. Typically, "qsub -I" returns almost immediately with a connection to a compute node. Currently, it hangs at "waiting for job":

u123456@login-2:~$ qsub -I
qsub: waiting for job 2429415.v-qsvr-1.aidevcloud to start
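
While "qsub -I" sits at "waiting for job", the job's state can be checked from a second login shell. The sketch below assumes DevCloud's scheduler is Torque-style (the qsub, pbsnodes, and nodes=...:ppn=... syntax in this thread suggests so) and that "qstat -f <job id>" prints indented "key = value" lines including job_state. The sample text is illustrative, not captured DevCloud output; only the job ID comes from the post above.

```python
# Single-letter job states as documented for Torque-style qstat.
JOB_STATES = {
    "Q": "queued",
    "R": "running",
    "H": "held",
    "E": "exiting",
    "C": "completed",
    "W": "waiting",
}


def parse_qstat_full(text):
    """Collect 'key = value' fields from `qstat -f`-style text (assumed format)."""
    fields = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Job Id:"):
            fields["Job Id"] = stripped.split(":", 1)[1].strip()
        elif " = " in stripped:
            key, _, value = stripped.partition(" = ")
            fields.setdefault(key, value)
    return fields


def describe_job(text):
    """Return a one-line summary such as '<job id>: queued'."""
    fields = parse_qstat_full(text)
    code = fields.get("job_state", "?")
    return f'{fields.get("Job Id", "unknown job")}: {JOB_STATES.get(code, "unknown state")}'


# Illustrative sample only; field values other than the job ID are assumptions.
SAMPLE = """\
Job Id: 2429415.v-qsvr-1.aidevcloud
    Job_Name = STDIN
    job_state = Q
    queue = batch
"""

print(describe_job(SAMPLE))
```

A job that stays in the queued state for a long time points to the scheduler not finding a matching node, rather than to a problem with the login session.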

 

AlekhyaV_Intel
Moderator

Hi,

 

Thank you for posting in Intel Communities. This could be a temporary issue caused by heavy user load on DevCloud.

Could you please try to request an interactive node again and let us know if the issue still persists?

Please follow this link for more information: https://devcloud.intel.com/oneapi/documentation/job-submission/

 

If this resolves your issue, please accept this as the solution. This helps others with similar issues. Thank you!

 

Regards,

Alekhya

 

 

User01
New Contributor II

This is still a problem as of the time of this post. To determine whether heavy user load on DevCloud is the cause, please review the job system status for the current load on the system.

User01
New Contributor II

This is still a problem as of the time of this post, Thursday, November 16, 2023, 9:06:55 AM ET.

User01
New Contributor II

A workaround is to request node attributes other than the default:

 

$ qsub -I -l nodes=1:gold6348:ppn=2

 

I have not tried all of the variations, but it seems some set of nodes is locked up. Nodes with the core and gold6348 properties seem to be responsive, but whatever the defaults are when submitting with only "qsub -I" trigger the problem.

 

$ pbsnodes | grep "properties =" | awk '{print $3}' | sort | uniq -c
22 core,tgl,i9-11900kb,ram32gb,netgbe,gpu,gen11
78 xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
4 xeon,clx,ram192gb,net1gbe,batch,extended,fpga,stratix10,fpga_runtime
6 xeon,icx,gold6348,ramgb,netgbe,jupyter,batch
7 xeon,icx,plat8380,ram2tb,net1gbe,batch
3 xeon,skl,gold6128,ram192gb,net1gbe,fpga_runtime,fpga,agilex
12 xeon,skl,gold6128,ram192gb,net1gbe,fpga_runtime,fpga,arria10
79 xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch
26 xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch,fpga_compile
12 xeon,skl,ram384gb,net1gbe,renderkit
4 xeon,spr,max9480,ram256gb,netgbe,batch,hbm
1 xeon,spr,plat8480,ram512gb,netgbe,dual_gpu,hbm2e,gpu,max,max_1100
2 xeon,spr,ram1024gb,netgbe,dnp50
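
The pipeline above counts nodes per property set but does not show which of them are free. The sketch below combines the two, assuming Torque-style pbsnodes output in which an unindented node name is followed by indented "state = ..." and "properties = ..." lines. The node names (node-a, node-b, node-c) and states are hypothetical; the property strings are copied from the listing above.

```python
def parse_pbsnodes(text):
    """Parse Torque-style `pbsnodes` text into a list of per-node dicts."""
    nodes = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate node records
        if not line[0].isspace():
            # An unindented line is assumed to be a node name.
            current = {"name": line.strip()}
            nodes.append(current)
        elif current is not None:
            key, sep, value = line.strip().partition(" = ")
            if sep:
                current[key] = value
    return nodes


def free_by_property(nodes):
    """Map each individual property to a (free_nodes, total_nodes) tuple."""
    summary = {}
    for node in nodes:
        is_free = node.get("state") == "free"
        for prop in filter(None, node.get("properties", "").split(",")):
            free, total = summary.get(prop, (0, 0))
            summary[prop] = (free + int(is_free), total + 1)
    return summary


# Hypothetical sample: node names and states are made up for illustration.
SAMPLE = """\
node-a
     state = free
     properties = xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch

node-b
     state = job-exclusive
     properties = xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch

node-c
     state = free
     properties = xeon,icx,plat8380,ram2tb,net1gbe,batch
"""

summary = free_by_property(parse_pbsnodes(SAMPLE))
for prop in sorted(summary):
    free, total = summary[prop]
    print(f"{prop}: {free}/{total} free")
```

Passing the real pbsnodes output as the text argument would show which properties currently have free nodes and are therefore reasonable candidates for the "-l nodes=1:<property>:ppn=2" workaround.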

AlekhyaV_Intel
Moderator

Hi,


Thanks for sharing your observations. We are unable to reproduce the issue from our oneAPI DevCloud account, so we assume it could be account-specific. Could you please share your DevCloud user ID so that we can debug the issue further?


Regards,

Alekhya


User01
New Contributor II

As of Mon 20 Nov 2023 09:02:50 AM EST, the problem is no longer occurring. A test run at this time succeeded: running `qsub -I` returned an interactive Bash prompt on a compute node after only a second or so of delay.

 

My DevCloud user ID and the job ID it created are below:

u133615@login-2:~$ qsub -I
qsub: waiting for job 2433037.v-qsvr-1.aidevcloud to start
qsub: job 2433037.v-qsvr-1.aidevcloud ready


########################################################################
# Date: Mon 20 Nov 2023 06:01:49 AM PST
# Job ID: 2433037.v-qsvr-1.aidevcloud
# User: u133615
# Resources: cput=75:00:00,neednodes=1:batch:ppn=2,nodes=1:batch:ppn=2,walltime=06:00:00
########################################################################
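
The Resources line in the banner shows what a bare "qsub -I" resolved to for this job, which makes the default visible for comparison with the workaround request (nodes=1:gold6348:ppn=2). A minimal sketch of splitting that line apart follows; the function names are hypothetical helpers, and the parsing assumes the comma-separated "key=value" layout seen in the banner.

```python
def parse_resources(resources):
    """Split a banner 'Resources:' value into a dict of key -> raw value."""
    result = {}
    for item in resources.split(","):
        key, _, value = item.partition("=")  # split on the first '=' only
        result[key.strip()] = value.strip()
    return result


def parse_nodes_spec(spec):
    """Break a spec like '1:batch:ppn=2' into count, properties, and ppn."""
    parts = spec.split(":")
    count = int(parts[0])
    properties = []
    ppn = None
    for part in parts[1:]:
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])
        else:
            properties.append(part)
    return {"count": count, "properties": properties, "ppn": ppn}


# Resources value copied from the job banner above.
BANNER_RESOURCES = (
    "cput=75:00:00,neednodes=1:batch:ppn=2,"
    "nodes=1:batch:ppn=2,walltime=06:00:00"
)

resources = parse_resources(BANNER_RESOURCES)
default_spec = parse_nodes_spec(resources["nodes"])
workaround_spec = parse_nodes_spec("1:gold6348:ppn=2")

print("default properties:", default_spec["properties"])
print("workaround properties:", workaround_spec["properties"])
```

For this job the default request carried the batch property, whereas the workaround named gold6348 explicitly.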

 

AlekhyaV_Intel
Moderator

Hi,


Glad to know that you're able to connect to the compute node. If you face this issue again, please post a new question as this thread will no longer be monitored by Intel.


Regards,

Alekhya

