Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

a10 node crashes and job does not terminate

z24tao
Beginner
423 Views

Hi,

 

I'm running a process on an Arria 10 node (release 1.2.1) via devcloud_login (job ID 31409.v-qsvr-fpga.aidevcloud). The job appears to have crashed the node, and I can no longer connect to any other compute node. I am also unable to kill the job:

 

qdel 31409.v-qsvr-fpga.aidevcloud

returns:

qdel: Server could not connect to MOM 31409.v-qsvr-fpga.aidevcloud

 

and

qdel -p 31409.v-qsvr-fpga.aidevcloud

returns:

qdel: Unauthorized Request 31409.v-qsvr-fpga.aidevcloud

 

I believe the problem resolves itself after a few (6) hours, once the job times out, but is there any way to resolve it sooner?


4 Replies
BoonBengT_Intel
Moderator
362 Views

Hi @z24tao


Thank you for posting in the Intel community forum; I hope all is well.

I am unsure what kind of workload is being run on the node, so I cannot comment much on why the crash happened.

To check the progress/status of a job submitted on the DevCloud, you can use the command below:

- watch -n 1 qstat -n -1


You may also want to check the node specifications to pick nodes with the appropriate compute power for your design. You can do so by listing the nodes with the 'pbsnodes' command.


Note: more details on these commands can be found here: https://devcloud.intel.com/oneapi/documentation/job-submission/
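To make it easier to note which node a job is running on, here is a minimal sketch that pulls the node name out of the scheduler's full job listing. It assumes the DevCloud scheduler prints an `exec_host = <node>/<slot>` line in `qstat -f` output, as Torque/PBS normally does; the function names and the `JOB_ID` variable are made up for illustration.

```shell
# Print the exec_host value from `qstat -f`-style text read on stdin.
# Assumption: the listing has a line like "    exec_host = s001-n139/0".
exec_host_from_qstat() {
    awk -F' = ' '/^[ \t]*exec_host = / { print $2; exit }'
}

# Strip everything from the first "/" onward to get a bare node name.
# For a multi-node job this yields only the first node in the list.
node_name() {
    printf '%s\n' "${1%%/*}"
}

# Hypothetical usage on a login node (needs qstat on PATH):
#   host=$(qstat -f "$JOB_ID" | exec_host_from_qstat)
#   node_name "$host" >> ~/job_nodes.txt
```

Recording the node right after the job starts means the information is still available if the session is later dropped.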


Hope that clarifies.

Best Wishes

BB


z24tao
Beginner
356 Views

Hey,

 

Thanks for the reply! I'm not entirely sure how to describe the workload being computed on the node, but if it does crash, is there anything I can do other than wait for it to fix itself?

 

Best,

Derek

BoonBengT_Intel
Moderator
346 Views

Hi @z24tao


Yes, there is something you can do. When the node crashes, presumably you get logged out of the DevCloud.

I would suggest logging back into the same node (i.e., you will need to note down which node you were connected to, or which node your jobs are running on), then running the watch command and using qdel to delete the jobs.
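If qdel still fails right after you log back in (for example with the "could not connect to MOM" error from the original post), one option is to retry it at intervals. This is only a rough sketch: the `retry` helper is hypothetical, not a DevCloud tool, and it assumes qdel will start succeeding once the node is reachable again.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Do not sleep after the final attempt.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Hypothetical usage: try qdel once a minute, for up to ten attempts.
#   retry 10 60 qdel 31409.v-qsvr-fpga.aidevcloud
```

You can keep the watch command running in a second terminal to see when the job actually disappears from the queue.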

Hope that helps.


Best Wishes

BB


BoonBengT_Intel
Moderator
338 Views

Hi @z24tao,


Good to know that we managed to clarify your doubts. This thread will now be transitioned to community support for any further questions, and we will no longer monitor it.

Thank you for the questions; as always, it is a pleasure having you here.


Best Wishes

BB

