Application Acceleration With FPGAs

Frozen Job in DevCloud

davidcastells
New Contributor I

I have a frozen job in DevCloud. Its time quota was 6 hours, but it has been running for more than 62 hours.

I tried to kill it with

qdel <job id>

but I get

qdel: Server could not connect to MOM <job id>

Any idea what to do?

1 Solution
davidcastells
New Contributor I

Let me add (for others having the same problem) that the DevCloud team finally cancelled my pending job.

A good general practice is to always include a deadline (a walltime limit) in your batch jobs, to avoid issues with the queueing system if something strange happens.
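
As one way to follow this advice on a Torque/PBS-style queue (the kind that `qsub` and `qdel` belong to), the deadline can be expressed as a walltime limit. The sketch below is only an illustration: the file name `job.sh`, the job name `my_fpga_job`, the 6-hour value, and `run_workload.sh` are hypothetical placeholders, and it assumes the DevCloud queue honors the standard `-l walltime` resource request.

```shell
#!/bin/bash
# Write a minimal job script whose #PBS directive caps the run time.
# job.sh, my_fpga_job and run_workload.sh are hypothetical names.
cat > job.sh <<'EOF'
#!/bin/bash
#PBS -N my_fpga_job
#PBS -l walltime=06:00:00
# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
cd "$PBS_O_WORKDIR"
./run_workload.sh
EOF

# Commands to run on the login node (only printed here, not executed):
echo "qsub job.sh"
# The same limit can also be requested on the command line:
echo "qsub -l walltime=06:00:00 job.sh"
```

Either form asks the scheduler to stop the job once it has run for six hours.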

6 Replies
Hazlina_R_Intel
Moderator

Hi,

I have forwarded your issue to the owner of the DevCloud platform and am waiting to hear back. I have asked them to reply to your post directly. Please give us a couple of days on this.


-Hazlina


Lawrence_L_Intel
Employee

Do you know which server you launched the job from? If so, log back into that same server, run ps -auxw to find the job's process, and kill -9 its process ID. Sometimes that kills the job. Make sure you use the walltime construct in batch mode so you don't time out in the future.
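
The ps and kill steps above could be wrapped roughly as follows. This is a sketch, not a DevCloud-specific tool: the function names `list_my_procs` and `kill_stuck` are made up for illustration, and it tries a plain SIGTERM before falling back to the `kill -9` (SIGKILL) mentioned above. It only helps if the process is visible on the machine you are logged into.

```shell
#!/bin/bash
# list_my_procs: print PID, elapsed time and command line of your own processes.
list_my_procs() {
    ps -u "$(id -un)" -o pid=,etime=,args=
}

# kill_stuck PID: send SIGTERM first, then SIGKILL (kill -9) if it is still there.
kill_stuck() {
    local pid="$1"
    # Fails (returns 1) if the PID does not exist or is not yours.
    kill -TERM "$pid" 2>/dev/null || return 1
    sleep 2
    # kill -0 sends no signal; it only checks that the PID can still be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}
```

A typical use would be to run `list_my_procs`, pick out the PID of the stuck process by its elapsed time, and pass that PID to `kill_stuck`.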

Thanks,

Larry

Lawrence_L_Intel
Employee

Let me add: if you post here and don't see a response, try fpgauniversity@intel.com. We have a fairly small team moderating technical inquiries on the FPGA DevCloud, and we don't check the forum frequently.

Thanks

Larry

 

davidcastells
New Contributor I

Thanks Lawrence,
I already sent them 2 emails (last Saturday and yesterday), but I have had no response.

davidcastells
New Contributor I

The problem is that node s005-n005, which was running the job, went down (I don't know why), and the queue system has lost control of the job.

I cannot log in to s005-n005 because it is not running.

Apparently, someone with admin privileges could solve the problem simply by running

qdel -p 18216.v-qsvr-fpga.aidevcloud

 
