I have a frozen job on DevCloud. The time quota was 6 hours, but the job has been running for more than 62 hours.
I tried to kill it with
qdel <job id>
but I get
qdel: Server could not connect to MOM <job id>
Any idea what to do?
Hi,
I have forwarded your issue to the owner of the DevCloud platform and am waiting to hear back. I will ask them to reply to your post directly. Please give us a couple of days.
-Hazlina
Do you know which server you launched the job from? If so, log back into that server, find the job's process with ps -auxw, and kill -9 its process ID. Sometimes that kills the job. Make sure you use the walltime construct in batch mode so a job cannot overrun like this in the future.
Thanks,
Larry
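As a rough sketch of the ps and kill suggestion above: the snippet below starts a stand-in process (a hypothetical `sleep 300`, not a real DevCloud job), finds it with `ps`, and terminates it. Note that `kill` takes the process ID shown by `ps`, not the scheduler's job ID, and that this only helps if the node running the job is still reachable.

```shell
#!/bin/bash
# Stand-in for a stuck job process (hypothetical); on a real node the
# process would already be running and you would skip this step.
sleep 300 &
stuck_pid=$!

# List your own processes with PID, elapsed time, and command line.
# The [s] trick keeps grep from matching its own command line.
ps -u "$(id -un)" -o pid,etime,args | grep '[s]leep 300'

# Try a normal termination first (SIGTERM), then force it (SIGKILL, i.e. kill -9)
# only if the process is still alive.
kill -TERM "$stuck_pid"
sleep 1
if kill -0 "$stuck_pid" 2>/dev/null; then
    kill -KILL "$stuck_pid"
fi

# Reap the background process so it does not linger as a zombie.
wait "$stuck_pid" 2>/dev/null
echo "process $stuck_pid terminated"
```

Trying SIGTERM before SIGKILL gives the process a chance to clean up; `kill -9` is the last resort.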
Let me add: if you post here and don't see a response, try fpgauniversity@intel.com. We have a fairly small team moderating technical inquiries on the FPGA DevCloud, and we don't check the forum frequently.
Thanks
Larry
Thanks Lawrence,
I already sent them 2 emails (last Saturday and yesterday), but I have had no response.
Let me add (for others having the same problem) that the DevCloud team finally cancelled my pending job.
As general advice, always include a deadline (a walltime limit) in your batch jobs, to avoid problems with the queueing system if something unexpected happens.
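A minimal sketch of what such a deadline could look like, assuming the DevCloud queue is a Torque/PBS-style scheduler (the `qdel` error mentioning MOM suggests so). The file name `walltime_job.sh`, the job name, and the `./my_workload` command are hypothetical placeholders; the 6-hour value mirrors the quota mentioned in the question.

```shell
#!/bin/bash
# Write a small PBS/Torque job script that carries an explicit walltime limit.
# File name, job name, and workload command are hypothetical placeholders.
cat > walltime_job.sh <<'EOF'
#!/bin/bash
#PBS -N walltime_demo
#PBS -l walltime=06:00:00
# Run from the directory the job was submitted from (set by the scheduler).
cd "${PBS_O_WORKDIR:-.}"
./my_workload
EOF

echo "wrote walltime_job.sh"

# Submit from a login node (left commented out so the sketch runs anywhere):
# qsub walltime_job.sh
```

The same limit can also be passed on the command line, as in `qsub -l walltime=06:00:00 walltime_job.sh`. A walltime limit is enforced by the scheduler, so it may not help if the scheduler loses contact with the compute node, as happened in this thread.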
The problem is that the node s005-n005, which was running the job, went down (I don't know why), and the queue system has lost control of the job.
I cannot log in to s005-n005 because it is not running.
Apparently, with admin privileges, the problem could be solved simply by running
qdel -p 18216.v-qsvr-fpga.aidevcloud
