Hi,
I have been experiencing some problems with queued jobs for a few days now.
The first thing I noticed was that, although I explicitly request a walltime of X hours, the qstat information shows 35h.
The other problem is that the tracked elapsed time is wrong, which causes the jobs to be cancelled automatically ahead of time. Here is an example: the reported execution time is 24h 03min 47s.
This is wrong, because after cancelling the job and checking the elapsed time against the system clock, you can see that only about 4 hours had passed (09:22 PM to 01:24 AM):
########################################################################
# Date: Thu 24 Nov 2022 09:22:07 PM PST
# Job ID: 2055054.v-qsvr-1.aidevcloud
# User: u137524
# Resources: cput=35:00:00,neednodes=1:gold6128:ppn=2,nodes=1:gold6128:ppn=2,walltime=10:00:00
########################################################################
...
########################################################################
# End of output for job 2055054.v-qsvr-1.aidevcloud
# Date: Fri 25 Nov 2022 01:24:47 AM PST
########################################################################
Did you notice any similar behavior or is it just my problem?
Thank you very much for your help!
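For anyone checking their own logs, the two `# Date:` lines give the wall-clock duration directly. Below is a minimal Python sketch, assuming both timestamps are in the same time zone (the zone abbreviation is simply dropped) and that the default English locale is in use; `parse_log_date` is a hypothetical helper name, not part of any DevCloud tooling.

```python
from datetime import datetime

def parse_log_date(line: str) -> datetime:
    """Parse a '# Date: ...' header line from the job output into a naive datetime."""
    stamp = line.split("Date:", 1)[1].strip()
    # Drop the trailing zone abbreviation (e.g. "PST"); assumes start and end share a zone.
    stamp = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(stamp, "%a %d %b %Y %I:%M:%S %p")

start = parse_log_date("# Date: Thu 24 Nov 2022 09:22:07 PM PST")
end = parse_log_date("# Date: Fri 25 Nov 2022 01:24:47 AM PST")

elapsed = end - start
print("wall-clock time:", elapsed)
```

Comparing this figure with the time reported by the scheduler shows how far the two diverge for a given job.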
Reviewing old results, I just saw that the parameter "cput=35:00:00" did not appear in the headers before. For example, here is a run from November 3:
########################################################################
# Date: Thu 03 Nov 2022 01:58:25 AM PDT
# Job ID: 2024973.v-qsvr-1.aidevcloud
# User: u137524
# Resources: neednodes=1:gold6128:ppn=2,nodes=1:gold6128:ppn=2,walltime=05:00:00
########################################################################
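To compare headers like these without reading them by eye, the `# Resources:` line can be split into key/value pairs. This is only a sketch: `parse_resources` is a hypothetical helper, and it assumes the line keeps the comma-separated `key=value` layout seen in the two logs in this thread.

```python
def parse_resources(line: str) -> dict:
    """Split a '# Resources: k=v,k=v,...' header line into a dict."""
    payload = line.split("Resources:", 1)[1].strip()
    # Split each item on the first '=' only, since values such as
    # '1:gold6128:ppn=2' contain '=' themselves.
    return dict(item.split("=", 1) for item in payload.split(","))

nov24 = parse_resources(
    "# Resources: cput=35:00:00,neednodes=1:gold6128:ppn=2,"
    "nodes=1:gold6128:ppn=2,walltime=10:00:00"
)
nov03 = parse_resources(
    "# Resources: neednodes=1:gold6128:ppn=2,"
    "nodes=1:gold6128:ppn=2,walltime=05:00:00"
)

for label, res in (("Nov 24", nov24), ("Nov 03", nov03)):
    print(label, "cput:", res.get("cput", "not set"), "walltime:", res["walltime"])
```

Run on the two headers above, it makes the difference explicit: the November 24 job carries a `cput` entry and the November 3 job does not.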
Hi,
Can anyone confirm whether you have now started limiting jobs by CPU time?
Thanks
Hi,
Thank you for posting in Intel Community.
We ran a sample program with a walltime of 12 hours, and it ran for the full 12 hours successfully.
Please find the log below:
########################################################################
# Date: Mon 28 Nov 2022 06:04:52 AM PST
# Job ID: 2060651.v-qsvr-1.aidevcloud
# User: uxxxxx
# Resources: cput=35:00:00,neednodes=1:batch:ppn=2,nodes=1:batch:ppn=2,walltime=12:00:00
########################################################################
......
########################################################################
# End of output for job 2060651.v-qsvr-1.aidevcloud
# Date: Mon 28 Nov 2022 06:05:40 PM PST
########################################################################
If your program finishes before the walltime is reached, the job ends at that point. The maximum walltime you can request on DevCloud is 24 hours.
We are attaching a sample program that runs indefinitely. Could you please try running it and let us know if you face any issues?
Command to run the program: qsub job.sh
Thanks
Sure, this works perfectly, but have you tried running code that does its work in parallel?
There is a difference between wall-clock time and user CPU time.
Could you please try to run the attached program?
Thank you
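To make the distinction concrete, here is a small single-process Python sketch (not the attached program) that measures both clocks with `time.perf_counter` (wall-clock) and `time.process_time` (CPU time of the current process). The helper names `measure`, `idle`, and `busy` are illustrative only.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent while running fn()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def idle():
    # Sleeping advances the wall clock but uses almost no CPU time.
    time.sleep(0.2)

def busy():
    # Spinning for the same wall-clock duration keeps one core busy.
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass

wall_idle, cpu_idle = measure(idle)
wall_busy, cpu_busy = measure(busy)
print(f"idle: wall={wall_idle:.2f}s cpu={cpu_idle:.2f}s")
print(f"busy: wall={wall_busy:.2f}s cpu={cpu_busy:.2f}s")
```

With several busy threads or processes, the CPU times of all of them add up, so the total CPU time of a parallel job can be several times its wall-clock time.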
Hi,
Have you been able to verify this behavior?
And if so, can you tell me if it will be like this from now on?
Thank you.
Hi,
>>Have you been able to verify this behavior?
Yes, we have been able to verify this behavior from our side.
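Intel DevCloud for oneAPI nodes have a CPU time limit of 35 hours (126000 seconds). CPU time is accumulated across all the cores a job keeps busy, so a parallel job can reach this limit well before 35 hours of wall-clock time have passed. This is why the job is being removed from the node. You can see this in the job's error file, as shown below: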
uXXXX@login-2:~$ cat job.sh.eXXXXXX
=>> PBS: job killed: cput 126579 exceeded limit 126000
>>And if so, can you tell me if it will be like this from now on?
Yes, the CPU time is limited to 35 hours.
If you have any further issues, please let us know.
Regards,
Athira
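As a rough illustration of what the 35-hour limit means in practice, the Python sketch below parses the kill message quoted above and estimates how much wall-clock time a job gets before reaching the limit. It assumes an idealized job in which every busy core accrues one CPU-second per wall-clock second; real jobs will differ, and `parse_kill_line` and `wall_hours_until_limit` are hypothetical helper names.

```python
import re

CPUT_LIMIT_S = 35 * 3600  # 35 hours expressed in seconds

def parse_kill_line(line: str):
    """Extract (used_seconds, limit_seconds) from a PBS 'cput ... exceeded limit' line."""
    m = re.search(r"cput (\d+) exceeded limit (\d+)", line)
    return (int(m.group(1)), int(m.group(2))) if m else None

def wall_hours_until_limit(busy_cores: int, limit_s: int = CPUT_LIMIT_S) -> float:
    """Idealized wall-clock hours before busy_cores fully loaded cores use up limit_s."""
    return limit_s / busy_cores / 3600

used, limit = parse_kill_line("=>> PBS: job killed: cput 126579 exceeded limit 126000")
print(f"limit {limit} s, exceeded by {used - limit} s")

for cores in (1, 2, 4):
    print(f"{cores} busy core(s): about {wall_hours_until_limit(cores):.2f} h of wall-clock time")
```

Under this assumption, a job that keeps two cores fully busy would reach the CPU time limit after about 17.5 hours of wall-clock time, regardless of the walltime requested.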
Hi,
Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks