Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

Wrong job elapsed time

FelipeML
New Contributor I
425 Views

Hi,

I have been experiencing some problems with queued jobs for a few days now.
The first thing I noticed was that although I explicitly indicate that I want a walltime of X hours, the qstat information shows 35h:

FelipeML_0-1669368973924.png

The other problem is that the tracked elapsed time is wrong, which causes jobs to be canceled automatically before they finish. For example, the following screenshot shows an execution time of 24h 03min 47s:

FelipeML_1-1669369226687.png

This is wrong: after canceling the job and comparing the start and end timestamps in the output file, only about 4h 03min had actually elapsed:

 

########################################################################
#      Date:           Thu 24 Nov 2022 09:22:07 PM PST
#    Job ID:           2055054.v-qsvr-1.aidevcloud
#      User:           u137524
# Resources:           cput=35:00:00,neednodes=1:gold6128:ppn=2,nodes=1:gold6128:ppn=2,walltime=10:00:00
########################################################################
...
########################################################################
# End of output for job 2055054.v-qsvr-1.aidevcloud
# Date: Fri 25 Nov 2022 01:24:47 AM PST
########################################################################
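
A minimal sketch of the wall-clock calculation, assuming the two `Date:` stamps in the log above are in the same time zone (both say PST). The helper name `parse_stamp` is hypothetical and only serves this illustration.

```python
from datetime import datetime

def parse_stamp(stamp: str) -> datetime:
    """Parse a log date such as 'Thu 24 Nov 2022 09:22:07 PM PST'."""
    # Drop the weekday and the time zone abbreviation: strptime does not
    # parse names like "PST" reliably, and both stamps share one zone here.
    parts = stamp.split()
    return datetime.strptime(" ".join(parts[1:-1]), "%d %b %Y %I:%M:%S %p")

start = parse_stamp("Thu 24 Nov 2022 09:22:07 PM PST")
end = parse_stamp("Fri 25 Nov 2022 01:24:47 AM PST")

elapsed = end - start
print(elapsed)  # 4:02:40
```

So the job ran for roughly four hours of wall-clock time, far less than the 24h 03min 47s reported as execution time.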

 

Has anyone else noticed similar behavior, or is it just me?

Thank you very much for your help!

8 Replies
FelipeML
New Contributor I
411 Views

Reviewing old results, I just saw that the parameter "cput=35:00:00" did not appear in the headers before. For example, here is an execution from November 3:

########################################################################
#      Date:           Thu 03 Nov 2022 01:58:25 AM PDT
#    Job ID:           2024973.v-qsvr-1.aidevcloud
#      User:           u137524
# Resources:           neednodes=1:gold6128:ppn=2,nodes=1:gold6128:ppn=2,walltime=05:00:00
########################################################################
FelipeML
New Contributor I
358 Views

Hi,

Can anyone confirm whether you have now started limiting jobs by CPU time?

Thanks

AthiraM_Intel
Moderator
339 Views

Hi,

 

Thank you for posting in Intel Community.

 

We ran a sample program with a walltime of 12 hours, and it ran for the full 12 hours successfully.

 

Please find the log below:

 

 

########################################################################
#      Date:           Mon 28 Nov 2022 06:04:52 AM PST
#    Job ID:           2060651.v-qsvr-1.aidevcloud
#      User:           uxxxxx
# Resources:           cput=35:00:00,neednodes=1:batch:ppn=2,nodes=1:batch:ppn=2,walltime=12:00:00
########################################################################
......
########################################################################
# End of output for job 2060651.v-qsvr-1.aidevcloud
# Date: Mon 28 Nov 2022 06:05:40 PM PST
########################################################################

 

If your program finishes before the walltime is reached, the job simply ends at that point. The maximum walltime you can request in DevCloud is 24 hours.

 

We are attaching a sample program that runs indefinitely. Could you please try running it and let us know if you face any issues?

 

Command to run the program: qsub job.sh

 

 

Thanks

FelipeML
New Contributor I
330 Views

Hi @AthiraM_Intel 

Sure, this works perfectly, but have you tried running code that does its work in parallel?

Wall-clock time and user CPU time are not the same thing: a job that keeps several cores busy accumulates CPU time several times faster than wall-clock time.

Could you please try running the attached program?
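
To make the difference concrete, here is a rough sketch using the numbers from the first post: the 24h 03min 47s reported as execution time, and the wall-clock time between the header and footer stamps of job 2055054 (4h 02min 40s). It assumes the reported figure is CPU time summed over all busy cores; the helper `to_seconds` is hypothetical.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/min/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

cpu_seconds = to_seconds(24, 3, 47)   # time reported by the scheduler
wall_seconds = to_seconds(4, 2, 40)   # time between the two log stamps

# If the reported figure is summed CPU time, this ratio is the average
# number of cores the job kept busy while it was running.
avg_busy_cores = cpu_seconds / wall_seconds
print(f"{avg_busy_cores:.2f}")  # 5.95
```

Under that assumption, the job kept about six cores busy on average, which would explain why the reported time runs so far ahead of the clock.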

Thank you

FelipeML
New Contributor I
295 Views

Hi,

Have you been able to verify this behavior?

And if so, can you tell me if it will be like this from now on?

Thank you.

AthiraM_Intel
Moderator
287 Views

Hi,


>>Have you been able to verify this behavior?


Yes, we have been able to verify this behavior from our side.


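Nodes in Intel DevCloud for oneAPI have a CPU time (cput) limit of 35 hours (126000 seconds). CPU time is summed over all processes and threads of a job, so a parallel job can reach this limit well before its walltime. That is why the job is being removed from the node. You can see this in the job's error file, as shown below: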


uXXXX@login-2:~$ cat job.sh.eXXXXXX

=>> PBS: job killed: cput 126579 exceeded limit 126000
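
A small sketch of how the two numbers in that message can be read, assuming the error file contains a line in exactly the format shown above. The regular expression is an illustration only, not an official PBS parsing interface.

```python
import re

line = "=>> PBS: job killed: cput 126579 exceeded limit 126000"

# Pull out the CPU seconds used and the configured limit.
match = re.search(r"cput (\d+) exceeded limit (\d+)", line)
used, limit = (int(group) for group in match.groups())

print(limit / 3600)   # 35.0 (the limit in hours)
print(used - limit)   # 579 (seconds over the limit when the job was killed)
```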


>>And if so, can you tell me if it will be like this from now on?


Yes, the CPU time is limited to 35 hours.
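
As a rough planning aid, the sketch below estimates how long a job can run before one of the two limits stops it. It assumes that cput is summed over every busy core and that the load is steady; the function name `effective_runtime_hours` and its parameters are hypothetical.

```python
def effective_runtime_hours(walltime_hours: float,
                            busy_cores: float,
                            cput_limit_hours: float = 35.0) -> float:
    """Wall-clock hours until either the walltime or the cput limit is hit."""
    if busy_cores <= 0:
        raise ValueError("busy_cores must be positive")
    # With N busy cores, CPU time grows N times faster than wall-clock time.
    return min(walltime_hours, cput_limit_hours / busy_cores)

# A serial job with a 12 h walltime is bounded by the walltime.
print(effective_runtime_hours(12.0, 1))             # 12.0

# A job keeping 6 cores busy hits the 35 h cput limit first.
print(round(effective_runtime_hours(10.0, 6), 2))   # 5.83
```

In other words, under these assumptions a heavily parallel job should either finish within 35 / N wall-clock hours or be split into shorter jobs.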


If you have any further issues, please let us know.



Regards,

Athira




FelipeML
New Contributor I
283 Views

Hi @AthiraM_Intel 

No, I just wanted to be sure.

Thanks for your help.

AthiraM_Intel
Moderator
254 Views

Hi,


Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.



Thanks

