Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

Wrong job elapsed time

FelipeML
New Contributor I
1,159 Views

Hi,

I have been experiencing some problems with queued jobs for a few days now.
The first thing I noticed was that although I explicitly indicate that I want a walltime of X hours, the qstat information shows 35h:

FelipeML_0-1669368973924.png

The other problem is that the tracked elapsed time is wrong, which causes jobs to be cancelled automatically before they finish. For example, the following screenshot shows an execution time of 24h 03min 47s:

FelipeML_1-1669369226687.png

This is wrong: after cancelling the job and comparing the system-clock timestamps at the start and end of the job output, only 4h 02min 40s had actually elapsed:

 

########################################################################
#      Date:           Thu 24 Nov 2022 09:22:07 PM PST
#    Job ID:           2055054.v-qsvr-1.aidevcloud
#      User:           u137524
# Resources:           cput=35:00:00,neednodes=1:gold6128:ppn=2,nodes=1:gold6128:ppn=2,walltime=10:00:00
########################################################################
...
########################################################################
# End of output for job 2055054.v-qsvr-1.aidevcloud
# Date: Fri 25 Nov 2022 01:24:47 AM PST
########################################################################
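
The wall-clock time of this job can be read directly from the two timestamps above. The following is a minimal sketch of that check, not part of the original job; the helper name `parse_stamp` is made up for illustration, and it assumes both timestamps are in the same time zone (both say PST) and that the English day and month names match the running locale.

```python
from datetime import datetime

def parse_stamp(stamp: str) -> datetime:
    """Parse a job header/footer date such as 'Thu 24 Nov 2022 09:22:07 PM PST'."""
    # Drop the trailing zone abbreviation; parsing it is platform-dependent,
    # and both stamps here use the same zone, so it cancels out.
    date_part = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(date_part, "%a %d %b %Y %I:%M:%S %p")

start = parse_stamp("Thu 24 Nov 2022 09:22:07 PM PST")
end = parse_stamp("Fri 25 Nov 2022 01:24:47 AM PST")

elapsed = end - start  # wall-clock duration as a timedelta
print(elapsed)
```

This is roughly four hours of wall-clock time, far less than the time reported by qstat.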

 

Did you notice any similar behavior or is it just my problem?

Thank you very much for your help!

FelipeML
New Contributor I
1,145 Views

Reviewing old results, I just noticed that the parameter "cput=35:00:00" did not appear in the job headers before. For example, a run from November 3:

########################################################################
#      Date:           Thu 03 Nov 2022 01:58:25 AM PDT
#    Job ID:           2024973.v-qsvr-1.aidevcloud
#      User:           u137524
# Resources:           neednodes=1:gold6128:ppn=2,nodes=1:gold6128:ppn=2,walltime=05:00:00
########################################################################
FelipeML
New Contributor I
1,092 Views

Hi,

Can anyone confirm whether you have now started limiting jobs by CPU time?

Thanks

AthiraM_Intel
Moderator
1,073 Views

Hi,

 

Thank you for posting in Intel Community.

 

We have tried to run a sample code with walltime 12 hours and it ran for 12 hours successfully.

 

Please find the log below:

########################################################################
#      Date:           Mon 28 Nov 2022 06:04:52 AM PST
#    Job ID:           2060651.v-qsvr-1.aidevcloud
#      User:           uxxxxx
# Resources:           cput=35:00:00,neednodes=1:batch:ppn=2,nodes=1:batch:ppn=2,walltime=12:00:00
########################################################################
......

########################################################################
# End of output for job 2060651.v-qsvr-1.aidevcloud
# Date: Mon 28 Nov 2022 06:05:40 PM PST
########################################################################

If your program finishes before the walltime is reached, the job ends at that point. The maximum walltime you can request on DevCloud is 24 hours.

 

We are attaching a sample program that runs indefinitely. Could you please try running it and let us know if you face any issues?

 

Command to run the program: qsub job.sh

 

 

Thanks

 

 

 

 

 

FelipeML
New Contributor I
1,064 Views

Hi @AthiraM_Intel 

Sure, this works perfectly, but have you tried running code that does work in parallel?

There is a difference between wall-clock time and CPU time: CPU time is summed over all threads, so a parallel job accumulates it faster than wall-clock time passes.

Could you please try running the attached program?

Thank you
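
To make the distinction concrete with the figures from the first post, here is a rough sketch. It assumes that the 24h 03min 47s reported by qstat is accumulated CPU time (the screenshot is not reproduced in the text, so this is an assumption) and takes the wall-clock time from the header and footer timestamps of that job's output. The helper `to_seconds` is a made-up name for illustration.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Assumption: the qstat figure is CPU time summed over all threads of the job.
cpu_seconds = to_seconds(24, 3, 47)

# Wall-clock time between 09:22:07 PM and 01:24:47 AM the next day.
wall_seconds = to_seconds(4, 2, 40)

# Average number of cores kept busy over the lifetime of the job.
avg_busy_cores = cpu_seconds / wall_seconds
print(f"{avg_busy_cores:.2f}")
```

Under that assumption the job kept about six cores busy on average, which would explain why the reported time runs ahead of the clock.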

FelipeML
New Contributor I
1,029 Views

Hi,

Have you been able to verify this behavior?

And if so, can you tell me if it will be like this from now on?

Thank you.

AthiraM_Intel
Moderator
1,021 Views

Hi,


>>Have you been able to verify this behavior?


Yes, we have been able to verify this behavior from our side.


Intel DevCloud for oneAPI nodes have a CPU time (cput) limit of 35 hours (126000 seconds). This is why the job is being removed from the node. You can see this in the job's error file, as shown below:


uXXXX@login-2:~$ cat job.sh.eXXXXXX

=>> PBS: job killed: cput 126579 exceeded limit 126000


>>And if so, can you tell me if it will be like this from now on?


Yes, the CPU time is limited to 35 hours.
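
As an illustration of what this limit means for a parallel job, the sketch below parses the kill message shown above and estimates how much wall-clock time fits inside the limit. The function names are hypothetical, and the estimate assumes every core is fully busy for the whole run, which real jobs rarely are; the six-core figure is only an example.

```python
import re

KILL_LINE = "=>> PBS: job killed: cput 126579 exceeded limit 126000"

def parse_cput_kill(line: str):
    """Return (used_seconds, limit_seconds) from a cput kill message, or None."""
    match = re.search(r"cput (\d+) exceeded limit (\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def hms(total_seconds: int) -> str:
    """Format a number of seconds as h:mm:ss."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

def wall_budget_seconds(limit_seconds: int, busy_cores: int) -> int:
    """Wall-clock seconds available if busy_cores cores are fully busy (assumption)."""
    return limit_seconds // busy_cores

used, limit = parse_cput_kill(KILL_LINE)
print(hms(used), hms(limit))               # CPU time used vs. the limit
print(hms(wall_budget_seconds(limit, 6)))  # example: six fully busy cores
```

With six fully busy cores, the 35-hour CPU budget would be spent in under six hours of wall-clock time, regardless of the requested walltime.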


If you have any further issues, please let us know.



Regards,

Athira




FelipeML
New Contributor I
1,017 Views

Hi @AthiraM_Intel 

No, I just wanted to be sure.

Thanks for your help.

AthiraM_Intel
Moderator
988 Views

Hi,


Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.



Thanks

