Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

Job submitted via qsub getting killed due to limited cput resources

karnikkanojia
Beginner

When I submit my job to the queue with the qsub command, it gets killed after 2 epochs. The reason seems to be the limit on cput (CPU time) resources. I'm trying to train my YOLOv7 model on Intel DevCloud on CPU, using Intel Extension for PyTorch.

 

The submitted job outputs the following details:

########################################################################
#      Date:           Sat 24 Jun 2023 11:52:18 PM PDT
#    Job ID:           2329188.v-qsvr-1.aidevcloud
#      User:           u195874
# Resources:           cput=75:00:00,neednodes=1:icx:ppn=2,nodes=1:icx:ppn=2,walltime=06:00:00
########################################################################

 

The error is as follows:

>> PBS: job killed: cput 272097 exceeded limit 270000

 

Is there any workaround for this limit? I would also appreciate help with setting up distributed training.

4 Replies
AlekhyaV_Intel
Moderator

Hi,

 

Thank you for posting in Intel Communities.

 

Intel DevCloud for oneAPI nodes have a CPU time limit of 75 hours (270000 seconds). The limit is shown in the Resources line of the header that DevCloud prints for each job, as below:

 

########################################################################
#   Date:      Wed 29 June 2023 03:20:37 AM PDT
#  Job ID:      ****359.v-qsvr-1.aidevcloud
#   User:      ******
# Resources:      cput=75:00:00,neednodes=1:batch:ppn=2,nodes=1:batch:ppn=2,walltime=06:00:00
########################################################################

 

This is why you are getting the "PBS: job killed: cput 272097 exceeded limit 270000" error: the job used 272097 seconds of CPU time, which is over the 270000-second limit, so PBS killed it and removed it from the node.
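To make the numbers concrete, here is a small sketch of the arithmetic behind the limit. It assumes, as is usual for PBS/Torque, that cput counts CPU seconds summed over all processes and threads of the job, and that the job ran for less than its 6-hour walltime (it was killed for cput, not for walltime). The helper name hms_to_seconds is made up for illustration.

```shell
#!/usr/bin/env bash
# Convert an hh:mm:ss PBS resource value to seconds.
# (hms_to_seconds is a hypothetical helper, not a PBS command.)
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so fields such as "08" are not parsed as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

cput_limit=$(hms_to_seconds "75:00:00")      # cput from the Resources line
walltime_limit=$(hms_to_seconds "06:00:00")  # walltime from the Resources line
cput_used=272097                             # value from the PBS error message

echo "cput limit:     ${cput_limit} s"
echo "walltime limit: ${walltime_limit} s"
echo "overshoot:      $(( cput_used - cput_limit )) s"
# Lower bound on the average number of busy cores, assuming the job
# ran for at most the full walltime (integer division, rounds down).
echo "min avg cores:  $(( cput_used / walltime_limit ))"
```

If those assumptions hold, the job kept roughly 12 or more cores busy on average even though it requested ppn=2, so the 75 CPU-hours were used up well before 75 hours of wall-clock time. Saving a checkpoint after every epoch and resuming in a new job is one possible way to stay under the limit; treat that as a suggestion, not as official DevCloud guidance.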

 

Regarding distributed training, we are working on this internally and will get back to you soon with an update. Meanwhile, you can request multiple nodes in DevCloud with the command below, replacing <property> with the node property you need (for example icx, as in your job):

qsub -I -l nodes=2:<property>:ppn=2 -d .

 

To list the nodes allocated to you, run the command below from inside the job:

echo $PBS_NODEFILE

This prints the path of a file. Open that file with the cat command to see the allocated nodes.
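The two steps can also be combined. The sketch below assumes the usual PBS/Torque layout for the node file, one line per allocated slot, so a host requested with ppn=2 appears twice. The function name list_nodes is made up for illustration.

```shell
#!/usr/bin/env bash
# Print each allocated host once, followed by the number of slots it provides.
# (list_nodes is a hypothetical helper; only PBS_NODEFILE comes from PBS.)
list_nodes() {
    # Use the file given as argument, else the one PBS exports for the job.
    local nodefile="${1:-${PBS_NODEFILE:-}}"
    if [ -z "$nodefile" ] || [ ! -r "$nodefile" ]; then
        echo "no readable node file (are you inside a PBS job?)" >&2
        return 1
    fi
    # Assumed format: one hostname per line, repeated once per slot.
    sort "$nodefile" | uniq -c | awk '{ print $2, $1 }'
}

# PBS_NODEFILE is only set inside a running job, so guard the call.
if [ -n "${PBS_NODEFILE:-}" ]; then
    list_nodes
fi
```

Inside an interactive job started with nodes=2 and ppn=2, this should print two host names, each followed by 2, if the node file follows the assumed format.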


 

Regards,

Alekhya

 

AlekhyaV_Intel
Moderator

Hi,


We apologize for the delay. Could you please share your YOLOv7 model along with a complete reproducer, so that we can reproduce the issue on our end?


Thanks,

Alekhya


AlekhyaV_Intel
Moderator

Hi,


We have not heard back from you. Could you please give us an update regarding this issue?


Regards,

Alekhya


AlekhyaV_Intel
Moderator

Hi,


We have not heard back from you, so we are closing this thread. If you need any further information, please post a new question, as this thread will no longer be monitored by Intel.


Regards,

Alekhya

