When I submit my job to the queue with the qsub command, it gets killed after 2 epochs. The reason appears to be the cput (CPU time) limit. I'm trying to train my YOLOv7 model on the CPU on Intel DevCloud, using Intel Extension for PyTorch.
The submitted job outputs the following details:
########################################################################
# Date: Sat 24 Jun 2023 11:52:18 PM PDT
# Job ID: 2329188.v-qsvr-1.aidevcloud
# User: u195874
# Resources: cput=75:00:00,neednodes=1:icx:ppn=2,nodes=1:icx:ppn=2,walltime=06:00:00
########################################################################
The error is as follows:
>> PBS: job killed: cput 272097 exceeded limit 270000
Could you please tell me whether any workaround is possible? I would also appreciate help with distributed training.
Hi,
Thank you for posting in Intel Communities.
Intel DevCloud for oneAPI nodes have a CPU time limit of 75 hours (270000 seconds). This limit appears as cput=75:00:00 in the Resources line of the header that DevCloud prints for each job, for example:
########################################################################
# Date: Wed 29 June 2023 03:20:37 AM PDT
# Job ID: ****359.v-qsvr-1.aidevcloud
# User: ******
# Resources: cput=75:00:00,neednodes=1:batch:ppn=2,nodes=1:batch:ppn=2,walltime=06:00:00
########################################################################
This is why you are getting the "PBS: job killed: cput 272097 exceeded limit 270000" error: once the job's CPU time exceeds the limit, PBS kills the job and removes it from the node.
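The numbers in that message follow directly from the cput=75:00:00 entry in the job header. As a small illustration, the conversion and comparison can be checked in Python. The helper name `hms_to_seconds` is made up for this sketch and is not part of PBS or DevCloud.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert a PBS-style HH:MM:SS duration into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the job header and the error message in this thread.
cput_limit = hms_to_seconds("75:00:00")
cput_used = 272097

print(cput_limit)              # 270000
print(cput_used > cput_limit)  # True
print(cput_used - cput_limit)  # 2097
```

As far as we understand the PBS/Torque accounting, cput is CPU time summed over all processes and threads of the job, which would explain how a job with walltime=06:00:00 can still reach a 75-hour cput limit when many cores are busy at once.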
Regarding distributed training, we are working on this internally and will get back to you soon with an update. Meanwhile, you can request multiple nodes in DevCloud with the following command:
qsub -I -l nodes=2:<property>:ppn=2 -d .
To list the nodes allocated to you, use the following command:
echo $PBS_NODEFILE
This prints the path of a file that lists the allocated nodes. You can display its contents with the cat command, for example cat $PBS_NODEFILE.
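If a script needs the node list programmatically, a minimal Python sketch could look like the following. It assumes the usual PBS/Torque nodefile layout of one hostname per line, repeated once per allocated slot. The function name `read_pbs_nodes` is hypothetical and not a DevCloud API.

```python
import os
from collections import Counter


def read_pbs_nodes(nodefile=None):
    """Return {hostname: slot_count} read from a PBS nodefile.

    If no path is given, the path is taken from the PBS_NODEFILE
    environment variable, which PBS sets inside a running job.
    """
    path = nodefile or os.environ.get("PBS_NODEFILE")
    if not path:
        raise RuntimeError("PBS_NODEFILE is not set; run this inside a PBS job")
    with open(path) as handle:
        # Assumed layout: one hostname per line, blank lines ignored.
        hosts = [line.strip() for line in handle if line.strip()]
    return dict(Counter(hosts))
```

Inside an interactive job started with the qsub command above, calling `read_pbs_nodes()` with no argument should return one entry per allocated node, with the slot count reflecting the ppn value, provided the nodefile follows the assumed layout.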
Regards,
Alekhya
Hi,
We apologize for the delay. Could you please share your YOLOv7 model, i.e. the complete reproducer, so that we can reproduce the issue on our end?
Thanks,
Alekhya
Hi,
We have not heard back from you. Could you please give us an update regarding this issue?
Regards,
Alekhya
Hi,
We have not heard back from you, so we will close this thread now. If you need any further information, please post a new question, as this thread will no longer be monitored by Intel.
Regards,
Alekhya