Hello. I am training a model on the COCO dataset. With more than 10,000 images, the estimated training time is more than 100 hours. How can I check whether the training is running on the CPU or the GPU, and how can I see the GPU utilization? Also, I put "#PBS -l walltime=24:00:00" in my run.sh file, but I still can't change the walltime. What should I do? I look forward to your reply. Thank you!
Thank you for reaching out to us.
We don't have GPU nodes on DevCloud, but we do have iGPU nodes. To request one, use the following directive in the job script that you are submitting:
#PBS -l nodes=1:gpu
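To check from inside the job which devices the training can actually use, a minimal sketch like the following may help. It assumes TensorFlow 2.x (for `tf.config.list_physical_devices`), and the helper name `report_tensorflow_devices` is only illustrative. Note that an integrated GPU may not be listed as a GPU device by a stock TensorFlow build, in which case training runs on the CPU even on an iGPU node.

```python
def report_tensorflow_devices():
    """Return (name, device_type) pairs for the devices TensorFlow can see.

    Hypothetical helper for illustration; assumes TensorFlow 2.x if installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment.")
        return []

    # Ask TensorFlow for every physical device it detected (CPU, GPU, ...).
    devices = [(d.name, d.device_type) for d in tf.config.list_physical_devices()]
    for name, device_type in devices:
        print(f"{device_type}: {name}")

    # If no GPU-type device is listed, TensorFlow places the work on the CPU.
    if not any(device_type == "GPU" for _, device_type in devices):
        print("No GPU device visible to TensorFlow; training will run on the CPU.")
    return devices


if __name__ == "__main__":
    report_tensorflow_devices()
```

Calling this at the start of the training script writes the device list to the job's output stream, so it can be read alongside the rest of the job output.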
We are sorry to inform you that 24 hours is the maximum walltime possible on DevCloud.
However, you can try optimizations on the CPU itself to get improved performance.
Please follow the URLs below for more details on optimizing TensorFlow workloads on CPU.
- To submit a job on DevCloud
- Once job is submitted, you can track the job using the below command:
- To read the output and error stream of the executing job, you can use the qpeek command as below:
1. qpeek -o <job_id>
2. qpeek -e <job_id>
Please note that an output and error file will be created once the execution is completed.
Hope this clarifies your query. Please feel free to reach out to us if you have any further queries. Thank You.
Okay, thank you. I have another question. When I use #PBS -l nodes=1:gpu to run on the iGPU node, why does the speed feel about the same as running directly in the Jupyter notebook without choosing CPU or GPU? Also, can I speed up the run by changing the number of nodes? And can I find out more about the speed of the iGPU? Thank you for your reply!
#PBS -l nodes=1:gpu requests an iGPU node, so your code will be running on the iGPU.
You can try to optimize your code and increase the speedup by tweaking the OMP/KMP parameters to improve performance.
Increasing the number of nodes may or may not increase the speed, depending on your application. You could try running in a distributed way and see if that works.
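As a rough illustration of the OMP/KMP tuning mentioned above, the sketch below sets the commonly used Intel OpenMP environment variables from Python. The helper name `apply_cpu_threading_env` and the specific values are assumptions meant as starting points, not recommended settings for any particular node; the best values depend on the model and the core count. The variables need to be set before TensorFlow is imported for them to take effect.

```python
import os


def apply_cpu_threading_env(num_threads=None):
    """Set OpenMP/KMP environment variables for CPU training.

    Hypothetical helper; the values are illustrative starting points only.
    """
    if num_threads is None:
        # Fall back to the number of logical CPUs visible to this process.
        num_threads = os.cpu_count() or 1

    settings = {
        # Number of OpenMP threads used for parallel regions.
        "OMP_NUM_THREADS": str(num_threads),
        # Milliseconds a thread waits after finishing work before sleeping.
        "KMP_BLOCKTIME": "1",
        # Bind threads to cores to limit migration between cores.
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Call this before "import tensorflow" in the training script.
    for key, value in apply_cpu_threading_env().items():
        print(f"{key}={value}")
```

Trying a few values of `num_threads` and `KMP_BLOCKTIME` and comparing the time per training step is one simple way to see whether the settings help for a given model.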
After improvement, the speed is a little faster, but the problem of training time-out still hasn't been solved. I will improve the code again. Thank you. The topic can be closed.