I've been training a tensorflow classification model on DevCloud using: qsub -l nodes=1:gpu:ppn=2 -l walltime=24:00:00
The model is fitted with the argumnets: workers=8, use_multiprocessing=True and also with an early stopping.
I've been following the training process and upon reaching early stopping, the node starts to hang.
I ran the script on a different computational server and in seems to run with no problem.
Any idea what might cause it and how to fix this problem?
Thanks for posting in Intel forums.
We tried your command and we were not able to enter the computational node.Try the below command and let us know the update.
qsub -I -l nodes=1:gpu:ppn=2 -l walltime=24:00:00
We are assuming that the solution provided helped and would no longer be monitoring this issue.Please raise a new thread if you have any further issues.