Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud
832 Discussions

Node hanging when training with multiple when using multiple workers

mirabar
Beginner
280 Views

I've been training a tensorflow classification model on DevCloud using:   qsub -l nodes=1:gpu:ppn=2 -l walltime=24:00:00
The model is fitted with the argumnets: workers=8, use_multiprocessing=True and also with an early stopping. 
I've been following the training process and upon reaching early stopping, the node starts to hang.
I ran the script on a different computational server and in seems to run with no problem. 
Any idea what might cause it and how to fix this problem?

0 Kudos
3 Replies
JananiC_Intel
Moderator
267 Views

Hi,


Thanks for posting in Intel forums.


We tried your command and we were not able to enter the computational node.Try the below command and let us know the update.


qsub -I -l nodes=1:gpu:ppn=2 -l walltime=24:00:00


Regards,

Janani Chandran




JananiC_Intel
Moderator
257 Views

Hi,


Is the issue resolved?Do you have any update?


Regards,

Janani Chnadran


JananiC_Intel
Moderator
242 Views

Hi,


We are assuming that the solution provided helped and would no longer be monitoring this issue.Please raise a new thread if you have any further issues.


Regards,

Janani Chandran


Reply