Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud
1625 Discussions

Node hanging when training with multiple when using multiple workers

mirabar
Beginner
710 Views

I've been training a tensorflow classification model on DevCloud using:   qsub -l nodes=1:gpu:ppn=2 -l walltime=24:00:00
The model is fitted with the argumnets: workers=8, use_multiprocessing=True and also with an early stopping. 
I've been following the training process and upon reaching early stopping, the node starts to hang.
I ran the script on a different computational server and in seems to run with no problem. 
Any idea what might cause it and how to fix this problem?

0 Kudos
3 Replies
JananiC_Intel
Moderator
697 Views

Hi,


Thanks for posting in Intel forums.


We tried your command and we were not able to enter the computational node.Try the below command and let us know the update.


qsub -I -l nodes=1:gpu:ppn=2 -l walltime=24:00:00


Regards,

Janani Chandran




0 Kudos
JananiC_Intel
Moderator
687 Views

Hi,


Is the issue resolved?Do you have any update?


Regards,

Janani Chnadran


0 Kudos
JananiC_Intel
Moderator
672 Views

Hi,


We are assuming that the solution provided helped and would no longer be monitoring this issue.Please raise a new thread if you have any further issues.


Regards,

Janani Chandran


0 Kudos
Reply