Community
cancel
Showing results for 
Search instead for 
Did you mean: 
mirabar
Beginner
111 Views

Node hanging when training with multiple when using multiple workers

I've been training a tensorflow classification model on DevCloud using:   qsub -l nodes=1:gpu:ppn=2 -l walltime=24:00:00
The model is fitted with the argumnets: workers=8, use_multiprocessing=True and also with an early stopping. 
I've been following the training process and upon reaching early stopping, the node starts to hang.
I ran the script on a different computational server and in seems to run with no problem. 
Any idea what might cause it and how to fix this problem?

0 Kudos
3 Replies
JananiC_Intel
Moderator
98 Views

Hi,


Thanks for posting in Intel forums.


We tried your command and we were not able to enter the computational node.Try the below command and let us know the update.


qsub -I -l nodes=1:gpu:ppn=2 -l walltime=24:00:00


Regards,

Janani Chandran




JananiC_Intel
Moderator
88 Views

Hi,


Is the issue resolved?Do you have any update?


Regards,

Janani Chnadran


JananiC_Intel
Moderator
73 Views

Hi,


We are assuming that the solution provided helped and would no longer be monitoring this issue.Please raise a new thread if you have any further issues.


Regards,

Janani Chandran


Reply