Hi,
I need to train a Caffe model in DevCloud. I expect training on our data to take more than 24 hours, and my Jupyter session lasts only 4 hours. So I tried training from a Linux SSH terminal by submitting the job in batch mode with qsub. With a walltime of 24 hours the training proceeds in DevCloud, but I cannot get any snapshot of the trained Caffe model. I do get a snapshot when the model finishes within 4 hours, but beyond 4 hours I get no trained model. Please check the job script I have attached. Could you please help?
- How can I get a Caffe model snapshot saved when I submit the job in batch qsub mode from the login node in a Linux SSH terminal and the training needs more than 24 hours?
- Also, how can I move to interactive mode from the login node in a Linux SSH terminal with a walltime of 24 hours?
- How can I switch from the login node to a compute node in a Linux SSH terminal?
- Tags:
- PBS
18 Replies
					
Hi Gina,
Thank you for reaching out to us.
In the script you shared, the walltime is set to 20 minutes (#PBS -l walltime=00:20:00). Kindly change it to 24 hours (#PBS -l walltime=24:00:00).
Please find our suggestions for your queries below.
1. Could you please check the snapshot value given in solver.prototxt? This value is the number of training iterations between saved snapshots, so if it is too high, the first snapshot may not be reached within 24 hours and no model will be saved. If a large value was given, try a smaller one.
2. You can create a job with the walltime set to 24 hours. When you submit a job with qsub from the login node, the scheduler checks the availability of compute nodes and runs the job on a compute node.
3. To switch from the login node to a compute node, use the command below.
	qsub -I
	
Please feel free to get back to us in case of any further issues.
Regards,
Surya
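As an aside on suggestion 1, here is a minimal sketch of how the snapshot settings could be checked before submitting the job. It assumes a standard Caffe solver.prototxt with one "key: value" field per line. The helper names solver_field and snapshot_summary are made up for illustration and are not part of Caffe or PBS.

```shell
#!/bin/sh
# Print the value of a top-level "key: value" field from a Caffe solver file.
# Usage: solver_field KEY FILE   (hypothetical helper, not a Caffe tool)
solver_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/#.*/, "", line)                  # drop trailing comments
            split(line, kv, ":")
            k = kv[1]
            gsub(/[ \t]/, "", k)                  # strip whitespace around the key
            if (k == key) {
                v = substr(line, index(line, ":") + 1)
                gsub(/^[ \t"]+|[ \t"]+$/, "", v)  # trim spaces and quotes
                print v
                exit
            }
        }
    ' "$2"
}

# Summarise the snapshot-related settings of a solver file.
# Usage: snapshot_summary FILE
snapshot_summary() {
    every=$(solver_field snapshot "$1")
    max=$(solver_field max_iter "$1")
    prefix=$(solver_field snapshot_prefix "$1")
    if [ -z "$every" ] || [ "$every" -eq 0 ]; then
        echo "no periodic snapshots configured"
        return 0
    fi
    echo "snapshot every $every iterations, max_iter $max, prefix $prefix"
}
```

Running snapshot_summary on the solver file prints the snapshot interval, max_iter and snapshot_prefix in one line. A relative snapshot_prefix is resolved against the working directory of the caffe process, so it may also be worth checking which directory the batch job actually runs in.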
					
Hi Surya,
I tried setting the walltime to 24 hours and also to 20 minutes, and I changed the snapshot value to 10. Still no caffemodel is written when the job is submitted in batch qsub mode. Regarding (3) in your answer, I understand that qsub -I provides a default walltime of 6 hours. Is it possible to extend it to 24 hours in interactive mode?
Regards,
Gina
					
Hi Gina,
Since the Caffe model is still not being saved, could you please try the two suggestions below?
1. Use the command below to extend the walltime to 24 hours in interactive mode.
	qsub -I -l walltime=24:00:00
2. Check the status of the submitted job as follows.
a) Use the qstat command to get the job ID. Then obtain the execution node of the job using the command below.
	qstat -xf <job_id>
b) SSH to the compute node named in the exec_host field. For example, for <exec_host>c009-n034/0-1</exec_host>:
	ssh c009-n034
c) Use top to check whether the Python program is still running.
If you still face the same issue, kindly share screenshots of the logs.
Regards,
Surya
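Steps a) and b) above can be sketched as a small filter that pulls the host names out of the qstat XML. This is only an illustration: it assumes the exec_host element sits on a single line, as in the output format shown in this thread, and the function name exec_hosts is hypothetical.

```shell
#!/bin/sh
# Read `qstat -xf <job_id>` XML on stdin and print each execution host once.
# Hosts inside <exec_host> are joined with "+" and carry a "/<cores>" suffix,
# e.g. "c009-n034/0-1"; both are stripped here.
exec_hosts() {
    sed -n 's:.*<exec_host>\(.*\)</exec_host>.*:\1:p' |
        tr '+' '\n' |
        sed 's:/.*$::' |
        sort -u
}
```

It would be used as "qstat -xf <job_id> | exec_hosts", after which you can ssh to any of the printed hosts and run top there.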
					
Hi Gina,
Could you please confirm whether the solution provided worked for you?
Regards,
Surya
					
Hi Surya,
1. In batch mode, the Caffe model snapshot is still not saved. The job ID for the model training task is 16453. I am adding the log output you asked for, and I have attached screenshots of the output of the top and qstat -f <JOB ID> commands. Could you please check?
[u22845@c002-n006 ~]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
16444.c002                 launch           u22845          00:00:00 R batch
16453.c002                 launch           u22845          01:10:43 R batch
16455.c002                 ...ub-singleuser u22845          00:00:47 R jupyterhub
[u22845@c002-n006 ~]$ qstat -xf 16453
<Data><Job><Job_Id>16453.c002</Job_Id><Job_Name>launch</Job_Name><Job_Owner>
u22845@c002.colfaxresearch.com
</Job_Owner><resources_used><cput>01:13:42</cput><energy_used>0</energy_used><mem>107140kb</mem><vmem>6412032kb</vmem><walltime>01:13:55</walltime></resources_used><job_state>R</job_state><queue>batch</queue><server>c002</server><Checkpoint>u</Checkpoint><ctime>1547724012</ctime><Error_Path>c002:/home/u22845/launch.e16453</Error_Path><exec_host>c002-n014/0+c002-n015/0+c002-n016/0+c002-n017/0+c002-n020/0+c002-n025/0+c002-n026/0+c002-n027/0</exec_host><Hold_Types>n</Hold_Types><Join_Path>n</Join_Path><Keep_Files>n</Keep_Files><Mail_Points>n</Mail_Points><mtime>1547724013</mtime><Output_Path>c002:/home/u22845/launch.o16453</Output_Path><Priority>0</Priority><qtime>1547724012</qtime><Rerunable>True</Rerunable><Resource_List><nodect>8</nodect><nodes>8:skl</nodes><walltime>24:00:00</walltime></Resource_List><session_id>347855</session_id><Variable_List>PBS_O_QUEUE=batch,PBS_O_HOME=/home/u22845,PBS_O_LOGNAME=u22845,PBS_O_PATH=/home/u22845/.conda/envs/caffe_build_py27/bin:/glob/intel-python/python3/bin/:/glob/intel-python/python2/bin/:/glob/development-tools/versions/intel-parallel-studio-2018-update3/compilers_and_libraries_2018.3.222/linux/bin/intel64:/glob/development-tools/versions/intel-parallel-studio-2018-update3/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/u22845/.local/bin:/home/u22845/bin,PBS_O_MAIL=/var/spool/mail/u22845,PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_IN,PBS_O_SUBMIT_FILTER=/usr/local/sbin/torque_submitfilter,PBS_O_WORKDIR=/home/u22845,PBS_O_HOST=c002,PBS_O_SERVER=c002</Variable_List><euser>u22845</euser><egroup>u22845</egroup><queue_type>E</queue_type><etime>1547724012</etime><submit_args>launch</submit_args><start_time>1547724013</start_time><Walltime><Remaining>81913</Remaining></Walltime><start_count>1</start_count><fault_tolerant>False</fault_tolerant><job_radix>0</job_radix><submit_host>c002</submit_host></Job></Data>
2. In interactive mode, I got a walltime of 24 hours by following your suggestion. But how can I use it for multinode training? In the script file I had given #PBS -l nodes=8:skl. How can I use the same setting in interactive mode?
					
Glad to know that you are getting a walltime of 24 hours in interactive mode. Kindly use the command below in interactive mode to do multinode training. Note that the resources given to -l are separated by a comma.
	qsub -I -l nodes=8:skl:ppn=2,walltime=24:00:00
If you get the error 'qsub: submit error (Job exceeds queue resource limits MSG=job violates queue/server max resource limits)', the request exceeds what the queue allows, for example by asking for too many nodes.
Kindly reduce the number of nodes and try again.
					
Hi Surya,
OK, thank you. I tried to run Caffe training on multiple nodes, but I get the error shown in the screenshot. Could you please check?
					
Thank you for the reply. We are not able to find the screenshot mentioned in your previous response. Kindly provide more information so that we can help you with the issue.
					
Hi Surya,
Adding the screenshot below.
[image: error_cluster.png]
					
We are sorry to let you know that we are still not able to view the screenshot. Kindly make sure that the file is attached.
					
Hi Surya,
Please find the error I am getting below. I am also attaching its screenshot. Could you please check?
[image: error_cluster.png]
					
We are still not able to view the error logs or the attached screenshot. However, we are able to view the script file shared in the first post. Could you please share the error logs in the same way you attached the script file?
					
Hi Surya,
But the .jpg file was attached in the same way as the script file. I am attaching the file again for your reference and adding the error log below.
[0] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[1] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[2] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[3] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[mpiexec@c002-n010] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@c002-n010] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
[5] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): [5] execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[7] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[4] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[6] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
regards,
Gina Mathew
					
Hi Surya,
Were you able to check?
					
Regarding the error, we are facing a few issues in reproducing it, so it might take a day or two.
We will keep you posted on updates.
We note that the following thread (https://forums.intel.com/s/question/0D50P00004C5fb2SAB/training-optimisation-in-caffe?language=en_US) seems to be a duplicate of this thread. The error seems to be the same. Can we go ahead and close the other thread?
					
We are extremely sorry for the delay. Communication between nodes runs into additional problems when the conda environment is used. Please try the caffe binary available in /glob/intel-python/python2/bin/ instead.
We have verified multinode training using a CIFAR10 example, and it is working fine.
Add or change the following line in your '~/.bash_profile' script.
	export PATH=/glob/intel-python/python2/bin/:$PATH
Run the following for the changes to take effect:
	source ~/.bash_profile
	
Kindly submit the script as a job to get the 24-hour walltime.
Below is an example job script for multinode training in DevCloud.
	#PBS -l walltime=24:00:00
	#PBS -l nodes=8:skl
	# Run from the directory the job was submitted from
	cd $PBS_O_WORKDIR
	# Start 8 MPI ranks on the allocated nodes listed in $PBS_NODEFILE
	mpirun --machinefile $PBS_NODEFILE -n 8 caffe train --solver=./examples/cifar10/cifar10_full_solver.prototxt
Kindly get back to us in case of any queries.
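Before launching mpirun, it may help to confirm that the PATH change took effect and that the job really received the nodes it asked for. The following is a minimal sketch under those assumptions. The helper names count_nodes and caffe_from are hypothetical, and the node file is assumed to list one host name per line (possibly repeated).

```shell
#!/bin/sh
# Count the distinct hosts listed in a PBS node file.
# Usage: count_nodes FILE   (hypothetical helper)
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Succeed only if the first `caffe` found on PATH lives under directory DIR.
# Usage: caffe_from DIR     (hypothetical helper)
caffe_from() {
    found=$(command -v caffe) || return 1
    case "$found" in
        "${1%/}"/*) return 0 ;;   # tolerate a trailing slash on DIR
        *) return 1 ;;
    esac
}
```

Inside the job script this could be used as, for example, caffe_from /glob/intel-python/python2/bin followed by count_nodes "$PBS_NODEFILE", stopping early if either check gives an unexpected result.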
					
Could you please confirm whether the solution provided worked for you?
					
Since we have not received a response from you, we are closing this thread.
Kindly open a new thread if you face further issues.