Training caffe model in DevCloud

GMath7 · ‎01-16-2019

Hi,

I need to train a Caffe model in DevCloud. I expect training on our data to take more than 24 hours, and my Jupyter session is limited to 4 hours. So I tried training from a Linux SSH terminal by submitting the job in batch mode with qsub. Although training proceeds in DevCloud with a walltime of 24 hours, I am not able to get any snapshot of the trained Caffe model. I do get a snapshot when training completes within 4 hours, but beyond 4 hours I do not get a trained model. Please check the job script I have attached. Could you please help?

How can I get a Caffe model snapshot saved when I submit the job in batch mode with qsub from the login node in a Linux SSH terminal and need to train for more than 24 hours?
Also, how can I move to interactive mode from the login node in a Linux SSH terminal with a walltime of 24 hours?
How can I switch from the login node to a compute node in a Linux SSH terminal?

Surya_R_Intel · ‎01-16-2019

Hi Gina,
Thank you for reaching out to us. In the script you shared, the walltime is set to 20 minutes (#PBS -l walltime=00:20:00). Kindly change it to 24 hours (#PBS -l walltime=24:00:00).
Please find our suggestions for your queries below.
1. Please check the snapshot value given in solver.prototxt. If the value is high, the first snapshot may not be reached within 24 hours, so the model may not be saved at all. If a large value was given, try a smaller one.
2. You can create a job with the walltime set to 24 hours. When you submit a job with qsub from the login node, the scheduler checks the availability of compute nodes and runs the job on a compute node.
3. To move from the login node to a compute node, use the command below.
qsub -I
Please feel free to get back to us in case of any further issues.
Regards,
Surya
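The snapshot check in point 1 can be scripted before a long job is submitted. The following is a minimal Python sketch, assuming the solver file uses the standard Caffe solver fields snapshot, max_iter and snapshot_prefix. The function names and the sample solver text are hypothetical and only illustrate the idea; they are not taken from the job script in this thread.

```python
import re


def read_solver_fields(solver_text):
    """Collect simple `key: value` pairs from solver.prototxt text."""
    fields = {}
    for line in solver_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r'^(\w+)\s*:\s*"?([^"]*)"?$', line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def snapshot_problems(solver_text):
    """Return a list of human-readable warnings about snapshot settings."""
    fields = read_solver_fields(solver_text)
    problems = []
    snapshot = int(fields.get("snapshot", 0))
    max_iter = int(fields.get("max_iter", 0))
    if snapshot == 0:
        problems.append("snapshot is 0 or missing: no intermediate snapshots")
    elif max_iter and snapshot > max_iter:
        problems.append("snapshot interval exceeds max_iter")
    if "snapshot_prefix" not in fields:
        problems.append("snapshot_prefix is not set")
    return problems


# Hypothetical solver text, for illustration only.
SAMPLE_SOLVER = """
max_iter: 1000
snapshot: 5000            # larger than max_iter
snapshot_prefix: "out/model"
"""

for problem in snapshot_problems(SAMPLE_SOLVER):
    print("warning:", problem)
```

An empty list means none of these three checks fired. It does not prove that snapshots will be written, only that the interval is reachable and a prefix is set.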

GMath7 · ‎01-16-2019

Hi Surya,
I tried walltimes of both 24 hours and 20 minutes, and also changed the snapshot value to 10. Still, no caffemodel is written when the job is submitted in batch mode with qsub.
Regarding (3) in your answer, I understand that qsub -I provides a default walltime of 6 hours. Is it possible to extend it to 24 hours in interactive mode?
Regards,
Gina

Surya_R_Intel · ‎01-16-2019

Hi Gina,
Since you say the Caffe model is not getting saved, could you please try the two suggestions below.
1. Use the command below to extend the walltime to 24 hours in interactive mode.
qsub -I -l walltime=24:00:00
2. Check the status of the submitted job using the procedure below.
a) Use the qstat command to get the job ID. Then obtain the execution node of the job using the command below.
qstat -xf <job_id>
b) ssh to the compute node listed in the exec_host field. For example, if the output contains <exec_host>c009-n034/0-1</exec_host>, run:
ssh c009-n034
c) Use top to see whether the Python program is still running.
If you still face the same issue, kindly share screenshots of the logs.
Regards,
Surya
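Reading the exec_host field out of the qstat -xf output by eye is error-prone when a job spans several nodes. The following is a minimal Python sketch that extracts the node names and the remaining walltime from that XML, assuming the output has the Data/Job layout with exec_host and Walltime/Remaining elements. The function name summarize_job and the shortened sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_job(qstat_xml):
    """Extract job id, state, node names and remaining walltime from XML."""
    job = ET.fromstring(qstat_xml).find("Job")
    exec_host = job.findtext("exec_host", default="")
    nodes = []
    # exec_host looks like "c009-n034/0-1" or "nodeA/0+nodeB/0".
    for entry in exec_host.split("+"):
        name = entry.split("/", 1)[0]
        if name and name not in nodes:
            nodes.append(name)
    remaining = job.findtext("Walltime/Remaining")
    return {
        "job_id": job.findtext("Job_Id"),
        "state": job.findtext("job_state"),
        "nodes": nodes,
        "remaining_seconds": int(remaining) if remaining else None,
    }


# Shortened, hypothetical sample in the same layout as `qstat -xf` output.
SAMPLE_XML = (
    "<Data><Job><Job_Id>1234.c009</Job_Id><job_state>R</job_state>"
    "<exec_host>c009-n034/0-1</exec_host>"
    "<Walltime><Remaining>3600</Remaining></Walltime></Job></Data>"
)

info = summarize_job(SAMPLE_XML)
print(info["job_id"], info["state"], info["nodes"], info["remaining_seconds"])
```

On DevCloud the XML text would come from running qstat -xf <job_id>, for instance captured with the subprocess module. That step is left out here so the sketch runs without a scheduler.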

Surya_R_Intel · ‎01-17-2019

Hi Gina,
Could you please confirm whether the solution provided worked for you?
Regards,
Surya

GMath7 · ‎01-17-2019

Hi Surya,
1. In batch mode it is not saving the Caffe model snapshot. The job ID of the model training task is 16453. I am adding the log output you asked me to send. I have also attached a screenshot of the output of the top command and the qstat -f <JOB ID> command. Could you please check?
[u22845@c002-n006 ~]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
16444.c002                launch           u22845          00:00:00 R batch
16453.c002                launch           u22845          01:10:43 R batch
16455.c002                ...ub-singleuser u22845          00:00:47 R jupyterhub
[u22845@c002-n006 ~]$ qstat -xf 16453
<Data><Job><Job_Id>16453.c002</Job_Id><Job_Name>launch</Job_Name><Job_Owner> [email protected] </Job_Owner><resources_used><cput>01:13:42</cput><energy_used>0</energy_used><mem>107140kb</mem><vmem>6412032kb</vmem><walltime>01:13:55</walltime></resources_used><job_state>R</job_state><queue>batch</queue><server>c002</server><Checkpoint>u</Checkpoint><ctime>1547724012</ctime><Error_Path>c002:/home/u22845/launch.e16453</Error_Path><exec_host>c002-n014/0+c002-n015/0+c002-n016/0+c002-n017/0+c002-n020/0+c002-n025/0+c002-n026/0+c002-n027/0</exec_host><Hold_Types>n</Hold_Types><Join_Path>n</Join_Path><Keep_Files>n</Keep_Files><Mail_Points>n</Mail_Points><mtime>1547724013</mtime><Output_Path>c002:/home/u22845/launch.o16453</Output_Path><Priority>0</Priority><qtime>1547724012</qtime><Rerunable>True</Rerunable><Resource_List><nodect>8</nodect><nodes>8:skl</nodes><walltime>24:00:00</walltime></Resource_List><session_id>347855</session_id><Variable_List>PBS_O_QUEUE=batch,PBS_O_HOME=/home/u22845,PBS_O_LOGNAME=u22845,PBS_O_PATH=/home/u22845/.conda/envs/caffe_build_py27/bin:/glob/intel-python/python3/bin/:/glob/intel-python/python2/bin/:/glob/development-tools/versions/intel-parallel-studio-2018-update3/compilers_and_libraries_2018.3.222/linux/bin/intel64:/glob/development-tools/versions/intel-parallel-studio-2018-update3/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/u22845/.local/bin:/home/u22845/bin,PBS_O_MAIL=/var/spool/mail/u22845,PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_IN,PBS_O_SUBMIT_FILTER=/usr/local/sbin/torque_submitfilter,PBS_O_WORKDIR=/home/u22845,PBS_O_HOST=c002,PBS_O_SERVER=c002</Variable_List><euser>u22845</euser><egroup>u22845</egroup><queue_type>E</queue_type><etime>1547724012</etime><submit_args>launch</submit_args><start_time>1547724013</start_time><Walltime><Remaining>81913</Remaining></Walltime><start_count>1</start_count><fault_tolerant>False</fault_tolerant><job_radix>0</job_radix><submit_host>c002</submit_host></Job></Data>
2. In interactive mode I could get a walltime of 24 hours by following your suggestion. But how can I use it for multinode training? In the script file I had given #PBS -l nodes=8:skl. How can I request the same in interactive mode?

Surya_R_Intel · ‎01-18-2019

Glad to know that you are getting a walltime of 24 hours in interactive mode. Kindly use the command below in interactive mode to do multinode training.
qsub -I -l nodes=8:skl:ppn=2,walltime=24:00:00
If you get the error 'qsub: submit error (Job exceeds queue resource limits MSG=job violates queue/server max resource limits)', it means the specified number of nodes is not available. Kindly reduce the number of nodes and try again.

GMath7 · ‎01-18-2019

Hi Surya,
OK, thank you. I tried to run Caffe training on multiple nodes, but I get the error shown in the screenshot. Could you please check?

Surya_R_Intel · ‎01-21-2019

Thank you for the reply. We are not able to find the screenshot as mentioned in the previous response. Kindly provide us more information to help you on the issue.

GMath7 · ‎01-22-2019

Hi Surya,
Adding the screenshot below.
[image: error_cluster.png]

Surya_R_Intel · ‎01-22-2019

Sorry to let you know that we are still not able to view the screenshot. Kindly make sure that you are attaching the file.

GMath7 · ‎01-22-2019

Hi Surya,
Please find the error below. I am also attaching its screenshot. Could you please check?
[image: error_cluster.png]

Surya_R_Intel · ‎01-22-2019

We are still not able to view the error logs or the attached screenshot. However, we are able to view the script file shared in the first post. Could you please share the error logs in the same way you attached the script file?

GMath7 · ‎01-22-2019

Hi Surya,
But the .jpg file was attached in the same way as the script file. I am attaching the file again for your reference. The error log is below.
[0] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[1] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[2] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[3] [proxy:1:0@c002-n010] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[mpiexec@c002-n010] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@c002-n010] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
[5] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): [5] execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[7] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[4] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
[6] [proxy:1:1@c002-n011] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
Regards,
Gina Mathew
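The log above repeats one failure many times, so it can help to reduce it to "which executable could not be started on which node". The following is a minimal Python sketch that does this with a regular expression, assuming the log lines keep the "[proxy:...@node] ... execvp error on file PATH (No such file or directory)" shape seen in this post. The function name missing_executables is hypothetical.

```python
import re
from collections import defaultdict

# Matches the node name after "@" and the path that execvp failed on.
EXECVP_ERROR = re.compile(
    r"\[proxy:\d+:\d+@([\w-]+)\].*?execvp error on file (\S+) "
    r"\(No such file or directory\)"
)


def missing_executables(log_text):
    """Map each path that could not be executed to the nodes reporting it."""
    nodes_by_path = defaultdict(set)
    for node, path in EXECVP_ERROR.findall(log_text):
        nodes_by_path[path].add(node)
    return {path: sorted(nodes) for path, nodes in nodes_by_path.items()}


# Two lines copied from the log in this post.
SAMPLE_LOG = (
    "[0] [proxy:1:0@c002-n010] HYDU_create_process "
    "(../../utils/launch/launch.c:825): execvp error on file "
    "FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)\n"
    "[5] [proxy:1:1@c002-n011] HYDU_create_process "
    "(../../utils/launch/launch.c:825): [5] execvp error on file "
    "FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)\n"
)

for path, nodes in missing_executables(SAMPLE_LOG).items():
    kind = "absolute" if path.startswith("/") else "relative"
    print(path, "(" + kind + " path) on", ", ".join(nodes))
```

For this log the summary is a single relative path reported by both nodes. The path starts with the literal text FOUNDED_MLSL_ROOT rather than a directory, which may point to a variable that was never substituted in the Caffe build, although the log alone does not confirm that.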

GMath7 · ‎01-23-2019

Hi Surya,
Were you able to check?

Surya_R_Intel · ‎01-23-2019

Regarding the error, we are having some difficulty reproducing it, so it might take a day or two. We will keep you posted. We note that the following thread (https://forums.intel.com/s/question/0D50P00004C5fb2SAB/training-optimisation-in-caffe?language=en_US) seems to be a duplicate of this one. The error seems to be the same. Can we go ahead and close the other thread?

Surya_R_Intel · ‎01-29-2019

Extremely sorry for the delay. Communication between nodes runs into additional problems when a conda environment is used. Please try using the caffe available in /glob/intel-python/python2/bin/. We have verified multinode training using a CIFAR10 example and it is working fine.
Add or change the following line in the '~/.bash_profile' script.
export PATH=/glob/intel-python/python2/bin/:$PATH
Run the following for the changes to take effect:
source ~/.bash_profile
Kindly submit the script as a job to get a 24-hour walltime. Below is an example job script for multinode training in DevCloud.
#PBS -l walltime=24:00:00
#PBS -l nodes=8:skl
cd $PBS_O_WORKDIR
mpirun --machinefile $PBS_NODEFILE -n 8 caffe train --solver=./examples/cifar10/cifar10_full_solver.prototxt
Kindly get back to us in case of any queries.
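When the number of nodes changes, the nodes request and the mpirun -n value in the example job script have to change together. The following is a minimal Python sketch that generates the same four-line script from a few parameters so the two values stay in sync. The function name build_job_script, its parameter names, and the output file name in the usage note are hypothetical; the generated lines mirror the example in this post.

```python
def build_job_script(solver, nodes=8, node_type="skl", walltime="24:00:00"):
    """Return PBS job script text for multinode `caffe train` via mpirun."""
    lines = [
        "#PBS -l walltime={}".format(walltime),
        "#PBS -l nodes={}:{}".format(nodes, node_type),
        # Run from the directory the job was submitted from.
        "cd $PBS_O_WORKDIR",
        # One MPI rank per requested node, as in the example in this post.
        "mpirun --machinefile $PBS_NODEFILE -n {} caffe train --solver={}".format(
            nodes, solver
        ),
    ]
    return "\n".join(lines) + "\n"


script = build_job_script("./examples/cifar10/cifar10_full_solver.prototxt")
print(script, end="")
```

The text can be written to a file, for example launch_multinode, and submitted with qsub launch_multinode from the login node. If the queue rejects the request, calling build_job_script with a smaller nodes value adjusts both lines at once.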

Surya_R_Intel · ‎01-31-2019

Could you please confirm whether the solution provided worked for you?

Surya_R_Intel · ‎02-01-2019

Since we didn't get any response from your end, we are closing this thread. Kindly open a new thread if you face further issues.