Re: Training Optimisation in Caffe

GMath7 · ‎01-16-2019

Hi,

Is it enough to add the following lines to my job script for faster, optimised training in Caffe?

#PBS -l nodes=8:skl
cd $PBS_O_WORKDIR
cd caffe
echo Training Started...
TOOLS=./build/tools
export OMP_NUM_THREADS=NUM_PARALLEL_EXEC_UNITS
export KMP_BLOCKTIME="0"
export KMP_SETTINGS="1"
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
$TOOLS/caffe train --solver=/home/u22845/solver.prototxt -weights /home/u22845/PretrainedModel/20170127-121555-da41_epoch_30.0/snapshot_iter_63540.caffemodel $@
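In the script above, OMP_NUM_THREADS is set to the literal text NUM_PARALLEL_EXEC_UNITS, which is a placeholder rather than a number. A minimal sketch of one way to derive a numeric value follows, assuming the available CPUs are split evenly between the MPI processes on a node. The helper name threads_per_rank is hypothetical, and PPN=2 simply mirrors the "-ppn 2" in the mpirun line.

```shell
#!/bin/bash
# Sketch: derive a numeric OMP_NUM_THREADS instead of a literal placeholder.
# threads_per_rank is a hypothetical helper, not part of Caffe or Intel MPI.

# Divide the CPUs on a node evenly between the MPI processes on that node.
threads_per_rank() {
    local cpus="$1" ppn="$2"
    local threads=$(( cpus / ppn ))
    # Never go below one thread per process.
    [ "$threads" -ge 1 ] || threads=1
    echo "$threads"
}

PPN=2   # assumption: matches "-ppn 2" in the mpirun line
# nproc reports the logical CPUs available to this process (GNU coreutils).
# It may count hyper-threads, so a smaller value can be preferable.
CPUS=$(nproc 2>/dev/null || echo 1)
export OMP_NUM_THREADS=$(threads_per_rank "$CPUS" "$PPN")
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Whether logical or physical cores give the better value depends on the hardware, so the result is a starting point to tune rather than a recommendation.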

Ansif_M_Intel · ‎01-16-2019

Hi Gina, Thank you for reaching out to us. From the details provided, we can infer that your script includes some hyperparameter tuning. Optimization depends on the topology, batch size, etc. If you can provide more details on these, we will be able to give better insights into the optimization. Regards, Ansif M

GMath7 · ‎01-16-2019

Hi, We used the DetectNet model with batch size 1. Regards, Gina

GMath7 · ‎01-16-2019

Hi Ansif, We used the DetectNet model as described at https://devblogs.nvidia.com/detectnet-deep-neural-network-object-detection-digits/ with batch size 1.

Ansif_M_Intel · ‎01-17-2019

Hi Gina, Even though in the job script given #PBS -l nodes=8:skl, if the code doesn't have the components it may not make use of 8 nodes. In the mpirun command, -n (number of processes) is 2 and -ppn (processes per node) is 2, which implies that both processes run on a single node. This contradicts the first line, which requests 8 nodes. Furthermore, specify the 8 nodes that you wish to allocate. Try setting NUM_PARALLEL_EXEC_UNITS to 12. Kindly check and let us know whether the above changes helped. Regards, Ansif M
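The mismatch described above (8 nodes requested, but -n 2 -ppn 2 placing both processes on one node) can be avoided by computing the process count from the allocation. The following is only a sketch under stated assumptions: the helper name total_ranks is hypothetical, PPN=2 is carried over from the original command, and the mpirun line is printed rather than executed.

```shell
#!/bin/bash
# Sketch: keep the mpirun process count consistent with the nodes PBS allocated.
# total_ranks is a hypothetical helper, not part of PBS or Intel MPI.

# Print (unique nodes in the nodefile) * (processes per node).
total_ranks() {
    local nodefile="$1" ppn="$2"
    local nodes
    # A nodefile may list a host once per slot, so count unique host names.
    nodes=$(sort -u "$nodefile" | wc -l)
    echo $(( nodes * ppn ))
}

PPN=2   # assumption: two processes per node, as in the original command
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NP=$(total_ranks "$PBS_NODEFILE" "$PPN")
    # Printed instead of run, so the sketch has no side effects.
    echo "mpirun -machinefile $PBS_NODEFILE -n $NP -ppn $PPN ./build/tools/caffe train --solver=/home/u22845/solver.prototxt"
fi
```

For an allocation of 8 distinct nodes with PPN=2, total_ranks yields 16, so every allocated node receives two processes.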

GMath7 · ‎01-17-2019

Hi Ansif, Thank you for your reply. But I didn't understand this line: *Even though in the job script given #PBS -l nodes=8:skl, if the code doesn't have the components it may not make use of 8 nodes.* Does "the code" refer to my code? My requirement is to train a Caffe model, and for that I only execute the caffe train command. So what should I change to use all 8 nodes for training?

GMath7 · ‎01-17-2019

Hi Ansif, The Caffe build was Intel Distribution of Caffe, with the following in Makefile.config:

USE_MKLDNN_AS_DEFAULT_ENGINE := 1
USE_MLSL := 1
BLAS := mkl

GMath7 · ‎01-18-2019

Hi Ansif, I tried to train my Caffe model from an SSH terminal in interactive mode, started with the command qsub -I -l nodes=8:skl. In interactive mode I ran the following:

TOOLS=./build/tools
mpirun -machinefile $PBS_NODEFILE -l -n 8 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
$TOOLS/caffe train --solver=/home/u22845/solver.prototxt -engine "MKL2017" -weights /home/u22845/PretrainedModel/20170127-121555-da41_epoch_30.0/snapshot_iter_63540.caffemodel 2>&1 | tee /home/u22845/caffe.log

But I get this error:

execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)

I am also attaching a screenshot of the error. Could you please help? Regards, Gina Mathew

Ansif_M_Intel · ‎01-18-2019

It looks like the DetectNet topology is optimized for GPUs, whereas Intel DevCloud consists of high-performance Xeon CPUs. Hence, the performance of this topology on Xeon servers needs to be checked. An execvp error usually occurs when Caffe is not set up on all the assigned nodes. We are working on recreating the issue at our end. Please share the exact steps required to reproduce it.
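Since the reply above attributes execvp errors to a binary not being available on every assigned node, one simple check is to test whether the expected files exist and are executable on a given host. This is a minimal sketch; the function name missing_executables is hypothetical, and the example path is the caffe binary used in this thread, relative to the caffe directory.

```shell
#!/bin/bash
# Sketch: report which of the given paths are missing or not executable
# on the host where this runs. missing_executables is a hypothetical helper.
missing_executables() {
    local path
    for path in "$@"; do
        # Print only the paths that fail the executable test.
        [ -x "$path" ] || echo "$path"
    done
}

# Example: run from the caffe directory; prints the path only if it is unusable.
missing_executables ./build/tools/caffe
```

Running the same check once on each allocated node, for instance through an mpirun launch with one process per node, would show whether any node lacks the binary. How the nodes are reached is an assumption that depends on the cluster setup.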

GMath7 · ‎01-18-2019

Hi Ansif, OK. *"execvp error usually occurs when caffe is not set up on all the assigned nodes ."* I have installed Caffe in my user account on DevCloud. But how will I know on which node I installed Caffe? Do I need to log in to each node and install it there? Regards, Gina

GMath7 · ‎01-18-2019

Hi Ansif, Each time we log in, a different node is assigned, and we could run the model on that node.

Ansif_M_Intel · ‎01-21-2019

We are trying to recreate your issue at our end and will update you in two days.

GMath7 · ‎01-23-2019

Hi Ansif, Were you able to check how to run Caffe training on multiple nodes?

Ansif_M_Intel · ‎01-24-2019

We are trying to resolve some issues that came up while recreating the error, so it might take another day or two. We will keep you posted on the progress.

Ansif_M_Intel · ‎01-29-2019

Sincere apologies for the delay. We tried multi-node training using the CIFAR-10 dataset, and it worked fine. The error you mentioned is the same as the one in this thread: https://forums.intel.com/s/question/0D50P00004C5QnySAF/training-caffe-model-in-devcloud?language=en_US Could you please refer to that thread and close this one?

Ansif_M_Intel · ‎01-31-2019

Since there has been no reply from your end, we are closing this thread. Please open a new thread if you face further issues.