Training Optimisation in Caffe

GMath7
Beginner
2,431 Views

Hi,

 

Is it enough to add the following lines to my job script for faster, optimised training in Caffe?

 

#PBS -l nodes=8:skl
cd $PBS_O_WORKDIR
cd caffe
echo Training Started...
TOOLS=./build/tools
export OMP_NUM_THREADS=NUM_PARALLEL_EXEC_UNITS
export KMP_BLOCKTIME="0"
export KMP_SETTINGS="1"
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
$TOOLS/caffe train --solver=/home/u22845/solver.prototxt -weights /home/u22845/PretrainedModel/20170127-121555-da41_epoch_30.0/snapshot_iter_63540.caffemodel $@

0 Kudos
15 Replies
Ansif_M_Intel
Employee
1,178 Views
Hi Gina, Thank you for reaching out to us. From the details provided, we can infer that your script includes some hyperparameter tuning. Optimization depends on the topology, batch size, etc. If you can provide more details on these, we will be able to offer better insights on the optimization. Regards, Ansif M
0 Kudos
GMath7
Beginner
1,178 Views
Hi, We used the DetectNet model with batch size 1. Regards, Gina
0 Kudos
GMath7
Beginner
1,178 Views
Hi Ansif, We used the DetectNet model as given in the link https://devblogs.nvidia.com/detectnet-deep-neural-network-object-detection-digits/ with batch size 1.
0 Kudos
Ansif_M_Intel
Employee
1,178 Views
Hi Gina, Even though the job script specifies #PBS -l nodes=8:skl, if the code doesn't have the required components it may not make use of 8 nodes. In the mpirun command, n (number of processes) equals 2 and ppn (processes per node) equals 2, which implies that both processes run on one node. This contradicts the first line of the script. Furthermore, specify the 8 nodes that you wish to allocate. Try setting NUM_PARALLEL_EXEC_UNITS to 12. Kindly check and let us know whether the above changes helped. Regards, Ansif M
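To make the mismatch between the node request and the mpirun counts concrete, here is a minimal sketch of deriving the process counts from the allocation instead of hard-coding them. It assumes a PBS node file with one host name per line (the $PBS_NODEFILE used later in this thread) and the -n and -ppn options as they appear in the original script. The helper names count_nodes and build_mpirun_args are hypothetical, and 12 is simply the thread count suggested in the reply above, not a tuned value.

```shell
#!/bin/bash
# Sketch only: the helper names are hypothetical, not part of PBS, Intel MPI or Caffe.

# Count the distinct hosts listed in a PBS-style node file (one host per line).
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Print mpirun arguments so that -n equals nodes * processes-per-node.
build_mpirun_args() {
    local nodes=$1 ppn=$2
    echo "-n $((nodes * ppn)) -ppn $ppn"
}

# Set the thread count to a number, not to the literal text NUM_PARALLEL_EXEC_UNITS.
export OMP_NUM_THREADS=12

# Inside a job the scheduler provides PBS_NODEFILE; print the command for review
# rather than running it, so the counts can be checked first.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NODES=$(count_nodes "$PBS_NODEFILE")
    echo "mpirun -machinefile $PBS_NODEFILE $(build_mpirun_args "$NODES" 1) <caffe train command>"
fi
```

With 8 allocated nodes and one process per node, build_mpirun_args 8 1 yields -n 8 -ppn 1, so the total process count and the node request agree.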
0 Kudos
GMath7
Beginner
1,178 Views
Hi Ansif, Thank you for your reply. But I didn't understand this line: *Even though in the job script given #PBS -l nodes=8:skl, if the code doesn't have the components it may not make use of 8 nodes.* Is it the code? My requirement is to train a Caffe model, so I execute only the caffe train command. How should I change the script to get the 8 nodes for training?
0 Kudos
GMath7
Beginner
1,178 Views
Hi Ansif, The Caffe build was Intel Distribution of Caffe with the following in Makefile.config:
USE_MKLDNN_AS_DEFAULT_ENGINE := 1
USE_MLSL := 1
BLAS := mkl
0 Kudos
GMath7
Beginner
1,178 Views
Hi Ansif, I tried to train my Caffe model from an SSH terminal in interactive mode, started with the command qsub -I -l nodes=8:skl. In interactive mode I gave the command below:

TOOLS=./build/tools
mpirun -machinefile $PBS_NODEFILE -l -n 8 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
$TOOLS/caffe train --solver=/home/u22845/solver.prototxt -engine "MKL2017" -weights /home/u22845/PretrainedModel/20170127-121555-da41_epoch_30.0/snapshot_iter_63540.caffemodel 2>&1 | tee /home/u22845/caffe.log

But I get the following error: execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory). I am also attaching a screenshot of the error. Could you please help? Regards, Gina Mathew
0 Kudos
Ansif_M_Intel
Employee
1,178 Views
It looks like the DetectNet topology is optimized for GPUs, whereas Intel DevCloud consists of high-performing Xeon CPUs. Hence, the performance of this topology on Xeon servers needs to be checked. The execvp error usually occurs when Caffe is not set up on all the assigned nodes. We are working on recreating the issue at our end. Please share the exact steps required to recreate it.
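One way to narrow down an execvp "No such file or directory" error is to check the reported path before launching the job. The sketch below is an illustration, not a confirmed diagnosis of this thread's error: check_binary is a hypothetical helper that reports whether a path is absolute and points at an executable file. Note that the path in the error message, FOUNDED_MLSL_ROOT/intel64/bin/ep_server, does not start with a slash, so it is resolved relative to the current directory.

```shell
#!/bin/bash
# Sketch only: check_binary is a hypothetical helper, not part of Caffe or MLSL.

# Report whether a path is absolute and points at an executable file.
check_binary() {
    local path=$1
    case "$path" in
        /*) ;;  # absolute path, continue with the existence check
        *)  echo "not absolute: $path"; return 1 ;;
    esac
    if [ -x "$path" ] && [ ! -d "$path" ]; then
        echo "ok: $path"
    else
        echo "missing or not executable: $path"
        return 1
    fi
}

# Example: the path from the error message fails the first check.
check_binary "FOUNDED_MLSL_ROOT/intel64/bin/ep_server" || true

# Assumption: inside a job, the same kind of check could be repeated on every
# assigned host by listing the distinct entries of $PBS_NODEFILE and running
# the check on each one.
```

If the check passes on the login node but fails on a compute node, that points to the set-up differing between nodes, which is the situation described in the reply above.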
0 Kudos
GMath7
Beginner
1,178 Views
Hi Ansif, OK. *"execvp error usually occurs when caffe is not set up on all the assigned nodes ."* I have installed Caffe in my user account in DevCloud. But how will I know on which node I installed Caffe? Do I need to log in to each node and do the installation? Regards, Gina
0 Kudos
GMath7
Beginner
1,178 Views
Hi Ansif, Each time I log in, a different node is acquired, and we could run the model on that node.
0 Kudos
Ansif_M_Intel
Employee
1,178 Views
We are trying to recreate your issue at our end and will update you in two days.
0 Kudos
GMath7
Beginner
1,178 Views
Hi Ansif, Were you able to check how to do Caffe training on multiple nodes?
0 Kudos
Ansif_M_Intel
Employee
1,178 Views
There are some issues that we are trying to resolve while recreating the error, so it might take a day or two. We will keep you posted on the progress.
0 Kudos
Ansif_M_Intel
Employee
1,178 Views
Sincere apologies for the delay. We tried multi-node training using the CIFAR-10 dataset, and it worked fine. The error mentioned is the same as in this thread: https://forums.intel.com/s/question/0D50P00004C5QnySAF/training-caffe-model-in-devcloud?language=en_US Could you please refer to that thread and close this one?
0 Kudos
Ansif_M_Intel
Employee
1,178 Views
Since there has been no reply from your end, we are closing this thread. Please open a new thread if you face further issues.
0 Kudos