Hi,
Is it enough to add the following lines to my job script for faster, optimised training in Caffe?
#PBS -l nodes=8:skl
cd $PBS_O_WORKDIR
cd caffe
echo Training Started...
TOOLS=./build/tools
export OMP_NUM_THREADS=NUM_PARALLEL_EXEC_UNITS
export KMP_BLOCKTIME="0"
export KMP_SETTINGS="1"
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
$TOOLS/caffe train --solver=/home/u22845/solver.prototxt -weights /home/u22845/PretrainedModel/20170127-121555-da41_epoch_30.0/snapshot_iter_63540.caffemodel $@
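One detail in the script above is worth checking: `export OMP_NUM_THREADS=NUM_PARALLEL_EXEC_UNITS` has no `$`, so the shell stores the literal text rather than a number. If `NUM_PARALLEL_EXEC_UNITS` was meant as a placeholder, it needs to be replaced by a number or expanded as a variable. A minimal sketch of the difference, assuming the value 12 that is suggested later in this thread (`LITERAL_VALUE` and `EXPANDED_VALUE` are names made up for illustration):

```shell
#!/bin/sh
# Assumed value; 12 is the number suggested later in this thread.
NUM_PARALLEL_EXEC_UNITS=12

# As written in the job script: no "$", so the literal name is stored.
export OMP_NUM_THREADS=NUM_PARALLEL_EXEC_UNITS
LITERAL_VALUE="$OMP_NUM_THREADS"

# With "$" the shell expands the variable to its numeric value.
export OMP_NUM_THREADS="$NUM_PARALLEL_EXEC_UNITS"
EXPANDED_VALUE="$OMP_NUM_THREADS"

echo "without \$: $LITERAL_VALUE"
echo "with \$:    $EXPANDED_VALUE"
```

Running `echo $OMP_NUM_THREADS` inside the job before the `mpirun` line is a quick way to confirm which of the two the job actually sees.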
Hi Gina,
Thank you for reaching out to us.
From the details provided, we can infer that your script involves some hyperparameter tuning. Optimization depends on the topology, batch size, etc.
If you can provide more details on these, we will be able to offer better insights on the optimization.
Regards,
Ansif M
Hi,
We used the DetectNet model with batch size 1.
Regards,
Gina
Hi Ansif,
We used the DetectNet model as described at this link:
https://devblogs.nvidia.com/detectnet-deep-neural-network-object-detection-digits/
Batch size: 1
Hi Gina,
Even though in the job script given #PBS -l nodes=8:skl, if the code doesn't have the components it may not make use of 8 nodes.
In the mpirun command, -n (number of processes) is 2 and -ppn (processes per node) is 2, which implies that both processes run on one node.
This contradicts the first line, which requests 8 nodes. Furthermore, specify the 8 nodes that you wish to allocate.
Try setting NUM_PARALLEL_EXEC_UNITS to 12.
Kindly check and let us know whether the above changes helped.
Regards,
Ansif.M
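The mismatch described above comes down to arithmetic: the number of nodes actually used is the total process count divided by the processes per node. A minimal sketch, assuming the 8 nodes from the `#PBS` line and the `-ppn 2` from the original command; the variable names are illustrative, and the launch line is only printed, not executed:

```shell
#!/bin/sh
# Values taken from the job script in the question.
NODES=8   # from "#PBS -l nodes=8:skl"
PPN=2     # from "-ppn 2"

# To occupy every requested node, -n must equal nodes * processes per node.
TOTAL_RANKS=$((NODES * PPN))

# For comparison: "-n 2 -ppn 2" places both ranks on a single node.
NODES_USED_ORIGINAL=$((2 / PPN))

# Dry run: print the launch prefix instead of executing it.
# $PBS_NODEFILE is left unexpanded on purpose; PBS sets it inside a job.
LAUNCH_CMD="mpirun -machinefile \$PBS_NODEFILE -l -n $TOTAL_RANKS -ppn $PPN"
echo "$LAUNCH_CMD"
echo "original command used $NODES_USED_ORIGINAL node(s) out of $NODES"
```

Whether two processes per node is the best choice for this model is a separate question; the sketch only shows how to keep `-n`, `-ppn`, and the node request consistent.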
Hi Ansif,
Thank you for your reply, but I didn't understand this line:
*Even though in the job script given #PBS -l nodes=8:skl, if the code doesn't have the components it may not make use of 8 nodes.*
Is it the code? My requirement is to train a Caffe model, so I only execute the caffe train command. How should I change it to use the 8 nodes for training?
Hi Ansif,
The Caffe build was Intel Distribution of Caffe, with the following in Makefile.config:
USE_MKLDNN_AS_DEFAULT_ENGINE := 1
USE_MLSL := 1
BLAS := mkl
Hi Ansif,
I tried to train my Caffe model from an SSH terminal in interactive mode, using the command qsub -I -l nodes=8:skl.
In interactive mode, I ran the following command:
TOOLS=./build/tools
mpirun -machinefile $PBS_NODEFILE -l -n 8 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
$TOOLS/caffe train --solver=/home/u22845/solver.prototxt -engine "MKL2017" -weights /home/u22845/PretrainedModel/20170127-121555-da41_epoch_30.0/snapshot_iter_63540.caffemodel 2>&1 | tee /home/u22845/caffe.log
But I get the following error:
execvp error on file FOUNDED_MLSL_ROOT/intel64/bin/ep_server (No such file or directory)
I am also attaching a screenshot of the error. Could you please help?
Regards,
Gina Mathew
It looks like the DetectNet topology is optimized for GPUs, whereas Intel DevCloud consists of high-performance Xeon CPUs. Hence, the performance of this topology on Xeon servers needs to be checked.
execvp error usually occurs when caffe is not set up on all the assigned nodes.
We are working on recreating the issue on our end. Please share the exact steps required to recreate it.
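The path in the reported error begins with the literal text `FOUNDED_MLSL_ROOT`, which reads like an unresolved placeholder rather than a real directory, so one thing worth checking is whether the `ep_server` binary can be found at all. A minimal sketch of such a check; `check_ep_server` is a hypothetical helper, and the use of an `MLSL_ROOT` environment variable is an assumption about how the MLSL installation is exposed in your environment:

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if <root>/intel64/bin/ep_server
# exists and is executable. The sub-path is copied from the error message.
check_ep_server() {
    root="$1"
    [ -n "$root" ] && [ -x "$root/intel64/bin/ep_server" ]
}

# Assumption: the MLSL install location is available as MLSL_ROOT.
if check_ep_server "${MLSL_ROOT:-}"; then
    echo "ep_server found under $MLSL_ROOT"
else
    echo "ep_server not found; MLSL_ROOT='${MLSL_ROOT:-}'"
fi
```

Running the same check on each assigned node, not just the login node, would show whether the problem is a missing installation or only a missing environment setting.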
Hi Ansif,
OK.
*"execvp error usually occurs when caffe is not set up on all the assigned nodes."*
I have installed Caffe in my user account on DevCloud. But how do I know on which node I installed Caffe? Do I need to log in to each node and install it there?
Regards,
Gina
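One way to approach the question above is to list the distinct hosts that PBS assigned to the job and test for the Caffe binary on each of them. A minimal sketch under stated assumptions: `unique_hosts` is a hypothetical helper, `CAFFE_BIN` defaults to the `./build/tools` path used earlier in this thread, and the per-host check is only printed because how commands are run on remote nodes (for example over ssh) depends on the cluster:

```shell
#!/bin/sh
# Hypothetical helper: print each distinct host in a node file, one per line.
# PBS node files usually repeat a host once per allocated slot.
unique_hosts() {
    sort -u "$1"
}

# Assumed location of the binary, based on TOOLS=./build/tools above.
CAFFE_BIN="${CAFFE_BIN:-./build/tools/caffe}"

# $PBS_NODEFILE is only set inside a PBS job, so guard against its absence.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    for host in $(unique_hosts "$PBS_NODEFILE"); do
        # Dry run: print the check that would be run on each host.
        echo "on $host: test -x $CAFFE_BIN"
    done
else
    echo "PBS_NODEFILE is not set; run this inside a PBS job"
fi
```

If the home directory is shared across the compute nodes, a single installation in the user account would be visible from every host and no per-node installation would be needed; the loop above is a way to confirm that rather than assume it.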
Hi Ansif,
Each time we log in, a different node is assigned, and we could run the model on that node.
We are trying to recreate your issue on our end and will update you in two days.
Hi Ansif,
Were you able to check how to do multi-node Caffe training?
We are trying to resolve some issues that came up while recreating the error, so it might take a day or two.
We will keep you posted on the progress.
Sincere apologies for the delay. We tried multi-node training using the CIFAR-10 dataset, and it worked fine.
The error you mentioned is the same as the one in this thread:
https://forums.intel.com/s/question/0D50P00004C5QnySAF/training-caffe-model-in-devcloud?language=en_US
Could you please refer to that thread and close this one?
Since there has been no reply from your end, we are closing this thread.
Please open a new thread if you face further issues.