Training data with caffe and python in Colfax (terminal): Do you have an example?

CBell1 · ‎09-07-2018

Hello,

I have some doubts about how to define CAFFE_PATH= and the data path I have to write in my Python program in order to train on data in my Colfax home directory.

Do you have an example?

Thank you

idata · ‎09-10-2018

Hi Cosma,

Thanks for reaching out to us.

We can train caffe within the python program. A sample python program is given below.

import caffe
import os

solver = caffe.SGDSolver('/models/intel_optimized_models/ssd/VGGNet/VOC0712/SSD_300x300/solver.prototxt')
solver.solve()

Here we have used the SSD solver.prototxt from Intel Caffe. Please replace it with the path to your own solver.prototxt.

You also need to comment out the following line in solver.prototxt:

type: "SGD"
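For anyone who would rather script that edit than do it by hand, here is a minimal sketch. It assumes the solver file is plain protobuf text, where a leading `#` starts a comment. The function name `comment_out_solver_type` is made up for this example and is not part of Caffe.

```python
import re

# Matches a whole line of the form: type: "SGD" (optionally indented).
_SGD_LINE = re.compile(r'^([ \t]*)(type[ \t]*:[ \t]*"SGD"[ \t]*)$', re.MULTILINE)


def comment_out_solver_type(prototxt_text):
    """Return the solver text with every `type: "SGD"` line commented out."""
    return _SGD_LINE.sub(r'\1# \2', prototxt_text)


if __name__ == "__main__":
    # Hypothetical solver contents, used only to show the effect.
    example = 'net: "train_val.prototxt"\nbase_lr: 0.001\ntype: "SGD"\nmax_iter: 1000\n'
    print(comment_out_solver_type(example))
```

Lines naming any other solver type are left untouched, so the function is safe to run more than once on the same file.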

Please get back to us if you run into any issues.

Regards,

Deepthi Raj.

idata · ‎09-11-2018

Hi Cosma,

Could you please confirm whether the solution provided worked for you?

Regards,

Deepthi Raj

CBell1 · ‎09-11-2018

Hi Deepthi Raj,

I'm sorry for the late answer.

Thank you for your suggestion. I used your example without changes, using the following path:

/glob/intel-python/python3/bin/caffe/

I received the following error:

"

WARNING: Logging before InitGoogleLogging() is written to STDERR

F0911 10:30:07.087224 129990 io.cpp:82] Check failed: fd != -1 (-1 vs. -1) File$

*** Check failure stack trace: ***

/var/spool/torque/mom_priv/jobs/162226.c009.SC: line 3: 129990 Aborted

"

I cannot access the directory ../jobs: permission denied.

What could be wrong?

Additional questions:

1 - I would like to use my own solver.prototxt. In which folder do I have to put it?

'/models/intel_optimized_models/ssd/VGGNet/

2 - Is it possible to use the GoogLeNet model?

FYI: Sorry, unfortunately I will not be able to answer before Thursday.

Thank you for your support!

Best Regards

idata · ‎09-12-2018

Hi Cosma,

Please specify your solver.prototxt path instead of "/models/intel_optimized_models/ssd/VGGNet/VOC0712/SSD_300x300/solver.prototxt"

The GoogLeNet topology can also be used for Caffe training, but you need to provide the corresponding solver.prototxt.
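The "Check failed: fd != -1" message quoted earlier in the thread appears to be what Caffe prints when it cannot open a file, so it can help to check the paths before calling caffe.SGDSolver. The sketch below is only an illustration and not part of Caffe: `missing_solver_files` is a hypothetical helper, and it assumes that relative paths in the `net`, `train_net` and `test_net` fields of the solver are resolved against the directory the job runs in.

```python
import os
import re

# Solver fields that name a network definition file.
_NET_FIELDS = re.compile(
    r'^[ \t]*(net|train_net|test_net)[ \t]*:[ \t]*"([^"]+)"', re.MULTILINE
)


def missing_solver_files(solver_path, run_dir="."):
    """List the files a solver needs that cannot be found.

    Returns [solver_path] if the solver itself is missing; otherwise returns
    every net/train_net/test_net file it names that does not exist.
    Commented-out lines (starting with #) are ignored.
    """
    if not os.path.isfile(solver_path):
        return [solver_path]
    with open(solver_path) as handle:
        text = handle.read()
    missing = []
    for _field, ref in _NET_FIELDS.findall(text):
        candidate = ref if os.path.isabs(ref) else os.path.join(run_dir, ref)
        if not os.path.isfile(candidate):
            missing.append(candidate)
    return missing
```

If the returned list is empty, the solver can be created as in the sample program. Otherwise the list names the files to copy over or the paths to correct.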

Regards,

Dona

CBell1 · ‎09-13-2018

Hi Dona,

The workaround works: the system started initializing the solver using my parameters, thank you!

Now there are two errors in the output saved in the .py.e12345 file that I would like to clarify:

1 - At the beginning of the process:

WARNING: Logging before InitGoogleLogging() is written to STDERR

2 - at the end of the process:

/var/spool/torque/mom_priv/jobs/163020.c009.SC: line 3: 95903 Bus error python3 myapplication1.py

How can I fix these?

Thank you

idata · ‎09-13-2018

Hi Cosma,

Please share the python file and the dependent files so that we could run and check it from our side.

Regards,

Deepthi Raj

CBell1 · ‎09-13-2018

OK,

please see files attached.

Thank you

idata · ‎09-14-2018

Hi Cosma,

Thanks for sharing.

We need the python file that you used to run it from our side. Please share the below mentioned files.

1. myapplication2.py

2. train_val.prototxt

3. train and test lmdb files

4. labelmap.prototxt

5. test_name_size.txt

Regards,

Deepthi Raj

CBell1 · ‎09-14-2018

Hi Deepthi Raj,

The data.mdb files are too big for an scp transfer; the system drops the connection after some time.

Anyway, below is the content of myapplication2.py; it's really simple (poor):

import caffe

import os

solver=caffe.SGDSolver('/home/u18921/workspaceCZOO/ncappzoo/apps/dogsvscats/bvlc_googlenet/org/solver.prototxt')

solver.solve()

Please let me know if there is anything to add to improve the process.

At the same time, I have another question: how can I execute "make" to compile the data saved in my home directory (in Colfax)?

This question is linked to this case, because running make for the application I'm using generates the lmdb files and other files.

Thank you

CBell1 · ‎09-14-2018

Hi Deepthi Raj,

I have the train and test lmdb files. Please let me know where I can find a repository for sharing them, because there is no option to attach them to this message.

Thank you

CBell1 · ‎09-14-2018

Hi Deepthi Raj,

I found that I can attach files simply by changing the visualization mode :-)

OK, please, see attached files as requested.

Exceptions:

train_lmdb_data.mdb

val_lmdb_data.mdb

These files are too big to attach.

Please, let me know your feedback

Thank you

idata · ‎09-16-2018

Hi Cosma,

Thanks for sharing :)

As mentioned in the previous conversation, please share the below files as well so that we can check where exactly the problem comes in.

1. labelmap.prototxt

2. train_val.prototxt

3. test_name_size.txt

A bus error mainly occurs when a process tries to access an invalid memory address.

Regards,

Deepthi Raj.

CBell1 · ‎09-17-2018

Hi Deepthi Raj,

Thanks for your reply.

Sorry, what kind of memory do you mean?

I would like to clarify the context: I'm trying to follow the document below to customize data using Caffe for NCS:

https://movidius.github.io/blog/deploying-custom-caffe-models/

Below my answer to your request:

2. train_val.prototxt attached

1. labelmap.prototxt

How can I identify the labelmap.prototxt that I have used for data preparation?

As I mentioned, I'm using NCSDK v2.05.

Below the list I have on my computer used to prepare the data:

/opt/movidius/ssd-caffe/data/VOC0712/labelmap_voc.prototxt

/opt/movidius/ssd-caffe/data/coco/labelmap_coco.prototxt

/opt/movidius/ssd-caffe/data/ILSVRC2016/labelmap_ilsvrc_det.prototxt

/opt/movidius/ssd-caffe/data/ILSVRC2016/labelmap_ilsvrc_clsloc.prototxt

3. test_name_size.txt

I do not have a test_name_size.txt file.

Do I have to create one?

Do you have some suggestions about that?

Attached is the Makefile I used for dataset preparation and to create the lmdb files.

The solver.prototxt is attached again as well.

I hope this helps.

Many thanks again

idata · ‎09-18-2018

Hi Cosma,

Could you please clarify on below queries -

1. Is the Makefile that you shared the same one that you used to create the lmdb? We ask because it contains sudo commands, and the Caffe path points to /opt/movidius/caffe, which does not exist in Dev Cloud.

2. Are you creating lmdb in another machine and trying to train in Dev Cloud?

I am also attaching a Makefile and create-lmdb.sh that worked for us. Could you please try these on Dev Cloud and train Caffe using the generated lmdb?

If this also does not help, please let us know a convenient time for you so that we can set up a Skype call and see what exactly the issue is.

Regards,

Deepthi

CBell1 · ‎09-18-2018

Hi Deepthi,

2. Are you creating lmdb in another machine and trying to train in Dev Cloud?

Yes, I am!

Many thanks for the new Makefile, it addressed one of my doubts!

I have only changed CAFFE_PATH to the path in the Colfax home directory assigned to my user:

/home/u18921/workspaceCZOO/ncappzoo/apps/dogsvscats/bvlc_googlenet/org

This is the same path I used in myapplication2.py.

I have already tried your Makefile, but there is an error: [unzip] 9 error.

The data paths in the Makefile are OK.

I suppose the system does not recognize the existing train.zip and test1.zip as ZIP files.

So I am uploading the original files to Colfax again to replace them.

Then I will run make again and keep you informed.
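One way to confirm whether an uploaded archive survived the transfer is Python's standard zipfile module. This is a minimal sketch, assuming the archives sit in the current directory under the names mentioned above; `check_zip` is a hypothetical helper written for this example.

```python
import zipfile


def check_zip(path):
    """Return (ok, message) describing whether `path` is a readable ZIP archive."""
    if not zipfile.is_zipfile(path):
        return False, "not a ZIP archive (missing, truncated, or wrong format)"
    with zipfile.ZipFile(path) as archive:
        bad = archive.testzip()  # name of the first corrupt member, or None
    if bad is not None:
        return False, "corrupt member: " + bad
    return True, "ok"


if __name__ == "__main__":
    # Archive names taken from the post above; adjust the paths as needed.
    for name in ("train.zip", "test1.zip"):
        print(name, check_zip(name))
```

Running this on the node before calling make shows whether the problem is in the archives themselves or somewhere else in the Makefile.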

Many thanks again for your useful suggestions

Cosma

CBell1 · ‎09-18-2018

Hi Deepthi,

One last question: I am running the Makefile for dataset preparation on the Colfax login server.

Is that right?

Is it also possible to use a Colfax acceleration node for compiling and data preparation?

Thank you

Cosma

idata · ‎09-18-2018

Hi Cosma,

The Makefile has to be executed from a compute node, not from the login node. Otherwise, it will throw a memory error while creating the lmdb.

Enter the compute node using the command "qsub -I" and then execute the Makefile.

Please let me know if this works for you.

Regards,

Deepthi Raj

CBell1 · ‎09-19-2018

Hi Deepthi,

I ran "qsub -I Makefile".

Now I am automatically at the prompt [u18921@c009-n089 ~]$

In this state, none of the commands ("qstat", "nano", etc.) are found.

Questions:

Sorry, the information for Colfax compute nodes describes using Jupyter Notebook, but I'm using the terminal!

1 - How can I verify the status of the job?

2 - Why does the "ls" command show /home/u18921 instead of the folder where I started the job?

I'm available for a conf call today at 4:30pm CET

Below the actual output:

[u18921@c009 dogsvscats]$

[u18921@c009 dogsvscats]$ qsub -I Makefile

qsub: waiting for job 165638.c009 to start

qsub: job 165638.c009 ready

#

# Date: Wed Sep 19 00:54:28 PDT 2018

# Job ID: 165638.c009

# User: u18921

# Resources: neednodes=1:ppn=2,nodes=1:ppn=2,vmem=92gb,walltime=06:00:00

#

[u18921@c009-n089 ~]$

[u18921@c009-n089 ~]$ qstat

-bash: qstat: command not found

[u18921@c009-n089 ~]$ nano Makefile

-bash: nano: command not found

[u18921@c009-n089 ~]$

Thank you!

Cosma

idata · ‎09-19-2018

Hi Cosma,

You can execute the Makefile inside compute node in two ways.

A.) Submit the job via qsub

1. In the same folder as the Makefile, create a file "myjob". Then add the following lines.

#PBS -l nodes=1

cd $PBS_O_WORKDIR

make

The first line is a special command that requests one compute node. The second line ensures that the script runs in the same directory from which you submitted it. The third line executes the Makefile.

2. You can now submit this job as shown below:

[u100@c009 ~]# qsub myjob

This command will return a Job ID, which is the tracking number for your job.

You can track the job with:

[u100@c009 ~]# qstat

Once the job is completed, the output will be in the files:

[u100@c009 ~]# cat myjob.oXXXXXX

[u100@c009 ~]# cat myjob.eXXXXXX

Here 'XXXXXX' is the Job ID. The .o file contains the standard output stream, and the .e file contains the error stream.
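The steps in option A can also be driven from Python, which may be convenient when the same job is submitted repeatedly. The sketch below is an assumption-laden illustration rather than an official Colfax tool: `build_job_script` and `submit` are hypothetical helper names, the directive is written as `#PBS` with no space after the `#` (the form Torque recognizes), and `submit` only works on a machine where the `qsub` command is available.

```python
import subprocess


def build_job_script(command="make"):
    """Return the text of a job script that runs `command` in the submit directory."""
    lines = [
        "#PBS -l nodes=1",    # request one compute node
        "cd $PBS_O_WORKDIR",  # start in the directory qsub was called from
        command,              # the work itself, e.g. make or python3 myapplication2.py
    ]
    return "\n".join(lines) + "\n"


def submit(job_path="myjob", command="make"):
    """Write the job script, pass it to qsub, and return whatever qsub prints."""
    with open(job_path, "w") as handle:
        handle.write(build_job_script(command))
    result = subprocess.run(
        ["qsub", job_path],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()  # normally the Job ID
```

Calling `submit("myjob", "make")` from the folder that contains the Makefile mirrors steps 1 and 2 above; the same helper could submit the training script by passing a different command.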

B.) Go to a compute node and then execute the Makefile

1. Enter a compute node using the command "qsub -I"

This will take you to your home folder on the compute node

2. Go to the folder which contains the Makefile

3. Run the command "make"

Please check if this helps, otherwise we can have the call.

Regards,

Deepthi Raj

CBell1 · ‎09-19-2018

Hi Deepthi,

I have applied the option A)

The make command started on Colfax Nodes and finished without issue generating the expected files.

Attached the Makefile and myjob3_Make.py.e165699 file output.

No error! :-)

After this step, I ran the training using the solver on the Colfax nodes.

Attached are the file that executes the operation, myapplication2.py, myappl and the generated error file, myjob2.py.e165747.

The error is related to the train_val.prototxt file, which I do not have.

Do I have to copy this file from the GoogLeNet model I have on my PC and save it to CAFFE_PATH=/home/u18921/workspaceCZOO/ncappzoo/apps/dogsvscats/bvlc_googlenet/org/ (the same path used in the Makefile as CAFFE_PATH)?

or

Do I have to change the CAFFE_PATH in Makefile to CAFFE_PATH?=/glob/intel-python/python3/bin (the Caffe path in Colfax systems) and re-compile all?

Thank you again

Cosma