Neural Compute Stick 2 + Python + Raspberry Pi Problems

Drakopoulos__Fotis · ‎01-18-2019

Hello all,

I purchased an NCS2 and have been trying for a while to run my audio signal processing model on it using a Raspberry Pi and Python. I opened this topic to summarize my problems so far, in case someone has encountered something similar.

First, I converted my model from TensorFlow to IR with the FP16 data type so that it can run on the NCS2 (MYRIAD plugin) connected to the Raspberry Pi. When I run it on the NCS2, the output is always very noisy, with a roughly periodic pattern, even when the input is all zeros, and I still don't know what is causing this. I've also noticed that the model output has a 'float32' dtype. I am not sure whether that is a problem, but I would expect a 'float16' dtype from an FP16-precision model.

While debugging, I checked whether my model contains layers that the MYRIAD plugin does not support. Several layers of type "Const" are reported as unsupported, and so is the main input layer. What should I do in this case? Is there a way to fix this?
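As a first step when looking into layer reports like this, it can help to see which layer types the IR file actually contains. The following is a minimal sketch using only the Python standard library. It assumes the IR .xml stores each layer as a <layer> element with a type attribute (as in the excerpt posted later in this thread). The sample XML and the function name layer_types are hypothetical, and the sketch only inspects the file; it does not ask the MYRIAD plugin what it supports.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def layer_types(ir_xml_text):
    """Count the layers in an IR .xml document by their `type` attribute."""
    root = ET.fromstring(ir_xml_text.strip())
    # iter("layer") visits every <layer> element regardless of nesting depth.
    return Counter(layer.get("type", "<missing>") for layer in root.iter("layer"))


# Hypothetical, heavily trimmed IR fragment used only for illustration.
SAMPLE_IR = """
<net name="example">
  <layers>
    <layer id="0" name="input_1" precision="FP16" type="Input"/>
    <layer id="1" name="weights_1" precision="FP16" type="Const"/>
    <layer id="2" name="conv2d_1" precision="FP16" type="Convolution"/>
    <layer id="3" name="weights_2" precision="FP16" type="Const"/>
  </layers>
</net>
"""

counts = layer_types(SAMPLE_IR)
for layer_type, n in counts.most_common():
    print(f"{layer_type}: {n}")
```

For a real model, the text of the converted .xml file would be passed in instead of SAMPLE_IR.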

Thank you in advance,

Fotis

Drakopoulos__Fotis · ‎02-20-2019

Since there have been no responses, I'm posting my progress so far in case someone has experienced something similar.

I've succeeded in converting my audio processing Keras model to IR (by changing some model parameters and using the --disable_nhwc_to_nchw flag), but performance on the NCS2 is still slow. I converted the model to both FP32 and FP16 and ran them on the CPU and on the MYRIAD of an Ubuntu machine for a sample input. I have attached two text files containing the performance counts for both cases.

The output is more or less the same in both cases (and matches the desired one), but, as you can see, the model executes in about 8 ms on average on the CPU (pretty fast), whereas it needs about 36 ms on the MYRIAD (about 4.5 times slower). The input layer is also reported as unsupported on the MYRIAD, although I am not sure whether that affects the output or the performance. In general, all layers seem to run slower on the MYRIAD (especially the deconvolution ones).

I have tried several parameter changes in my model to improve performance on the NCS2, but so far without progress. I have been stuck on this for a while now, so any help is really appreciated.

Fotis

Hill__Aaron · ‎02-20-2019

Fotis,

It is not clear whether those results are from the Pi or from a host PC. The title says Raspberry Pi, but you also state in your post that you "used them on the CPU and on the MYRIAD of an ubuntu machine".

I am developing on a Windows machine and have a canned MNIST network I have been testing with. Running inference on the host machine's CPU (a 2.8 GHz quad-core Xeon) with a batch size of 1 takes about 3.3 s to process 10K samples, but on the NCS2 it takes 22.3 s (about 7 times longer). With a batch size of 32, the CPU takes 0.56 s and the NCS2 takes 7.11 s. So you can see that a powerful desktop CPU is going to beat the MYRIAD on timing, probably always.
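The timings quoted above can be turned into throughput and slowdown figures with a few lines of Python. This is only arithmetic on the numbers in this post; it assumes the batch-32 runs also processed 10K samples, which the post does not state explicitly. The helper names are hypothetical.

```python
def throughput(samples, seconds):
    """Samples processed per second."""
    return samples / seconds


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slower run took."""
    return slow_seconds / fast_seconds


SAMPLES = 10_000  # assumed to be the same for both batch sizes

# Wall-clock seconds reported in the post, keyed by batch size.
timings = {
    1: {"cpu": 3.3, "ncs2": 22.3},
    32: {"cpu": 0.56, "ncs2": 7.11},
}

for batch, t in timings.items():
    print(
        f"batch {batch}: "
        f"CPU {throughput(SAMPLES, t['cpu']):.0f} samples/s, "
        f"NCS2 {throughput(SAMPLES, t['ncs2']):.0f} samples/s, "
        f"NCS2 slower by {slowdown(t['ncs2'], t['cpu']):.1f}x"
    )
```

On these figures, batching speeds up the CPU run more (3.3 s to 0.56 s) than the NCS2 run (22.3 s to 7.11 s), so the gap between the two widens at batch size 32.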

I think you will see an improvement when comparing the Pi CPU (ARM) to the NCS2 on a Pi. If that already was your setup and the Pi's ARM core is doing 8x better than the NCS2, then I am not sure what is going on. Perhaps it is related to this bit of information: https://hackaday.io/project/163679-aipi ...maybe the Pi is bottlenecking the NCS2.

Hyodo__Katsuya · ‎02-20-2019

Because MKL-DNN and clDNN perform well, the CPU and GPU are faster than the NCS2. If you have an "Atom" or "Core Series" CPU, or an "Intel HD Graphics Series" GPU, that will be faster. This is based on my own benchmark results. In my environment, the difference in performance is a factor of two to four. Also, some layers in the NCS API do not seem to perform well enough, for example the softmax layer, among others.

Drakopoulos__Fotis · ‎02-21-2019

Hyodo, Katsuya wrote:
Because MKL-DNN and clDNN perform well, the CPU and GPU are faster than the NCS2.
If you have an "Atom" or "Core Series" CPU, or an "Intel HD Graphics Series" GPU, that will be faster.
This is based on my own benchmark results.
In my environment, the difference in performance is a factor of two to four.
Also, some layers in the NCS API do not seem to perform well enough,
for example the softmax layer, among others.

Okay, I can understand this, but in my case even the CPU of a Raspberry Pi, for instance, is still almost twice as fast as the MYRIAD. Also, my model doesn't use a Softmax layer, so I still need to find what is causing this degradation in performance.

Hyodo__Katsuya · ‎02-21-2019

I understand the situation. This may be an implementation problem. If it is not too much trouble, please take a look at the following projects, where NCS2 + Raspberry Pi 3 achieves four times the performance of the CPU.

https://github.com/PINTO0309/MobileNet-SSD-RealSense
https://github.com/PINTO0309/OpenVINO-YoloV3

The loss due to image transfer time over USB 2.0 is large. Performance may be improved by processing multiple requests asynchronously.
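The idea of keeping several requests in flight can be sketched generically as follows. This is not the Inference Engine API: fake_infer is a hypothetical stand-in that just sleeps to imitate a blocking call dominated by transfer and device wait time, and the latency and in-flight count are made-up values. It only illustrates why overlapping requests can hide per-request waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(frame, latency=0.02):
    """Hypothetical stand-in for one blocking inference call."""
    time.sleep(latency)  # imitates waiting on the USB transfer and the device
    return frame * 2     # dummy "result"


def run_sequential(frames):
    # One request at a time: total time is roughly len(frames) * latency.
    return [fake_infer(f) for f in frames]


def run_pipelined(frames, in_flight=4):
    # Keep up to `in_flight` requests outstanding; map() preserves input order.
    with ThreadPoolExecutor(max_workers=in_flight) as pool:
        return list(pool.map(fake_infer, frames))


frames = list(range(16))

start = time.perf_counter()
seq = run_sequential(frames)
t_seq = time.perf_counter() - start

start = time.perf_counter()
par = run_pipelined(frames)
t_par = time.perf_counter() - start

print(f"sequential: {t_seq:.3f} s, pipelined: {t_par:.3f} s")
```

Whether a real device scales like this depends on how many requests it can actually service concurrently, so the simulated gain should not be read as a prediction for the NCS2.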

Drakopoulos__Fotis · ‎03-14-2019

I still haven't been able to solve the slow performance of my model. I tried training with both NHWC and NCHW, but it didn't help.

As mentioned in my previous post, the delay seems to be caused by the Deconvolution layers. Below is one deconvolution layer from the converted model's XML file:

		<layer id="87" name="conv2d_transpose_7/conv2d_transpose" precision="FP16" type="Deconvolution">
			<data auto_pad="same_upper" kernel="15,1" output="1" pads_begin="6,0" pads_end="7,0" strides="2,1"/>
			<input>
				<port id="0">
					<dim>1</dim>
					<dim>32</dim>
					<dim>512</dim>
					<dim>1</dim>
				</port>
			</input>
			<output>
				<port id="2">
					<dim>1</dim>
					<dim>1</dim>
					<dim>1024</dim>
					<dim>1</dim>
				</port>
			</output>
			<blobs>
				<weights offset="0" size="960"/>
			</blobs>
		</layer>
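The attributes of this layer can be cross-checked with a little arithmetic. The sketch below assumes the standard transposed-convolution size relation, out = (in - 1) * stride + kernel - pads_begin - pads_end, a dense (ungrouped) kernel without bias, and 2 bytes per FP16 weight. The variable and function names are hypothetical; the numbers are taken from the XML above.

```python
def deconv_output_len(in_len, kernel, stride, pad_begin, pad_end):
    """Output length of a transposed convolution along one axis."""
    return (in_len - 1) * stride + kernel - pad_begin - pad_end


# Values read from the layer XML (dimension order assumed to be N, C, H, W).
in_c, in_h, in_w = 32, 512, 1
out_c = 1
kernel_h, kernel_w = 15, 1
stride_h, stride_w = 2, 1
pads_begin = (6, 0)
pads_end = (7, 0)
BYTES_PER_FP16 = 2

out_h = deconv_output_len(in_h, kernel_h, stride_h, pads_begin[0], pads_end[0])
out_w = deconv_output_len(in_w, kernel_w, stride_w, pads_begin[1], pads_end[1])

# Dense kernel: one weight per (input channel, output channel, kernel tap).
weight_bytes = in_c * out_c * kernel_h * kernel_w * BYTES_PER_FP16

# Each input element is multiplied with every kernel tap for every output channel.
macs = in_h * in_w * in_c * out_c * kernel_h * kernel_w

print(f"output: {out_c} x {out_h} x {out_w}")
print(f"weights: {weight_bytes} bytes")
print(f"multiply-accumulates: {macs}")
```

Under these assumptions the computed output shape and weight size agree with the <dim> entries and the size="960" attribute in the XML. The multiply-accumulate count is small, which may hint that the time is going to something other than raw arithmetic, although nothing in this thread confirms that.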

Does anyone have an idea what is causing this slow computation, or how to change the deconv layers so that they run faster? Any feedback would be much appreciated :)

Fotis