Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Raspberry Pi + NCS2: performance comparison

hamze60
New Contributor I

Hello,

I've tested the configurations below for a MobileNet+SSD object detection model and got the following results:

"OPENCV":
    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)    
    
"OPENVINO_CPU":
    net = cv2.dnn.readNet(args["xml"], args["bin"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)    

"OPENVINO_NCS":
    net = cv2.dnn.readNet(args["xml"], args["bin"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)    
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD) 

╔═════════════════════════╦══════════╦══════════════════╦══════════════════╗
║                         ║ OpenCV 4 ║ OpenCV-OpenVINO  ║ OpenCV-OpenVINO  ║
║                         ║          ║     (IR FP32)    ║  + NCS2(IR FP16) ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Ubuntu 18 on VirtualBox ║  11 FPS  ║      26 FPS      ║         ?        ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Raspberry Pi 3 B+       ║  0.6 FPS ║         ?        ║       8 FPS      ║
╚═════════════════════════╩══════════╩══════════════════╩══════════════════╝

Based on what is reported on the official NCS2 homepage, I expected better performance from the NCS2, but I have seen similar numbers reported by other people. I have the following questions:

Q.1) Is it possible that the communication between the Raspberry Pi and the NCS2 is the system bottleneck? If I move to a board with a USB 3.0 port, will it get better?

Q.2) My NCS2 is properly detected by VirtualBox and I can run the demo from the get-started page, but when running programs in Python, I get the error below:

E: [xLink] [    782564] dispatcherEventSend:908	Write failed event -1
E: [xLink] [    794413] dispatcherEventReceive:308	dispatcherEventReceive() Read failed -1 | event 0x7fd96affce80 
E: [xLink] [    794413] eventReader:256	eventReader stopped
E: [ncAPI] [    794413] ncGraphAllocate:1409	Can't read input tensor descriptors of the graph, rc: X_LINK_ERROR

Q.3) While on Ubuntu I can run FP32 models on the CPU target, running the same program on the Raspberry Pi generates "failed to initialize Inference Engine backend: Cannot find plugin to use".

Thanks

22 Replies
Dmitry_K_Intel3
Employee

Please share a reference to the mentioned model and your measurement approach.

hamze60
New Contributor I


Dmitry Kurtaev (Intel) wrote:

Please share a reference to the mentioned model and your measurement approach.

Hello,
The previously reported FPS values were for my whole program. For more precision, the table below reports only the net inference FPS, measured as follows:

    # measured inside the per-frame loop:
    start_it = time.time()
    detections = net.forward()
    end_it = time.time()
    total_time += (end_it - start_it)
    frame_cnt += 1

╔═════════════════════════╦══════════╦══════════════════╦══════════════════╗
║     Object detection    ║ OpenCV 4 ║ OpenCV-OpenVINO  ║ OpenCV-OpenVINO  ║
║      MobileNet+SSD      ║          ║     (IR FP32)    ║  + NCS2(IR FP16) ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Ubuntu 18 on VirtualBox ║  12 FPS  ║      37 FPS      ║         ?        ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Raspberry Pi 3 B+       ║  0.6 FPS ║         ?        ║      12 FPS      ║
╚═════════════════════════╩══════════╩══════════════════╩══════════════════╝

The Raspberry Pi + NCS2 is around 20 times faster than the Raspberry Pi alone. I am curious whether there is a bottleneck in my setup (like using USB 2.0 while the NCS2 supports USB 3.0) and whether I can get more performance.

The MobileNet+SSD model for the OpenCV 4 test is the original one from here (deploy version). I then converted it to an IR model with the Model Optimizer command:

    python3 mo.py --input_model $model_file --data_type "FP32" (or "FP16") --framework caffe

Anyway, all 3 models are available here for download.
Thank you very much

Dmitry_K_Intel3
Employee

@ahangari, hamzeh, There is an option in OpenCV to specify which device to use for computations: setPreferableTarget. By default it uses the CPU, so 0.6 FPS is the performance of OpenCV on the CPU. You need to specify the Myriad device.

 

Try this:

net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

 

Note that in my example there is no IR; OpenCV builds the Inference Engine graph internally at runtime.

hamze60
New Contributor I

Dmitry Kurtaev (Intel) wrote:

@ahangari, hamzeh, There is an option in OpenCV to specify which device to use for computations: setPreferableTarget. By default it uses the CPU, so 0.6 FPS is the performance of OpenCV on the CPU. You need to specify the Myriad device.

 

Try this:

net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

 

Note that in my example there is no IR; OpenCV builds the Inference Engine graph internally at runtime.

As far as I understand, you are saying there is no need to convert to an IR model; OpenCV-OpenVINO does it internally. I tested what you suggested and it did not change performance (on Ubuntu 18 on VirtualBox, performance even dropped slightly). I think this confirms that my conversion of the original Caffe model to an IR model was correct.
Is there any other suggestion? What do you think about the Raspberry Pi's USB 2.0? Can it be a bottleneck?

 

Dmitry_K_Intel3
Employee

Could you please share how many FPS the following two configurations give?

 

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)  

 

and

 

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

 

Is the second one similar to

    net = cv2.dnn.readNetFromCaffe(args["xml"], args["bin"])
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

?     

hamze60
New Contributor I

Dmitry Kurtaev (Intel) wrote:

Could you please share how many FPS the following two configurations give?

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)  

and

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

Is the second one similar to

    net = cv2.dnn.readNetFromCaffe(args["xml"], args["bin"])
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

?     

Thanks. You listed 3 cases; I summarized the results in a table (Raspberry Pi only). Your case 3 gives an error, because readNetFromCaffe cannot read the IR format, so I changed readNetFromCaffe to readNet. I ran the demo for around 1 minute to get stable output.

Yes, cases 2 and 3 give the same performance.

╔═══════════════════╦════════════════════════════╦═════════════════════════════╦═══════════════╗
║ Object detection  ║        Your case 1:        ║ Your case 2:                ║ Your Case 3:  ║
║ MobileNet+SSD     ║     no IR (original),      ║ no IR (auto-IR conversion?) ║ IR FP16,      ║
║                   ║       BACKEND_OPENCV,      ║ TARGET_MYRIAD               ║ TARGET_MYRIAD ║
║                   ║         TARGET_CPU         ║                             ║               ║
╠═══════════════════╬══════════╦═════════════════╬═════════════════════════════╩═══════════════╣
║   Which OpenCV?   ║ OpenCV 4 ║ OpenCV-OpenVINO ║               OpenCV-OpenVINO               ║
╠═══════════════════╬══════════╬═════════════════╬═════════════════════════════╦═══════════════╣
║ Raspberry Pi 3 B+ ║  0.6 FPS ║ 1.4 FPS         ║            12 FPS           ║     12 FPS    ║
╚═══════════════════╩══════════╩═════════════════╩═════════════════════════════╩═══════════════╝

 

Dmitry_K_Intel3
Employee

Got it, thank you! Let me check your models later to reproduce these numbers and leave some comments on how you can improve the overall efficiency.

Dmitry_K_Intel3
Employee

The thing is that the Raspberry Pi has USB 2.0, so to reduce the data transfer delay you can pass uint8 data instead of float32; a 300×300×3 uint8 blob is 270 KB versus 1080 KB as float32, i.e. 4× less data per frame over the bus. Using IR, you may include the preprocessing (scaling and mean subtraction) in the model itself. In the case of the original model, you can pass it via setInput.

 

Please try the following code.

import cv2 as cv
import numpy as np
import time

# Load the model
net = cv.dnn.readNet('MobileNetSSD/models/MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD/models/MobileNetSSD_deploy.prototxt')

# Specify target device
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

# Read an image

img = cv.imread('/home/pi/004545.jpg')

# Prepare input blob and perform an inference
blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

# Warmup
out = net.forward()

start = time.time()

numRuns = 100
for _ in range(numRuns):
  net.forward()

print('FPS: ', numRuns / (time.time() - start))

For my Raspberry Pi 2 model B I can achieve the following efficiency:

NCS1: 9.78 FPS

NCS2: 19.8 FPS

hamze60
New Contributor I

Thanks a lot for following this thread!

I will run your code and report back by tomorrow. I am also thinking about USB 2.0 as the bottleneck, but I'm not sure yet. Can you also share results from running the same code on your own PC + NCS with USB 3.0? Assuming the NCS processing core runs at the same speed on all systems, this would reveal how much performance is lost to USB 2.0.

 

 

Dmitry_K_Intel3
Employee

I got the following numbers using my Ubuntu PC with USB 2.0 and USB 3.0 ports (the code sample is the same as above):

| Hardware |  USB 2.0 |  USB 3.0 | RPI (USB 2.0) |
|----------|----------|----------|---------------|
|   NCS 1  | 10.4 FPS | 10.9 FPS |      9.78 FPS |
|   NCS 2  | 21.1 FPS | 26.5 FPS |      19.8 FPS |

So we lose about one frame per second going from the desktop to the RPi.

We are experimenting with the asynchronous API of the Inference Engine now, and the numbers show that, without any processing besides inference, we can achieve about 5% more FPS for MobileNetSSD (desktop app, USB 3.0, NCS 1). The best thing about asynchronous invocations is that they can hide data-transfer bottlenecks from the resulting FPS. Please keep in touch on this thread and I can share the numbers with the asynchronous API for USB 2.0 on an Ubuntu machine, so we can compare whether it really reduces the difference between USB 2.0 and USB 3.0 for the NCS 2 significantly.

See https://github.com/opencv/opencv/pull/13694 for details.
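For reference, a minimal sketch of what such pipelining could look like from Python, assuming OpenCV >= 4.1 built with the Inference Engine backend (where net.forwardAsync() is exposed) and a hypothetical camera source:

import cv2 as cv
from collections import deque

net = cv.dnn.readNet('MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD_deploy.prototxt')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

cap = cv.VideoCapture(0)  # hypothetical camera source
pending = deque()         # in-flight asynchronous requests

while cv.waitKey(1) < 0:
    if len(pending) < 4:  # keep a few requests queued to hide transfer latency
        ok, frame = cap.read()
        if not ok:
            break
        blob = cv.dnn.blobFromImage(frame, size=(300, 300), ddepth=cv.CV_8U)
        net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])
        pending.append(net.forwardAsync())  # returns immediately
    # Collect results that are already finished; wait_for(0) polls without blocking
    while pending and pending[0].wait_for(0):
        out = pending.popleft().get()
        # ... decode detections from `out` here ...

While one blob is being transferred over USB, the device can still be computing a previous request, which is how the transfer cost gets hidden.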

 

Reinberger__Thomas

I also tried out performance on a MobileNetSSD with the repo found at https://github.com/PINTO0309/MobileNet-SSD-RealSense . Dmitry, I optimized the original code from that repo

blob = cv2.dnn.blobFromImage(color_image, 0.007843, size=(300, 300), mean=(127.5,127.5,127.5), swapRB=False, crop=False)

to

blob = cv2.dnn.blobFromImage(color_image, size=(300, 300), ddepth=cv2.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

and indeed the frame rate goes up from about 9 to 15 FPS. BUT, as can be seen in the output of the object detection demos, the network doesn't predict properly anymore (wrong labels, bounding boxes nearly filling the entire output screen).

The code I'm using with the PiCam can be found here: https://gist.github.com/treinberger/c63cb84979a4b3fb9b13a2d290482f4e , but the USB Cam code from the repo above is basically the same.

What could be the problem? 

 

Reinberger__Thomas

I noted in my previous, not-yet-moderator-approved post that optimizing the performance by offloading mean subtraction and scaling and using 8U instead of 32F didn't work for me (although I'm using MobileNetSSD from TF, not the MobileNetSSD Caffe model). I found out that *scaling* via setInput(...) breaks the prediction.

blob = cv2.dnn.blobFromImage(color_image, size=(300, 300), scalefactor = 0.007843, swapRB=False, crop=False, ddepth=cv2.CV_32F)
net.setInput(blob, mean=(127.5, 127.5, 127.5))

works well, whereas this

blob = cv2.dnn.blobFromImage(color_image, size=(300, 300), swapRB=False, crop=False, ddepth=cv2.CV_32F)
net.setInput(blob, scalefactor = 0.007843, mean=(127.5, 127.5, 127.5))

doesn't. So I guess blobFromImage and setInput behave differently with respect to scaling and mean subtraction. Looking into https://github.com/opencv/opencv/blob/master/modules/dnn/src/dnn.cpp, it seems that setInput does scaling first and then mean subtraction:

impl->netInputLayer->scaleFactors[pin.oid] = scalefactor;

impl->netInputLayer->means[pin.oid] = mean;

, whereas blobFromImage does it the other way round:

images -= mean;

images *= scalefactor;
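
Whichever component applies them, the order matters enormously for this model. A quick numeric check with the scalefactor (1/127.5 ≈ 0.007843) and mean (127.5) used in this thread:

import numpy as np

x = np.array([0.0, 127.5, 255.0])   # representative pixel values

# Mean subtraction first, then scaling (the blobFromImage order):
print((x - 127.5) * (1.0 / 127.5))  # [-1.  0.  1.]  -- the expected input range

# Scaling first, then mean subtraction (the opposite order):
print(x * (1.0 / 127.5) - 127.5)    # [-127.5 -126.5 -125.5] -- far outside [-1, 1]

The second ordering pushes every pixel far outside the [-1, 1] range this preprocessing is meant to produce, which matches the broken predictions described above.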

 

Can anyone reproduce the problem with the MobileNetSSD Caffe model?

hamze60
New Contributor I

Thanks Dmitry!
I confirm that I also got 20 FPS from your benchmark code on the Raspberry Pi + NCS2 (instead of a constant image, I used camera input, to be sure that it does not affect performance). But there is a problem. Your setting:

blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

does not work for me. The object detector does not find meaningful objects, only junk ones. Can you check it? Before this, without ddepth=cv.CV_8U, I used the setting below, which worked for me and gave 12-14 FPS.

blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), scalefactor=1.0/127.5, size=(300, 300), mean=[127.5, 127.5, 127.5])
net.setInput(blob)

Thanks

Dmitry_K_Intel3
Employee

@ahangari, hamzeh, Is this for the IR model or for the Caffe model? If you use the IR model, perhaps you included the preprocessing normalization inside it, so the scalefactor and mean subtraction are not needed.
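
In other words, if the normalization was baked into the IR at conversion time (for example with Model Optimizer's --mean_values/--scale_values options), it must not be repeated at input time. A minimal sketch, assuming hypothetical IR file names:

import cv2 as cv

net = cv.dnn.readNet('mobilenet_ssd_fp16.xml', 'mobilenet_ssd_fp16.bin')  # hypothetical names
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

img = cv.imread('004545.jpg')
# The IR already performs the mean subtraction and scaling internally,
# so pass raw uint8 pixels and omit scalefactor/mean in setInput;
# repeating them here would apply the normalization twice.
blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob)
out = net.forward()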

hamze60
New Contributor I

@Dmitry Kurtaev

@Reinberger, Thomas

Thanks Dmitry!
I confirm that with the setting below, using the original Caffe model (no IR), the RPi + NCS2 worked and I got 20 FPS. Previously, with the IR FP16 model, the object detector behaved strangely, detecting meaningless objects.
More generally, since OpenVINO converts models to IR internally anyway, I do not know why we should use the optimizer and IR models at all.

blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

 

Dmitry_K_Intel3
Employee

ahangari, hamzeh, Actually, Model Optimizer supports more frameworks and topologies than OpenCV, so if a model is not supported in OpenCV directly, you may convert it to IR. Moreover, if you load FP16 IR models, peak memory consumption is lower than when loading an original FP32 model.

hamze60
New Contributor I

Hi Dmitry,

I am going to prepare the same comparison for YOLOv3, which is a heavier model. This should give a better measure of NCS2 performance than MobileNet+SSD.

Your previous suggestion, to use the original model directly (this time not Caffe but Darknet), did not work, so I converted it to IR myself, but I still have a problem reading the detection results. I asked a question about it in another thread. I have also noticed that I am not the only person having problems with YOLOv3; for example, see this one.

It would be great if you could help prepare this comparison too.

 

fu__cfu
Beginner

Dmitry Kurtaev (Intel) wrote:

The thing is that the Raspberry Pi has USB 2.0, so to reduce the data transfer delay you can pass uint8 data instead of float32. Using IR, you may include the preprocessing (scaling and mean subtraction) in the model itself. In the case of the original model, you can pass it via setInput.

 

 

Hi,

 

I am able to reproduce a similar FPS (19.8) with the NCS2; however, if I modify the code as follows,

import cv2 as cv
import numpy as np
import time

# Load the model
net = cv.dnn.readNet('MobileNetSSD/models/MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD/models/MobileNetSSD_deploy.prototxt')

# Specify target device
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

# Read an image

img = cv.imread('/home/pi/004545.jpg')

start = time.time()
numRuns = 100

for _ in range(numRuns):
  # Prepare input blob and perform an inference 
  blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U) 
  net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

  # Note: forward() is called here and again below, i.e. twice per iteration
  out = net.forward()

  net.forward()

print('FPS: ', numRuns / (time.time() - start))

 

Since I would like to process each frame of a video (rebuilding the input blob every iteration), the FPS drops to 6.7. Any suggestions?

Kulecz__Walter
New Contributor I

For my Raspberry Pi 2 model B I can achieve the following efficiency:

NCS1: 9.78 FPS

NCS2: 19.8 FPS

I am able to reproduce the result (19.8 FPS) with NCS2. However, if I rebuild the blob in every loop (because I am processing frames from a video), the FPS drops to 6.7. Any suggestions?

There are frame rates for bragging rights, and then there are real frame rates that include all the overhead needed to actually do something useful.

With multi-threaded code I'm able to get ~8.3 fps on a Pi 3 B+ with NCS2 and OpenVINO, sampling 5 ONVIF netcams with "real-time" monitoring on the attached monitor.

Basically, there is one thread per camera, and each camera thread writes to its own queue. Another thread reads each queue in sequence and does the inference, writing the output to a sixth queue. The main program (thread) reads this output queue and takes whatever action is required. A sketch of this pattern follows.
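
A minimal sketch of that pattern, assuming hypothetical camera URLs and the MobileNetSSD model used earlier in this thread (not the poster's actual code):

import cv2 as cv
import threading
from queue import Queue, Full

CAM_URLS = ['rtsp://cam1/stream', 'rtsp://cam2/stream']  # hypothetical ONVIF/RTSP sources

def camera_reader(url, frame_q):
    # One capture thread per camera, each writing to its own queue.
    cap = cv.VideoCapture(url)
    while True:
        ok, frame = cap.read()
        if not ok:
            continue
        try:
            frame_q.put_nowait(frame)  # drop the frame if the queue is still full
        except Full:
            pass

def inference_worker(frame_queues, result_q, net):
    # A single thread round-robins over the camera queues and runs inference.
    while True:
        for q in frame_queues:
            frame = q.get()
            blob = cv.dnn.blobFromImage(frame, size=(300, 300), ddepth=cv.CV_8U)
            net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])
            result_q.put((frame, net.forward()))

net = cv.dnn.readNet('MobileNetSSD_deploy.caffemodel', 'MobileNetSSD_deploy.prototxt')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

frame_queues = [Queue(maxsize=1) for _ in CAM_URLS]  # short queues keep frames fresh
result_q = Queue()
for url, q in zip(CAM_URLS, frame_queues):
    threading.Thread(target=camera_reader, args=(url, q), daemon=True).start()
threading.Thread(target=inference_worker, args=(frame_queues, result_q, net), daemon=True).start()

while True:
    frame, detections = result_q.get()  # the main thread consumes results and acts on them
    # ... draw boxes / trigger whatever action is required ...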

The same code on a faster Odroid XU4 (I hacked setupvars.sh to get it installed) gets about ~15 fps.

OTOH, the same code with the CPU target and no NCS on an i5-4200U gets ~21 fps; using the NCS2 and the MYRIAD target gets ~22 fps.

This suggests that in real usage the main bottleneck is not the actual inference but all the overhead of getting the data in and out and acting on the inference results.

 

 
