Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Raspberry Pi + NCS2: performance comparison

hamze60
New Contributor I
6,682 Views

Hello,

I've tested the configurations below for a MobileNet+SSD object detection model and got the following results:

"OPENCV":
    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)    
    
"OPENVINO_CPU":
    net = cv2.dnn.readNet(args["xml"], args["bin"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)    

"OPENVINO_NCS":
    net = cv2.dnn.readNet(args["xml"], args["bin"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)    
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD) 

╔═════════════════════════╦══════════╦══════════════════╦══════════════════╗
║                         ║ OpenCV 4 ║ OpenCV-OpenVino  ║ OpenCV-OpenVino  ║
║                         ║          ║     (IR FP32)    ║  + NCS2(IR FP16) ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Ubuntu 18 on VirtualBox ║  11 FPS  ║      26 FPS      ║         ?        ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Raspberry Pi 3 B+       ║  0.6 FPS ║         ?        ║       8 FPS      ║
╚═════════════════════════╩══════════╩══════════════════╩══════════════════╝

According to what is reported on the NCS2 official homepage, I expected better performance from the NCS2, but I saw similar performance reported by other people. I have the following questions:

Q.1) Is it possible that the communication between the Raspberry Pi and the NCS2 is the bottleneck of the system? If I move to a board with a USB 3.0 port, will it get better?

Q.2) My NCS2 is properly detected by VirtualBox and I can run the demo from the get-started page, but when running programs in Python I get the error below:

E: [xLink] [    782564] dispatcherEventSend:908	Write failed event -1
E: [xLink] [    794413] dispatcherEventReceive:308	dispatcherEventReceive() Read failed -1 | event 0x7fd96affce80 
E: [xLink] [    794413] eventReader:256	eventReader stopped
E: [ncAPI] [    794413] ncGraphAllocate:1409	Can't read input tensor descriptors of the graph, rc: X_LINK_ERROR

Q.3) On Ubuntu I could run FP32 models on the CPU target, but running the same program on the Raspberry Pi generates "failed to initialize Inference Engine backend: Cannot find plugin to use".

Thanks

0 Kudos
1 Solution
Dmitry_K_Intel3
Employee
6,617 Views

The thing is that the Raspberry Pi has USB 2.0, so to reduce the data transfer delay you can pass uint8 data instead of float32. With an IR model you may include the preprocessing (scaling and mean subtraction) in the model itself; with the original model you can pass it via setInput.
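For example, the conversion command would look roughly like this (a sketch: --mean_values/--scale_values are real Model Optimizer options, but the values here assume this particular MobileNet-SSD normalization; the Optimizer applies (input - mean_values) / scale_values):

python3 mo.py --input_model MobileNetSSD_deploy.caffemodel --framework caffe --data_type FP16 --mean_values [127.5,127.5,127.5] --scale_values [127.5,127.5,127.5]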

 

Please try the following code.

import cv2 as cv
import numpy as np
import time

# Load the model
net = cv.dnn.readNet('MobileNetSSD/models/MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD/models/MobileNetSSD_deploy.prototxt')

# Specify target device
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

# Read an image

img = cv.imread('/home/pi/004545.jpg')

# Prepare input blob and perform an inference
blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

# Warmup
out = net.forward()

start = time.time()

numRuns = 100
for _ in range(numRuns):
  net.forward()

print('FPS: ', numRuns / (time.time() - start))

For my Raspberry Pi 2 Model B I can achieve the following performance:

NCS1: 9.78 FPS

NCS2: 19.8 FPS

View solution in original post

0 Kudos
22 Replies
Dmitry_K_Intel3
Employee
5,815 Views

Please share a reference to the mentioned model and your performance measurement approach.

0 Kudos
hamze60
New Contributor I
5,815 Views


Dmitry Kurtaev (Intel) wrote:

Please share a reference to the mentioned model and your performance measurement approach.

Hello,
The previously reported FPS values were for my whole program. To be more precise, in the table below I report only the net inference FPS, measured as follows:

    # inside the per-frame loop:
    start_it = time.time()
    detections = net.forward()
    end_it = time.time()
    total_time += (end_it - start_it)
    frame_cnt += 1
    # net FPS = frame_cnt / total_time

╔═════════════════════════╦══════════╦══════════════════╦══════════════════╗
║     Object detection    ║ OpenCV 4 ║ OpenCV-OpenVino  ║ OpenCV-OpenVino  ║
║      MobileNet+SSD      ║          ║     (IR FP32)    ║  + NCS2(IR FP16) ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Ubuntu 18 on VirtualBox ║  12 FPS  ║      37 FPS      ║         ?        ║
╠═════════════════════════╬══════════╬══════════════════╬══════════════════╣
║ Raspberry Pi 3 B+       ║  0.6 FPS ║         ?        ║      12 FPS      ║
╚═════════════════════════╩══════════╩══════════════════╩══════════════════╝

The Raspberry Pi + NCS2 is around 20 times faster than the Raspberry Pi alone. I am curious whether there is a bottleneck in my setup (like using USB 2.0 while the NCS2 supports USB 3.0) and whether I can get more performance.

The MobileNet+SSD model for the OpenCV 4 test is the original one from here (deploy version). I then converted it to an IR model with the optimizer command: python3 mo.py --input_model $model_file --data_type "FP32" (or "FP16") --framework caffe

Anyway, all 3 models are available here for download.
Thank you very much

0 Kudos
Dmitry_K_Intel3
Employee
5,815 Views

@ahangari, hamzeh, there is an option in OpenCV to specify which device to use for computations: setPreferableTarget. By default it uses the CPU, so 0.6 FPS is the performance of OpenCV on the CPU. You need to specify the Myriad device.

 

Try this:

net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

 

Note that in my example there is no IR; OpenCV builds the Inference Engine graph internally at runtime.

0 Kudos
hamze60
New Contributor I
5,815 Views

Dmitry Kurtaev (Intel) wrote:

@ahangari, hamzeh, there is an option in OpenCV to specify which device to use for computations: setPreferableTarget. By default it uses the CPU, so 0.6 FPS is the performance of OpenCV on the CPU. You need to specify the Myriad device.

 

Try this:

net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

 

Note that in my example there is no IR; OpenCV builds the Inference Engine graph internally at runtime.

As far as I understand, you are saying there is no need to convert to an IR model; OpenCV-OpenVINO does it internally. I tested what you suggested and it did not change performance (even on Ubuntu 18 on VirtualBox, performance dropped slightly). I think this confirms that my conversion of the original Caffe model to an IR model was correct.
Is there any other suggestion? What do you think about the Raspberry Pi's USB 2.0? Can it be a bottleneck?

 

0 Kudos
Dmitry_K_Intel3
Employee
5,815 Views

Could you please share how many FPS the following two configurations give?

 

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)  

 

and

 

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

 

Is the second one similar to

    net = cv2.dnn.readNetFromCaffe(args["xml"], args["bin"])
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

?     

0 Kudos
hamze60
New Contributor I
5,815 Views

Dmitry Kurtaev (Intel) wrote:

Could you please share how many FPS the following two configurations give?

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)  

and

    net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

Is the second one similar to

    net = cv2.dnn.readNetFromCaffe(args["xml"], args["bin"])
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

?     

Thanks. You have 3 cases; I summarized the results in a table (only on the Raspberry Pi). Your case 3 gives an error (because readNetFromCaffe cannot read the IR format), so I changed readNetFromCaffe to readNet. I ran the demo for around 1 minute to get stable output.

Yes, cases 2 and 3 give the same performance.

╔═══════════════════╦════════════════════════════╦═════════════════════════════╦═══════════════╗
║ Object detection  ║        Your case 1:        ║ Your case 2:                ║ Your Case 3:  ║
║ MobileNet+SSD     ║     no IR (original),      ║ no IR (auto-IR conversion?) ║ IR FP16,      ║
║                   ║       BACKEND_OPENCV,      ║ TARGET_MYRIAD               ║ TARGET_MYRIAD ║
║                   ║         TARGET_CPU         ║                             ║               ║
╠═══════════════════╬══════════╦═════════════════╬═════════════════════════════╩═══════════════╣
║   Which OpenCV?   ║ OpenCV 4 ║ OpenCV-OpenVino ║               OpenCV-OpenVino               ║
╠═══════════════════╬══════════╬═════════════════╬═════════════════════════════╦═══════════════╣
║ Raspberry Pi 3 B+ ║  0.6 FPS ║ 1.4 FPS         ║            12 FPS           ║     12 FPS    ║
╚═══════════════════╩══════════╩═════════════════╩═════════════════════════════╩═══════════════╝

 

0 Kudos
Dmitry_K_Intel3
Employee
5,815 Views

Got it, thank you! Let me check your models later to reproduce these numbers and leave some comments on how you can improve the overall efficiency.

0 Kudos
Dmitry_K_Intel3
Employee
6,618 Views

The thing is that the Raspberry Pi has USB 2.0, so to reduce the data transfer delay you can pass uint8 data instead of float32. With an IR model you may include the preprocessing (scaling and mean subtraction) in the model itself; with the original model you can pass it via setInput.

 

Please try the following code.

import cv2 as cv
import numpy as np
import time

# Load the model
net = cv.dnn.readNet('MobileNetSSD/models/MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD/models/MobileNetSSD_deploy.prototxt')

# Specify target device
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

# Read an image

img = cv.imread('/home/pi/004545.jpg')

# Prepare input blob and perform an inference
blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

# Warmup
out = net.forward()

start = time.time()

numRuns = 100
for _ in range(numRuns):
  net.forward()

print('FPS: ', numRuns / (time.time() - start))

For my Raspberry Pi 2 Model B I can achieve the following performance:

NCS1: 9.78 FPS

NCS2: 19.8 FPS

0 Kudos
hamze60
New Contributor I
5,815 Views

Thanks a lot for following this thread!

I will run your code and report back by tomorrow. I am also thinking about USB 2.0 as the bottleneck, but I am not sure yet. Can you also share the result of running the same code on your own PC + NCS with USB 3.0? Assuming the processing core of the NCS runs at the same speed on all systems, this would reveal how much performance is wasted due to USB 2.0.

 

 

0 Kudos
Dmitry_K_Intel3
Employee
5,815 Views

I got the following numbers using my Ubuntu PC with USB 2.0 and USB 3.0 ports (the code sample is the same as above):

| Hardware |  USB 2.0 |  USB 3.0 | RPI (USB 2.0) |
|----------|----------|----------|---------------|
|   NCS 1  | 10.4 FPS | 10.9 FPS |      9.78 FPS |
|   NCS 2  | 21.1 FPS | 26.5 FPS |      19.8 FPS |

So we lose about one frame per second going from the desktop to the RPi.

We are experimenting with the asynchronous API of Inference Engine now, and the numbers show that without any processing besides inference we can achieve about 5% more FPS for MobileNet-SSD (desktop app, USB 3.0, NCS 1). The best thing about asynchronous invocations is that we can hide data transfer bottlenecks from the resulting FPS. Please keep in touch on this thread and I can share the numbers with the asynchronous API for USB 2.0 on an Ubuntu machine, so we can compare whether it really reduces the difference between USB 2.0 and USB 3.0 for the NCS 2 significantly.

See https://github.com/opencv/opencv/pull/13694 for details.
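As a rough sketch of the idea (method names follow that PR and the dnn samples; the API may still change), you keep a few requests in flight so the USB transfer of one frame overlaps with the inference of another:

import cv2 as cv

net = cv.dnn.readNet('MobileNetSSD/models/MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD/models/MobileNetSSD_deploy.prototxt')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)  # async needs the IE backend
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

frames = [cv.imread('/home/pi/004545.jpg')] * 100  # stand-in for a real frame source

futures = []
for frame in frames:
    blob = cv.dnn.blobFromImage(frame, size=(300, 300), ddepth=cv.CV_8U)
    net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])
    futures.append(net.forwardAsync())  # queue the request without blocking
    if len(futures) > 2:                # keep at most a few requests in flight
        out = futures.pop(0).get()      # block only for the oldest result
for f in futures:
    f.get()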

 

0 Kudos
Reinberger__Thomas
5,815 Views

I also tried out MobileNetSSD performance with the repo found at https://github.com/PINTO0309/MobileNet-SSD-RealSense . Dmitry, I optimized the original code from that repo

blob = cv2.dnn.blobFromImage(color_image, 0.007843, size=(300, 300), mean=(127.5,127.5,127.5), swapRB=False, crop=False)

to

blob = cv2.dnn.blobFromImage(color_image, size=(300, 300), ddepth=cv2.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

and indeed the framerate goes up from about 9 to 15 FPS - BUT ... as can be seen in the output of the object detection demos, the network doesn't predict properly anymore (wrong labels, bounding boxes nearly always filling the entire output screen).

The code I'm using with the PiCam can be found here: https://gist.github.com/treinberger/c63cb84979a4b3fb9b13a2d290482f4e , but the USB Cam code from the repo above is basically the same.

What could be the problem? 

 

0 Kudos
Reinberger__Thomas
5,815 Views

In my previous (not yet moderator-approved) post I described that optimizing performance by offloading mean subtraction and scaling, and by using 8U instead of 32F, didn't work for me (although I am not using the MobileNetSSD Caffe model but MobileNetSSD from TF). I found out that *scaling* via setInput(...) breaks the prediction.

blob = cv2.dnn.blobFromImage(color_image, size=(300, 300), scalefactor = 0.007843, swapRB=False, crop=False, ddepth=cv2.CV_32F)
net.setInput(blob, mean=(127.5, 127.5, 127.5))

works well, whereas this

blob = cv2.dnn.blobFromImage(color_image, size=(300, 300), swapRB=False, crop=False, ddepth=cv2.CV_32F)
net.setInput(blob, scalefactor = 0.007843, mean=(127.5, 127.5, 127.5))

doesn't. So I guess blobFromImage and setInput behave differently with respect to scaling and mean subtraction. Looking into https://github.com/opencv/opencv/blob/master/modules/dnn/src/dnn.cpp, it seems that setInput does scaling first and then mean subtraction:

impl->netInputLayer->scaleFactors[pin.oid] = scalefactor;

impl->netInputLayer->means[pin.oid] = mean;

, whereas blobFromImage does it the other way round:

images -= mean;

images *= scalefactor;

If so, setInput would effectively compute input * scale - mean rather than (input - mean) * scale, which would wreck the input range (see the quick check below).

Can anyone reproduce the problem with the MobileNetSSD Caffe model?
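Here is a quick numpy check of the two orderings (this only illustrates my hypothesis; I have not verified the actual order in the dnn source):

import numpy as np

x = np.array([0.0, 255.0])      # extreme pixel values
scale, mean = 0.007843, 127.5
print((x - mean) * scale)       # blobFromImage order -> [-1.0, 1.0]
print(x * scale - mean)         # suspected setInput order -> [-127.5, -125.5]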

0 Kudos
hamze60
New Contributor I
5,815 Views

Thanks Dmitry!
I confirm that I also got 20 FPS from your benchmark code, on Raspberry+NCS2 (instead of a constant image, I used camera input to be sure that it does not affect performance). But there is a point. Your setting :

blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

does not work for me. The object detector does not find meaningful objects, only junk ones. Can you check it? Before this, without ddepth=cv.CV_8U, I used the setting below, which worked for me, and I got 12-14 FPS.

blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), scalefactor=1.0/127.5, size=(300, 300), mean=[127.5, 127.5, 127.5])
net.setInput(blob)

Thanks

0 Kudos
Dmitry_K_Intel3
Employee
5,815 Views

@ahangari, hamzeh, is this for the IR model or for the Caffe model? If you use the IR model, perhaps you included the preprocessing normalization inside it, so the scalefactor and mean subtraction are not needed.
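In that case (a minimal sketch, assuming the normalization was folded into the IR at conversion time), you would pass the raw pixels only:

# img and net as in the code above
blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob)  # no scalefactor/mean here; the IR already normalizes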

0 Kudos
hamze60
New Contributor I
5,815 Views

@Dmitry Kurtaev

@Reinberger, Thomas

Thanks Dmitry!
I confirm that with the setting below and the original Caffe model (no IR), the RPi + NCS2 worked and I got 20 FPS. Previously, with the IR FP16 model, the object detector behaved strangely and detected meaningless objects.
Generally, since OpenVINO converts models to IR internally anyway, I do not know why we should use the optimizer and IR models at all.

blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

 

0 Kudos
Dmitry_K_Intel3
Employee
5,815 Views

ahangari, hamzeh, actually Model Optimizer supports more frameworks and topologies than OpenCV, so if some model is not supported in OpenCV directly, you may convert it to IR. Moreover, if you load FP16 IR models, the peak memory consumption is lower than when loading an original FP32 model.

0 Kudos
hamze60
New Contributor I
5,815 Views

Hi Dmitry,

I am also going to prepare the same comparison for YOLOv3, which is a heavier model. This can give a better measure of NCS2 performance compared to MobileNet+SSD.

Your previous suggestion, to use the original model directly (this time not Caffe but Darknet), did not work. I then converted it to IR myself, but I still have a problem reading the detection result. I asked a question about it in another thread. I also noticed that I am not the only person who has problems with YOLOv3; for example, see this one.

It would be great if you could help prepare this comparison too.

 

0 Kudos
fu__cfu
Beginner
5,815 Views

Dmitry Kurtaev (Intel) wrote:

The thing is that the Raspberry Pi has USB 2.0, so to reduce the data transfer delay you can pass uint8 data instead of float32. With an IR model you may include the preprocessing (scaling and mean subtraction) in the model itself; with the original model you can pass it via setInput.

 

Please try the following code.

import cv2 as cv
import numpy as np
import time

# Load the model
net = cv.dnn.readNet('MobileNetSSD/models/MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD/models/MobileNetSSD_deploy.prototxt')

# Specify target device
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

# Read an image

img = cv.imread('/home/pi/004545.jpg')

# Prepare input blob and perform an inference
blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])

start = time.time()

numRuns = 100
for _ in range(numRuns):
  net.forward()

print('FPS: ', numRuns / (time.time() - start))

For my Raspberry Pi 2 Model B I can achieve the following performance:

NCS1: 9.78 FPS

NCS2: 19.8 FPS

 

 

Hi,

 

I am able to reproduce a similar FPS (19.8) with the NCS2; however, if I modify the code like this:

import cv2 as cv
import numpy as np
import time

# Load the model
net = cv.dnn.readNet('MobileNetSSD/models/MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD/models/MobileNetSSD_deploy.prototxt')

# Specify target device
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

# Read an image

img = cv.imread('/home/pi/004545.jpg')

start = time.time()
numRuns = 100

for _ in range(numRuns):
  # Prepare an input blob for every frame and run the inference
  blob = cv.dnn.blobFromImage(img, size=(300, 300), ddepth=cv.CV_8U)
  net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])
  net.forward()

print('FPS: ', numRuns / (time.time() - start))

 

Since I would like to process each frame of a video, the FPS drops to 6.7. Any suggestions?

0 Kudos
Kulecz__Walter
New Contributor I
5,072 Views

For my Raspberry Pi 2 Model B I can achieve the following performance:

NCS1: 9.78 FPS

NCS2: 19.8 FPS

I am able to reproduce the result (19.8 FPS) with the NCS2. However, if I reload the blob in every loop (because I process frames from a video), the FPS drops to 6.7. Any suggestions?

There are frame rates for bragging rights, and then there are real frame rates that include all the overhead needed to actually do something useful.

With multi-threaded code I'm able to get ~8.3 FPS on a Pi 3 B+ with the NCS2 and OpenVINO, sampling 5 ONVIF netcams with "real-time" monitoring on the attached monitor.

Basically one thread per camera, where each camera writes to its own queue. Another thread reads each queue in sequence and does the inference, writing the output to a sixth queue. The main program (thread) reads this output queue and takes whatever action is required (see the rough sketch below).
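Roughly, the thread/queue layout looks like this (a simplified sketch, not my actual code; the camera URLs and model paths are placeholders):

import threading, queue
import cv2 as cv

NUM_CAMS = 5
CAM_URLS = ['rtsp://cam%d/stream' % i for i in range(NUM_CAMS)]  # placeholders

net = cv.dnn.readNet('MobileNetSSD_deploy.caffemodel',
                     'MobileNetSSD_deploy.prototxt')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

cam_queues = [queue.Queue(maxsize=1) for _ in range(NUM_CAMS)]
out_queue = queue.Queue()

def camera_thread(url, q):
    # One grabber per camera; each writes only to its own queue.
    cap = cv.VideoCapture(url)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if q.full():
            try:
                q.get_nowait()  # drop the stale frame, keep only the newest
            except queue.Empty:
                pass
        q.put(frame)

def inference_thread():
    # Single consumer: round-robin over the camera queues, one inference at a time.
    while True:
        for q in cam_queues:
            try:
                frame = q.get_nowait()
            except queue.Empty:
                continue
            blob = cv.dnn.blobFromImage(frame, size=(300, 300), ddepth=cv.CV_8U)
            net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5, 127.5, 127.5])
            out_queue.put((frame, net.forward()))

for url, q in zip(CAM_URLS, cam_queues):
    threading.Thread(target=camera_thread, args=(url, q), daemon=True).start()
threading.Thread(target=inference_thread, daemon=True).start()

# The main thread reads out_queue and takes whatever action is required.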

The same code on a faster Odroid XU-4 (I hacked setupvars.sh to get it installed) gets about ~15 FPS.

OTOH, the same code with the CPU target and no NCS on an i5-4200U gets ~21 FPS. Using the NCS2 and the MYRIAD target gets ~22 FPS.

This suggests that in real usage the main bottleneck is not the actual inference but all the overhead of getting the data in and out and acting on the inference results.

 

 

0 Kudos
Reply