Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

NCS2 x4 + MultiProcess + Core i7 + YoloV3, Boosted to about 13 FPS (A little slow)

Hyodo__Katsuya
Innovator
982 Views
Hello, everyone. I tried implementing NCS2 + MultiProcess + YoloV3. I would be glad if it is helpful to everyone.

YoloV3 (asynchronous), NCS2 x4 ---> 4 FPS

- Github: https://github.com/PINTO0309/OpenVINO-YoloV3.git (openvino_yolov3_MultiStick_test.py)
- Youtube: https://youtu.be/3mKkPXpIc_U
14 Replies
Yuanyuan_L_Intel
Employee

Hi, Hyodo, Katsuya

I took a look at your multiple-stick sample, openvino_yolov3_MultiStick_test.py. The app does not utilize all of the multiple NCS2 sticks. It uses the async API and multiple infer requests, which is good. But those infer requests are not scheduled across multiple sticks; the performance gain from multiple infer requests comes from hiding the data-transfer cost. Only one ExecutableNetwork instance is created, so only one NCS2 device is used. You can confirm this by monitoring the FPS with only one NCS2 plugged in. If you want to make use of multiple NCS2 devices, multiple ExecutableNetwork instances need to be created.
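The scheduling idea behind this advice can be pictured independently of the OpenVINO API: keep one executable-network handle per stick and round-robin incoming frames across them. Below is a minimal pure-Python sketch; `load_on_device` is a hypothetical stub standing in for an OpenVINO `load_network` call targeting one MYRIAD device, not real API:

```python
from itertools import cycle

def load_on_device(device_id):
    """Stub: creates one 'executable network' handle per stick.
    In the real app this would be an OpenVINO load call per device."""
    return {"device": device_id, "frames": []}

def dispatch(frames, exec_nets):
    """Round-robin frames across the per-device handles, so every
    stick gets work instead of only the first one."""
    rr = cycle(exec_nets)
    for f in frames:
        next(rr)["frames"].append(f)
    return exec_nets

nets = [load_on_device(i) for i in range(4)]  # one handle per NCS2 stick
dispatch(list(range(8)), nets)
# each of the 4 handles now holds 2 of the 8 frames
```

The key point this models: with a single handle all 8 frames would queue on one device; with four handles the work spreads evenly.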

 

Hyodo__Katsuya
Innovator
@Yuanyuan L. (Intel) Thank you for always giving me precise advice. It seems I made a big mistake. I will immediately create multiple "ExecutableNetwork" instances and rework the app so they share a Queue.
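The "multiple ExecutableNetwork instances sharing a Queue" plan can be sketched in plain Python: one worker thread per stick, all pulling frames from a shared queue. This is a minimal sketch, not the actual script; the inference call is replaced by a stand-in that records which worker handled which frame:

```python
import queue
import threading

def worker(stick_id, frame_q, results):
    # Each thread would own its own ExecutableNetwork in the real app;
    # here the "inference" just records (stick, frame).
    while True:
        frame = frame_q.get()
        if frame is None:          # sentinel: no more work
            break
        results.append((stick_id, frame))

frame_q = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(i, frame_q, results))
           for i in range(4)]      # 4 workers, one per NCS2 stick
for t in threads:
    t.start()
for f in range(16):                # feed 16 frames into the shared queue
    frame_q.put(f)
for _ in threads:
    frame_q.put(None)              # one sentinel per worker
for t in threads:
    t.join()
```

Whichever worker is idle picks up the next frame, so fast sticks are never starved while a slow one is busy.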
Hyodo__Katsuya
Innovator
@Yuanyuan L. (Intel) Thanks to you, I got about 4 times better performance. However, the timing at which inference results are displayed is uneven, so the output does not look smooth. I will try to adjust it a little more.
Hyodo__Katsuya
Innovator
Hello. The following performance is the best I could achieve: full-size YoloV3, NCS2 x4, boosted to about 13 FPS.

- Github: https://github.com/PINTO0309/OpenVINO-YoloV3/blob/master/openvino_yolov3_MultiStick_test.py
- Youtube: https://youtu.be/AT75LBIOAck
RTasa
New Contributor I
I have a question for you. Are you trying to get an inference for every video frame coming in at 25 or 30 FPS? What would happen if you evaluated 1/2 or 1/3 of the frames? In the application you could display every frame but only run inference on every other or every third frame coming in. Would playback then run at full speed?
Hyodo__Katsuya
Innovator
@Bob T.
>Are you trying to get an inference for every video frame coming in at 25 or 30 FPS?
No. However, the first posted video was doing what you describe.
>What would happen if you evaluated 1/2 or 1/3 of the frames?
The latest version skips a fixed number of frames between inferences.
>In the application you could display every frame but only get the inference for every other or every third frame coming in.
That is exactly the improvement I made. The dance shown, capoeira, involves very slow movements, so the video looks like it is playing slowly, but it runs at normal speed. Video playback and inference are performed asynchronously, and the playback situation is as follows:
Movie playback = every frame (30 FPS)
Inference = a subset of frames (13 FPS)
Inference time per frame = about 600 ms - 800 ms
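The display-every-frame, infer-every-Nth-frame pattern described above can be sketched as a small scheduling loop (a hypothetical helper, not the thread's actual script; the frame counts and skip interval are illustrative):

```python
def schedule(total_frames, infer_every=3):
    """Display every frame, but submit only every `infer_every`-th
    frame to the inference queue, so playback stays at full speed."""
    displayed, inferred = [], []
    for i in range(total_frames):
        displayed.append(i)            # always shown
        if i % infer_every == 0:
            inferred.append(i)         # only these hit the sticks
    return displayed, inferred

# one second of 30 FPS video, inferring every third frame
displayed, inferred = schedule(30, infer_every=3)
# 30 frames are displayed, but only 10 are sent for inference
```

Because the sticks only see a third of the frames, a 10 FPS inference rate is enough to keep up with 30 FPS playback, and the detections from the last inferred frame can be overlaid on the intervening frames.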
RTasa
New Contributor I
Inference time per frame = about 600 ms - 800 ms? Something doesn't seem right. That is less than 2 FPS per stick. Even with 4 sticks, that would be less than 8 FPS of total inference. Hey Intel, why is inference so slow on these Movidius hardware devices?
Hyodo__Katsuya
Innovator
@Bob T.
>That is less than 2 FPS per stick.
>Even with 4 sticks that would be less than 8 FPS of total inference.
Your reasoning is correct. To compensate for the inference latency, I pushed parallel inference to the limit: 4 sticks (4 threads) × 4 requests = 16 parallel inferences. Python multithreading is difficult to make fully asynchronous because of the Global Interpreter Lock (GIL), so performance does not simply scale 16×; thread-switching overhead is significant. Ideally I would implement it entirely with MultiProcess, but the OpenVINO API does not appear to support multiprocessing.
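A back-of-the-envelope check ties together the numbers quoted in this thread; taking 0.7 s as the midpoint of the reported 600-800 ms latency range is my assumption:

```python
# 4 sticks x 4 in-flight requests, each inference taking ~0.6-0.8 s.
in_flight = 4 * 4                    # 16 parallel inferences
latency_s = 0.7                      # assumed midpoint of 600-800 ms
ideal_fps = in_flight / latency_s    # ~22.9 FPS if overlap were perfect
observed_fps = 13                    # the thread's reported figure
efficiency = observed_fps / ideal_fps  # ~0.57, the rest lost to GIL
                                       # and thread-switching overhead
```

So 13 FPS is consistent with roughly 57% parallel efficiency: the sticks themselves are slow per frame, and the Python-side overhead eats the remaining gap to the ideal pipelined throughput.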
RTasa
New Contributor I
This is where C++ works so much better: multithreading, messaging, and controlling an input and output queue. If I can get OpenVINO installed on my Atom boards, I will see what they can do. So far all I get is an error.
Hyodo__Katsuya
Innovator
@Bob T. Regrettably, I can hardly write C++ programs...
>So far all I get is an error.
What kind of error is displayed, and at what point?
RTasa
New Contributor I
When it's doing the check to see if everything is installed, it looks for http://packages.ros.org/ubuntu xenial, I think, and it fails with a 404 looking for amd64. I am not on the machine and have not looked closely.
Peniak__Martin
Beginner

Seems too slow indeed. I am getting around 40 FPS (MobileNet-SSD) on a mini-PCIe Myriad X card plugged into an UP board:

https://www.timeless.ninja/blog/the-world-s-first-ai-edge-camera-powered-by-up-squared-and-three-intel-myriad-x-vpus

and around 20 FPS (MobileNet-SSD) on a RPi with an NCS2. I run two models, and one is slower, so the total FPS right now, running both models in parallel, is around 12 FPS, I think; that's two models on two sticks. If I update the slower model, the speed should be near 20 FPS for both models, but right now the inference results are delayed by the slower model.

https://timeless.ninja/blog/the-world-s-first-ai-edge-camera-powered-by-two-intel-myriad-x-vpus

 

Hope this helps

RTasa
New Contributor I
The Pi is extremely bottlenecked unless you have a fix for that. Using the original Movidius SDK with 2 NCS sticks delivers maybe 8 to 12 FPS. I am not sure how you are getting 20 on a Pi. On the UP board (not UP2), using a single NCS2 stick, I am getting better-than-Pi performance. I will post later.