SashaBelykh
Beginner
210 Views

NCS2 issues: segfault, stuck, low speed, lack of resource usage statistics

Hello, engineering experts!

We have an installation with 4 NCS2 sticks connected to the USB 3.0 ports of an industrial PC running Ubuntu. I have modified object_detection_demo_yolov3_async and fed it our trained YOLOv3 network. Its .bin file is 123,091,790 bytes, with an 832x832 px input, and it looks like more than one network could be loaded onto a single Movidius MA2485, since it has 4 Gbit of in-package LPDDR4 memory; couldn't it?

In any case, it would be very useful to be able to monitor the NCS2's resource usage.

At the moment I'm calling plugin.LoadNetwork, and each call takes about 90 seconds (why so long?!). Frequently, on the 4th attempt, it hangs for a while before throwing an exception (example_log1.txt). Even when all four attempts finish successfully, the plugin sometimes does not throw an exception but, while apparently doing some kind of resource-usage estimation, gets stuck trying to ping the device (example_log2.txt).

In the first case, I use the successfully loaded requests to run inference asynchronously, and they work fine, but still too slowly.

In the second scenario, however, the application hangs forever and can only be terminated by sending Ctrl+C in the terminal.

Also, the plugin sometimes crashes with a segmentation fault (example_log_3.txt, with gdb output).

Given a plugin that takes about 90 seconds to load 123 MB of weights onto one NCS2, and 4 NCS2 devices, we want inference to start before all network instances have finished loading, i.e. on each NCS2 as soon as it loads successfully.

The most obvious approach is to call InferencePlugin.LoadNetwork asynchronously. So my question for the Intel developers is: do you have, or plan to provide, such an implementation?

Kind regards,
Aleksandr.

8 Replies
Shubha_R_Intel
Employee

Dearest Aleksandr: 

First, thanks for your patience!

  1. There is a known stability issue when more than 2 sticks are connected at the same time.
  2. YOLO v3 in R5 is not performance-optimized (performance is still not bad, as the network is quite heavy).
  3. The 90 seconds in LoadNetwork are not spent loading the network but on model optimization for the VPU. This time should also improve in the next release.

For an “async LoadNetwork”, calling LoadNetwork from different threads should do the job.
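As a minimal sketch of that pattern: issue the slow per-device LoadNetwork calls from worker threads and handle each device as soon as its own load completes. Note that `load_network` below is a placeholder standing in for the real blocking Inference Engine call, not the actual API.

```python
# Sketch: overlap slow per-device LoadNetwork calls using threads, so
# inference can begin on each stick as soon as its own load finishes.
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def load_network(device_id):
    """Placeholder for the real blocking per-device LoadNetwork call."""
    time.sleep(0.05 * device_id)         # stands in for the ~90 s load
    return f"exec_net_on_MYRIAD.{device_id}"

devices = [1, 2, 3, 4]
ready = []
with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    futures = {pool.submit(load_network, d): d for d in devices}
    for fut in as_completed(futures):    # yields devices as they finish
        exec_net = fut.result()
        ready.append(exec_net)           # inference could start here
```

Each executable network becomes usable the moment its future resolves, so the first stick can begin inferring while the others are still compiling.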

SashaBelykh
Beginner

Dear Shubha, thank you for your answer!

Your reasonable answers do, however, lead to the following clarification questions.

1. I hope these will be fixed in the next releases. By the way, have you had a chance to see such issues with the IEI Mustang-V100 accelerator board, which bears 8 Myriad X chips?

2. Could you please provide a short overview of the object detection networks currently best optimized for the VPU?

3. This sounds odd, because the TensorFlow model is already "model-optimized" into the ".bin/.xml" output. Although I understand that the most valuable property of the intermediate representation is that it is universal across all plugins, I would prefer to have the option to perform the VPU optimization you mention offline, and to prepare a "post-intermediate representation" ready to be loaded onto the VPU, in order to avoid unwelcome delays before inference starts.

4. Let me intrusively follow up on my question about NCS resource usage =) Do the developers plan to provide that kind of statistics? As I understand it, they could be obtained even from the VPU-optimized model discussed in the previous point. The questions could be:
- how many SHAVEs are utilized by the network of interest?
- how much internal memory is consumed by its weights and inter-layer results?
- how much USB throughput would be used by transfers between the VPU and CPU during inference?
etc.
At the very least, I want to know how many instances of the network can be loaded onto the available Myriad devices and run simultaneously.

Best regards!

Shubha_R_Intel
Employee

Dearest Aleksandr. Answers inline.

1. I hope these will be fixed in the next releases. By the way, have you had a chance to see such issues with the IEI Mustang-V100 accelerator board, which bears 8 Myriad X chips?

There are two ways to work with that card: with the MyriadPlugin and with the HDDLPlugin. The issue affects the MyriadPlugin only. The HDDLPlugin is a software solution designed specifically for that card (and for similar cards from other vendors).

2. Could you please provide a short overview of the object detection networks currently best optimized for the VPU?

Unfortunately, we don’t have published performance numbers, so there is no clear guidance. In general, what matters most for performance is the size of the intermediate activation tensors: the smaller they are, the better the network fits the hardware. As for detection heads, SSD is better tested than the others.

3. This sounds odd, because the TensorFlow model is already "model-optimized" into the ".bin/.xml" output. Although I understand that the most valuable property of the intermediate representation is that it is universal across all plugins, I would prefer to have the option to perform the VPU optimization you mention offline, and to prepare a "post-intermediate representation" ready to be loaded onto the VPU, in order to avoid unwelcome delays before inference starts.

There is an export()/import() API for that.
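In outline, that API enables a compile-once, import-later caching pattern: persist the VPU-ready blob on the first run, then skip the long optimization step on later runs. Here is a minimal sketch of the pattern; `compile_network` and `import_network` are placeholders for the real export/import calls, not the actual IE API.

```python
# Sketch of the export/import caching pattern: compile (optimize) the IR
# for the VPU once, save the resulting blob, and import it on later runs.
import os
import tempfile

def compile_network(ir_path):
    """Placeholder for the slow compile step (IR -> VPU-ready blob)."""
    return b"vpu-blob-for:" + ir_path.encode()

def import_network(blob_bytes):
    """Placeholder for importing a previously exported blob."""
    return {"exec_net": blob_bytes}

def load_with_cache(ir_path, blob_path):
    # Fast path: a blob exported on an earlier run is imported directly,
    # skipping the long device-specific optimization.
    if os.path.exists(blob_path):
        with open(blob_path, "rb") as f:
            return import_network(f.read())
    # Slow path: compile once, then persist the blob for future runs.
    blob = compile_network(ir_path)
    with open(blob_path, "wb") as f:
        f.write(blob)
    return import_network(blob)

blob_file = os.path.join(tempfile.mkdtemp(), "yolov3.blob")
net1 = load_with_cache("yolov3.xml", blob_file)  # compiles and exports
net2 = load_with_cache("yolov3.xml", blob_file)  # imports the cached blob
```

The second call returns the same executable network without paying the compile time again, which is exactly the "post-intermediate representation" workflow being asked about.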

4. Let me intrusively follow up on my question about NCS resource usage =) Do the developers plan to provide that kind of statistics? As I understand it, they could be obtained even from the VPU-optimized model discussed in the previous point. The questions could be:
- how many SHAVEs are utilized by the network of interest?
- how much internal memory is consumed by its weights and inter-layer results?
- how much USB throughput would be used by transfers between the VPU and CPU during inference?
etc.

The problem with SHAVE/memory usage information is that there is almost nothing a user can do based on it (for example, even if you somehow hacked the implementation to use fewer SHAVEs, the freed SHAVEs would still stay unused). So there is no plan to expose much.

At the very least, I want to know how many instances of the network can be loaded onto the available Myriad devices and run simultaneously.

Right now the limit is up to 10 models, or the device memory capacity. 2 inference threads can run simultaneously on a single device (with the same or different models). At the application level, 3-5 concurrent inference tasks per Myriad X device are recommended, so that USB transfers overlap with inference. For small models (>100 fps), batching can give a significant fps increase.
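The sizing guidance above can be sketched as a tiny helper: pick a per-device request count within the recommended 3-5 range and assign request slots to devices round-robin. All names and numbers here are illustrative, not part of the IE API.

```python
# Sketch: size the pool of in-flight async requests per the guidance
# above (3-5 concurrent requests per Myriad X device, so USB transfers
# overlap with on-device inference).
REQUESTS_PER_DEVICE = 4  # within the recommended 3-5 range

def request_pool(num_devices, per_device=REQUESTS_PER_DEVICE):
    """Assign each request slot to a device in round-robin order."""
    total = num_devices * per_device
    return [f"MYRIAD.{i % num_devices}" for i in range(total)]

pool = request_pool(num_devices=4)   # 16 request slots across 4 sticks
```

With 4 sticks this yields 16 in-flight requests, 4 per device, which keeps each VPU busy while its previous results are crossing the USB bus.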

Best regards!

SashaBelykh
Beginner

Dear Shubha, thank you very much for such a detailed answer!

Here is what I am going to do:

1. I will receive the Mustang-V100 accelerator boards, try them with the MyriadPlugin and HDDLPlugin you mentioned, and give you feedback.

2. I will get a new Linux workstation with the OpenVINO distribution, try the export()/import() API, and also give you feedback.

Best regards!

SashaBelykh
Beginner

Dear Shubha, I have some news.

I've received two Mustang-V100 accelerator cards (8 VPUs each), updated the OpenVINO distribution, and ran my code on them. In general, I obtained almost the same results:

- 100+ seconds for initialization after plugin.LoadNetwork() is called;
- 3.8 seconds per image (832x832 px, YOLOv3).

I have also tried the network.Export() and plugin.ImportNetwork() API, but the imported network configuration fails to detect anything in the images. It seems that the exported model's structure differs from the original; I tried to modify my code to match it, but still without success. I import the network into the plugin and read the input and output info from the imported model, and they match my expectations. Then I run inference, and something does run, taking the same time as when I load the network from the IR directly via CNNNetReader, but there are no results at all.

Another issue is related to the accelerator cards and their enumeration. Namely, I import the model from the exported file one instance at a time, in order to run 16 parallel inferences, but it fails after 8 networks are loaded, with a message about a duplicate ID. Possibly this is due to using the MyriadPlugin instead of the HDDLPlugin you mentioned. The address switches on the accelerators are in different positions, and the 7-segment LED indicators on the backs of the accelerators show 0 and 1. A command-line log is attached.

Summary:

1. Where can I find a detailed guide on using the Export/Import API, particularly on interpreting the imported model and preparing the plugin, blobs, and requests based on its characteristics?

2. Is there a way to manually select the HDDLPlugin if I want to run inference on two or more accelerators?

SashaBelykh
Beginner

UPDATE:
I was confused by the download links and accidentally had not updated my OpenVINO.

Now that I have definitely updated OpenVINO to the 2019 R1.0.1 version, I get the following results with the previously described code:

- plugin.LoadNetwork() loads the YOLOv3 model with 832x832 px input in 20.5 seconds, which is 5 times faster than before. Cool!
- inference with YOLOv3 takes 2600 ms vs. 3800 ms on the previous release. Much better too!

 

Shubha_R_Intel
Employee

Dear Belykh, Aleksandr,

I am thrilled to hear this! Thanks for reporting back about your success. It helps the OpenVINO community tremendously when customers report back on their trials and tribulations, their successes and their failures.

Thanks for using OpenVINO!

Shubha

 

SashaBelykh
Beginner

Hello Shubha, it's me again.

I still have an issue with the 2 IEI Mustang-V100-MX8 accelerators, each with 8 Myriad X VPUs.

I have set the address switches on the accelerators to different numbers: 0 and 3 at the moment.

After successful USB enumeration, I can see 16 VPUs in the Windows PC Device Manager.

Then I start the code described in this topic, configured for 8 to 12 instances of the YOloV3 network with 832x832 px input. On each successive attempt, a Movidius Myriad X device reconnects as a VSC Loopback Device, one device per instance. Inference works fine.
But when I try to call plugin.LoadNetwork() more times, the plugin reports errors. At the moment it usually fails after 12 successfully loaded instances, while loading the 13th.

Logs attached.

 
