Dominik
Beginner

NCS2 Parallel Networks on one Device are Serialized?


Hi everyone,

I successfully created a network that I can infer on the NCS2. Now I want to speed up the inference by using 5 networks in parallel on 1 NCS2, where each network should use 4 inference requests.

I do this the following way:

  1. I spawn 5 threads where each creates an InferenceEngine::ExecutableNetwork for the MYRIAD device. This should load 5 networks onto the NCS2.
  2. In each of those threads I spawn 4 additional threads that each create a synchronous inference request (see the sketch after this list).
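
In case it helps, here is a stripped-down sketch of what I do. The model path, blob handling and error handling are just placeholders, not my actual code:

#include <inference_engine.hpp>
#include <thread>
#include <vector>

int main() {
    InferenceEngine::Core core;
    // Placeholder model path.
    auto network = core.ReadNetwork("model.xml");

    std::vector<std::thread> netThreads;
    for (int n = 0; n < 5; ++n) {
        netThreads.emplace_back([&core, &network]() {
            // Step 1: one ExecutableNetwork per thread, all targeting the same stick.
            InferenceEngine::ExecutableNetwork exec =
                core.LoadNetwork(network, "MYRIAD");

            // Step 2: four threads per network, each with its own synchronous request.
            std::vector<std::thread> reqThreads;
            for (int r = 0; r < 4; ++r) {
                reqThreads.emplace_back([&exec]() {
                    InferenceEngine::InferRequest request = exec.CreateInferRequest();
                    // ... fill input blobs here ...
                    request.Infer();  // blocking (synchronous) call
                    // ... read output blobs here ...
                });
            }
            for (auto &t : reqThreads) t.join();
        });
    }
    for (auto &t : netThreads) t.join();
    return 0;
}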

Everything runs without errors and the output image is also correct, but I don't see any speedup. When I measure the computation time it is almost the same (140 seconds ± 1 s) as when I run only one network. It looks like only one network and one inference request is doing all the work.

Does anyone have similar issues? Or is there maybe something wrong with my architecture?

Best Regards

Dominik


4 Replies
Rizal_Intel
Moderator

Hi Dominik,

 

There seems to be no problem with your architecture based on your explanation.

The only requirement is to parallelize the workload as much as possible.

 

Actually, there is only a single Movidius Myriad X chip on the NCS2, so inference calls are queued and executed on that single chip.

 

You could compare your parallel implementation with the crossroad camera demo and the action recognition demo.

 

To get an increase in performance you would need multiple NCS2 sticks (or an Intel® Vision Accelerator Design). There is an example created by Victor Li that you can use as a reference for utilising multiple NCS2 sticks; it is based on the concept of MYRIAD device allocation.
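
As a rough illustration of that device-allocation concept (this is only a sketch, not Victor Li's example; the model path is a placeholder), each plugged-in stick is exposed as its own MYRIAD device that a network can be loaded onto explicitly:

#include <inference_engine.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");  // placeholder model

    // Each plugged-in stick shows up as its own device, e.g. "MYRIAD.1.2-ma2480".
    std::vector<std::string> myriads;
    for (const auto &dev : core.GetAvailableDevices()) {
        if (dev.find("MYRIAD") == 0) myriads.push_back(dev);
    }

    // Load one ExecutableNetwork per physical stick so the work is spread
    // across chips instead of being queued on a single one.
    std::vector<InferenceEngine::ExecutableNetwork> execs;
    for (const auto &dev : myriads) {
        std::cout << "Loading network on " << dev << std::endl;
        execs.push_back(core.LoadNetwork(network, dev));
    }

    // Each ExecutableNetwork can then serve its own inference requests.
    for (auto &exec : execs) {
        auto request = exec.CreateInferRequest();
        // ... fill inputs, request.Infer(), read outputs ...
    }
    return 0;
}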

 

Regards,

Rizal



Dominik
Beginner

Thank you for your feedback. The time difference was so small that I didn't notice it at first, but now it works fine.

BR
Dominik

Rizal_Intel
Moderator

Hi Dominik,


Do you need any additional information?


Regards,

Rizal


Rizal_Intel
Moderator

Hi Dominik,


Intel will no longer monitor this thread since we have provided a solution. If you need any additional information from Intel, please submit a new question.


Regards,

Rizal

