Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

The count setting of streams and requests for NPU

JohnnyWang1992
Beginner

We want to check the count settings of streams and requests for each device in multi-threaded execution (refer to: Object Detection C++ Demo — OpenVINO™ documentation, version 2023.2).

For GPU, the number of requests can be set to 1, 2, or 4.

For NPU, the request count can only be set to 1; when we set it to 2, only one request actually works.

 

For further investigation, we used get_property to query the stream and request information, but we could not retrieve the needed information for NPU.
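The query we used looks roughly like this (a minimal C++ sketch; the model path is a placeholder, and the try/catch is there because the NPU plugin may not report a stream count at all):

#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled = core.compile_model("model.xml", "NPU");  // placeholder model path

    // Request count the plugin considers optimal for this compiled model.
    uint32_t nireq = compiled.get_property(ov::optimal_number_of_infer_requests);
    std::cout << "optimal_number_of_infer_requests: " << nireq << "\n";

    // Stream count; this property may be unsupported on NPU.
    try {
        auto streams = compiled.get_property(ov::num_streams);
        std::cout << "num_streams: " << streams.num << "\n";
    } catch (const ov::Exception& e) {
        std::cout << "num_streams not reported: " << e.what() << "\n";
    }
    return 0;
}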

Megat_Intel
Moderator

Hi JohnnyWang1992,

Thank you for reaching out to us.

 

We are checking with the relevant team regarding this issue. We will get back to you once we receive any feedback from them. Thank you for your patience.

 

 

Regards,

Megat


Haarika
Moderator

Hello JohnnyWang1992,


  • The inference execution through the NPU plugin is entirely offloaded to the NPU device; no processing occurs on the CPU.
  • The MTL firmware does not support real HW concurrency (executing multiple inferences in parallel on different tiles).

Because of the above, the NPU plugin forces NPU_STREAMS=1.

However, multiple inference requests can still be triggered concurrently to improve the throughput of the application.

The recommendation for this is to create multiple inference requests from the application. 

The object detection demo supports the "-nireq" argument: "Optional. Number of infer requests. If this option is omitted, number of infer requests is determined automatically." (https://github.com/openvinotoolkit/open_model_zoo/blob/master/demos/object_detection_demo/cpp/main.cpp#L403)

As mentioned in the source code, the application already queries the NPU plugin for the optimal_number_of_infer_requests. By default, the NPU plugin attempts to minimize latency and returns "1", but this behavior can be modified through the ov::hint::performance_mode property. When set to THROUGHPUT instead of LATENCY (the default), the plugin returns "4" as the optimal number. Applications might not expose this configuration on the command line, however, so a few changes in the source code might be required.

Can you please try the below recommendations (a sketch of the second one follows the list):

  • Explicit nireq=1, 2, 4 provided as an argument to the demo
  • ov::hint::performance_mode=THROUGHPUT provided in the config to core.compile_model()
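
A minimal sketch of the second recommendation, assuming a placeholder model path (this illustrates the API calls involved; it is not the demo's actual source):

#include <iostream>
#include <vector>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // Compile for NPU with the THROUGHPUT hint instead of the default LATENCY.
    auto compiled = core.compile_model("model.xml", "NPU",  // placeholder model path
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // With THROUGHPUT set, the plugin should report more than one optimal request.
    uint32_t nireq = compiled.get_property(ov::optimal_number_of_infer_requests);
    std::cout << "optimal nireq: " << nireq << "\n";

    // Create that many requests and run them asynchronously; in real code the
    // input tensors would be filled before start_async().
    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < nireq; ++i)
        requests.push_back(compiled.create_infer_request());
    for (auto& req : requests)
        req.start_async();
    for (auto& req : requests)
        req.wait();
    return 0;
}

The demo itself can then be run with -nireq set explicitly (e.g. -nireq 4) to override the automatic choice.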

Please let me know your results.


Thanks

Haarika M


JohnnyWang1992
Beginner

We have tried multiple requests for NPU. If we set nireq to 1, all of the results are correct, but once we set it to more than 1 (e.g. 2), only half of the results are correct and the others are wrong. We suspect that the dispatching of NPU requests has gone wrong somewhere.

 

We also tried to query the related attribute "optimal_number_of_infer_requests" in AUTO device mode; the result is always 1, even when we switch between LATENCY and THROUGHPUT. We would like to know the recommended stream and request counts for NPU and GPU in throughput mode. For example, on GPU, does one stream correspond to 2 requests?
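
For reference, the query we ran is roughly the following (a minimal sketch with a placeholder model path):

#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // AUTO device with the THROUGHPUT hint; the reported value is still "1".
    auto compiled = core.compile_model("model.xml", "AUTO",  // placeholder model path
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    std::cout << compiled.get_property(ov::optimal_number_of_infer_requests) << "\n";
    return 0;
}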

Megat_Intel
Moderator

Hi JohnnyWang1992,

To investigate this issue further, could you please provide us with more information?

 

What model did you use when running the Object Detection C++ Demo, and could you provide us with the results you received that show only half of them being correct?

 

 

Regards,

Megat


Megat_Intel
Moderator

Hi JohnnyWang1992,

Thank you for your question. If you need any additional information from Intel, please submit a new question as this thread is no longer being monitored.

 

 

Regards,

Megat

