infer_request.Infer() runtime increases with multiple instances

Albi_KA · ‎06-29-2021

I use the C++-API with the InferenceEngine.

In my project there is a class that has the following member variables:

InferenceEngine::Core core;
InferenceEngine::CNNNetwork network;
InferenceEngine::InputInfo::Ptr input_info;
InferenceEngine::DataPtr output_info;
InferenceEngine::ExecutableNetwork executable_network;
InferenceEngine::InferRequest infer_request;

In the class there is a function which does the inference:

...
infer_request.Infer();
...

Everything works fine, but when I have more than one instance of the class, the runtime of Infer() increases. What is the reason for this?

Every instance has its own core, network, etc. so I thought the runtime should be constant.

Vladimir_Dudnik · ‎06-29-2021

@Albi_KA you would usually prefer to have a single InferenceEngine::Core object, even if you run inference on several models. Several InferenceEngine::Core objects would work too; it is just not necessary. If you run inference on several models simultaneously on a single device (for example, the CPU), the runtime may increase because by default the Inference Engine tries to use the full compute capacity of the device. You can control this by adjusting the number of streams and the number of threads that you allocate for inference of each model.
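As a rough illustration of the advice above, here is a minimal sketch of a per-model CPU configuration. The helper name make_cpu_config is hypothetical, and the key strings are written out literally on the assumption that they match InferenceEngine::PluginConfigParams::KEY_CPU_THREADS_NUM and KEY_CPU_THROUGHPUT_STREAMS; the thread and stream counts are placeholders to tune, not recommended values.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: builds a CPU plugin config that limits one model to
// `threads` inference threads and `streams` throughput streams.
// Assumption: the literal keys equal the PluginConfigParams KEY_* constants.
std::map<std::string, std::string> make_cpu_config(int threads, int streams) {
    assert(threads >= 0 && streams >= 1);  // 0 threads means "let the plugin decide"
    return {
        {"CPU_THREADS_NUM", std::to_string(threads)},
        {"CPU_THROUGHPUT_STREAMS", std::to_string(streams)},
    };
}

// Intended use with one shared Core (shown as a comment, needs OpenVINO):
//   InferenceEngine::Core core;
//   auto exec_a = core.LoadNetwork(network_a, "CPU", make_cpu_config(4, 1));
//   auto exec_b = core.LoadNetwork(network_b, "CPU", make_cpu_config(4, 1));
```

Each model gets its own ExecutableNetwork and InferRequest, while the Core object is shared.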

Albi_KA · ‎06-29-2021

I use the CPU. I have 16 physical cores, 32 logical cores, and 1 socket. I run about 10 inferences of several models simultaneously.

I tried the following CPU configuration:

const std::map<std::string, std::string> config =
{ { InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, InferenceEngine::PluginConfigParams::CPU_THROUGHPUT_AUTO},
  { InferenceEngine::PluginConfigParams::KEY_CPU_THREADS_NUM, "0"}, //0: all logical cores
  { InferenceEngine::PluginConfigParams::KEY_CPU_BIND_THREAD, "NUMA"} };

Unfortunately, there is no significant improvement compared with using no configuration settings.

Which configuration settings should I use to improve the Infer() call?

Vladimir_Dudnik · ‎06-30-2021

@Albi_KA do you use this config for all the models that you run simultaneously? If so, each model tries to use all logical cores and an autodetected number of streams, which is optimal only for single-model inference. I would try a smaller number of threads per model.
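One possible starting point for "a smaller number of threads per model" is an even split of the logical cores across the models that run at the same time. The function below is only a sketch of that arithmetic; the name threads_per_model is made up for illustration, and an even split is an assumption that may not be optimal for models of different sizes.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: divides the logical cores evenly between the models
// that run concurrently, never returning fewer than one thread per model.
unsigned threads_per_model(unsigned logical_cores, unsigned num_models) {
    if (num_models == 0) {
        return std::max(1u, logical_cores);  // nothing to share with
    }
    return std::max(1u, logical_cores / num_models);
}

// Formats the budget as the string value expected for KEY_CPU_THREADS_NUM.
std::string threads_num_value(unsigned logical_cores, unsigned num_models) {
    return std::to_string(threads_per_model(logical_cores, num_models));
}
```

For the machine described above (32 logical cores, about 10 concurrent models), this gives 3 threads per model, so the total stays below the number of logical cores.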

Albi_KA · ‎07-01-2021

Yes, I use the above configuration for all executable models that I run simultaneously.

I changed the value of the property InferenceEngine::PluginConfigParams::KEY_CPU_THREADS_NUM to "10", "20", "50", and "100".

However, the value "0" still gives the best runtime. I would like to improve on the runtime of Infer() that I currently get with the value "0" for InferenceEngine::PluginConfigParams::KEY_CPU_THREADS_NUM.
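When comparing settings like these, it can help to time Infer() the same way for every configuration. The following is a minimal sketch of such a measurement using only the standard library; mean_latency_ms is a hypothetical helper, and the warm-up count is an arbitrary choice.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: calls `infer` `warmup` times untimed, then `iterations`
// times timed, and returns the mean wall-clock latency per call in milliseconds.
double mean_latency_ms(const std::function<void()>& infer,
                       int iterations, int warmup = 1) {
    for (int i = 0; i < warmup; ++i) {
        infer();  // warm-up calls are not included in the measurement
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        infer();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return iterations > 0 ? elapsed.count() / iterations : 0.0;
}

// Intended use (shown as a comment, needs OpenVINO):
//   double ms = mean_latency_ms([&] { infer_request.Infer(); }, 100);
```

To reflect the real workload, the measurement should run while the other models are also inferring, because the slowdown in question only appears under concurrent load.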

Wan_Intel · ‎07-05-2021

Hi Bianca Lamm,

Thank you for reaching out to us.

Apart from KEY_CPU_THREADS_NUM, you can change the parameter values for KEY_CPU_BIND_THREAD, KEY_CPU_THROUGHPUT_STREAMS, and KEY_ENFORCE_BF16 to improve the runtime of Infer().

For example, define a smaller number of threads: KEY_CPU_THREADS_NUM=1.

Details of the supported configuration parameters for the CPU plugin are available at the following page:

https://docs.openvinotoolkit.org/2021.4/openvino_docs_IE_DG_supported_plugins_CPU.html#supported_configuration_parameters

On another note, you can use the Post-training Optimization Tool (POT), which is designed to accelerate the inference of deep learning models by applying methods such as post-training quantization that do not require retraining or fine-tuning the model.

Details of the Post-training Optimization Tool are available at the following page:

https://docs.openvinotoolkit.org/2021.4/pot_README.html

Regards,

Wan

Wan_Intel · ‎07-13-2021

Hi Albi_KA,

This thread will no longer be monitored since we have provided suggestions.

If you need any additional information from Intel, please submit a new question.

Regards,

Wan