Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

best practice for inference on many small models

brian2
Beginner

Hi Intel & Community

I'm developing a product where we have many cameras connected to a single system.
We have small quantized image classification convolutional models, which are able to
infer in 2-4 ms per image on CPU, NPU, and GPU.

I'm currently using an Intel Core Ultra 9 285K on a Linux-based system, using the C++ OpenVINO API for inference. I have 8 different models running in approximately real time, 4 on the NPU and 4 on the CPU.
It all works smoothly.

My problem is that I want to be able to run additional models on the iGPU in the background. I want it to constantly test, say, 50 other models, evaluate their outputs, and swap the NPU and CPU models on the fly based on this analysis.

I have tried several approaches:

- Load all 50 models and keep the infer requests in a vector (roughly as in the sketch below): works, BUT it spawns more than 1000 extra threads. NOT GOOD.

- Load one model at a time and discard it when done: works, BUT the load time is approx. 100+ ms per model, which sums up to 5+ seconds, which is NOT okay for my scenario.

- Load many models onto the CPU: FAILS, ending in lots of threads AND a memory leak.

For these approaches I'm attempting to use the model cache:
core.set_property(ov::cache_dir("/the/dir/cached"));
ov::CompiledModel compiled_model = core.compile_model(xml, "GPU", ov_config);
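
For reference, the "load everything up front" variant looks roughly like this (a reduced sketch, not my exact code; ModelSlot, load_all, and the path list are placeholder names):

// Reduced sketch of the "load everything up front" approach: compile every
// model once and keep one infer request per model in a vector.
#include <openvino/openvino.hpp>
#include <string>
#include <utility>
#include <vector>

struct ModelSlot {                               // placeholder per-model container
    ov::CompiledModel compiled;
    ov::InferRequest request;
};

std::vector<ModelSlot> load_all(ov::Core &core,
                                const std::vector<std::string> &xml_paths,
                                const ov::AnyMap &ov_config) {
    core.set_property(ov::cache_dir("/the/dir/cached"));  // reuse compiled blobs on later runs
    std::vector<ModelSlot> slots;
    slots.reserve(xml_paths.size());
    for (const auto &xml : xml_paths) {
        ModelSlot slot;
        slot.compiled = core.compile_model(xml, "GPU", ov_config);
        slot.request = slot.compiled.create_infer_request();
        slots.push_back(std::move(slot));
    }
    return slots;  // with ~50 entries this is where the extra threads show up
}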


So how can I handle many small models (50+) concurrently?
Please describe a design, or link me to relevant documentation on running many small models.


Please ask if you need more info. Thank you and have a nice one!

Wan_Intel
Moderator

Hi brian2,

Thank you for reaching out to us.


I am checking with the relevant team, and we will get back to you as soon as we have an update. If you have any additional details that might help with our investigation, please feel free to share them here.



Best regards,

Wan


Echo9Zulu
Beginner

I don't know the C++ side of OpenVINO very well yet. However, I do a lot of work with the Python API for both Optimum-Intel and some OpenVINO GenAI. If you are interested, check out my project OpenArc and maybe join the Discord linked in the repo.

What options are you passing through ov_config, and have you looked at the async APIs?

 

Check this out: https://docs.openvino.ai/2025/api/c_cpp_api/classov_1_1_infer_request.html

brian2
Beginner

Thank you for the input .. and nice repo you've got; I added it to 'stuff to look into when ..'.

I'm using the simple and general ov_config:
{{ov::hint::performance_mode.name(), ov::hint::PerformanceMode::THROUGHPUT}};
and call
core.set_property(ov::cache_dir(..))

before the
ov::CompiledModel compiled_model = core.compile_model(xml, device, ov_config);
call.
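
Spelled out, the whole setup sequence is roughly this (xml and device stand in for the model path and device string, same values as above):

ov::Core core;
ov::AnyMap ov_config = {{ov::hint::performance_mode.name(),
                         ov::hint::PerformanceMode::THROUGHPUT}};
core.set_property(ov::cache_dir("/the/dir/cached"));            // set before compiling
ov::CompiledModel compiled_model = core.compile_model(xml, device, ov_config);
ov::InferRequest request = compiled_model.create_infer_request();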

I use async for CPU, NPU, and GPU (all devices on the Intel Core Ultra, none external).

My processing of the image data flow is done like this (reduced formulation, no error or sanity checks in this copy):

void InferenceHandler::process(DataStructs::DataPrepared &Data) {
    // next model request
    size_t index = next_index.fetch_add(1, std::memory_order_relaxed) % model_descriptor.num_requests;
    // wait for model to be ready for new inputs
    requests[index]->infer_request.wait();
    // mem copy input image data to model input
    auto &tensor = requests[index]->input_tensor;
    std::memcpy(tensor.data(), Data.Image.data, Data.Image.total() * Data.Image.elemSize());
    // assign callback with the current index
    auto use_index = index;
    requests[index]->infer_request.set_callback([&, use_index](std::exception_ptr ex) {
        // get output
        ov::Tensor output = requests[use_index]->infer_request.get_output_tensor();
        // copy output to mat
        cv::Mat pmap;
        tensor_to_mats_uint8_batch1(output, pmap);
        // move it along
        pmap_collector->push(pmap);
        requests[use_index]->active.store(false);
    });
    requests[index]->active.store(true);
    requests[index]->infer_request.start_async();
}

 

I have a running system with 2-4 models on both CPU and NPU, and it's running well.
My problem is that I would like to handle a larger cache of models and change them dynamically. My models have a very small footprint (2-5 MB) and low inference time (1-4 ms), intentionally, so I can run multiple models on the same input.
But loading many models (>10) on NPU or CPU causes trouble (OpenVINO memory leaks), and loading many models (say 50) onto the GPU seems to work initially, but an insane number of threads is spawned (approx. 2000).

So I would like to be able to ask a library of models to perform inference on a batch of images. I can do it, but the latency of loading a single model, unloading it, etc. is way too high.
Maybe someone is doing this in a sound and manageable way?
Or maybe I should use an external discrete GPU?
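
To make the "library of models" idea a bit more concrete, the shape I have in mind is roughly this (only a structural sketch, untested on this workload; ModelLibrary, max_compiled, and the FIFO-style eviction are all made up for illustration):

#include <openvino/openvino.hpp>
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <unordered_map>

// Keep every ov::Model read into memory, but only a bounded number compiled.
class ModelLibrary {
public:
    ModelLibrary(ov::Core &core, std::string device, ov::AnyMap config, size_t max_compiled)
        : core_(core), device_(std::move(device)), config_(std::move(config)),
          max_compiled_(max_compiled) {}

    // Reading the IR is much lighter than compiling, so do it for all models up front.
    void add(const std::string &name, const std::string &xml_path) {
        models_[name] = core_.read_model(xml_path);
    }

    // Compile on demand; drop the oldest compiled model when over the cap.
    ov::CompiledModel &get(const std::string &name) {
        auto it = compiled_.find(name);
        if (it != compiled_.end())
            return it->second;
        if (compiled_.size() >= max_compiled_) {
            compiled_.erase(order_.front());
            order_.pop_front();
        }
        order_.push_back(name);
        return compiled_[name] = core_.compile_model(models_.at(name), device_, config_);
    }

private:
    ov::Core &core_;
    std::string device_;
    ov::AnyMap config_;
    size_t max_compiled_;
    std::unordered_map<std::string, std::shared_ptr<ov::Model>> models_;
    std::unordered_map<std::string, ov::CompiledModel> compiled_;
    std::deque<std::string> order_;   // rough first-in-first-out eviction, not true LRU
};

The open question is still the ~100 ms compile hit whenever an evicted model comes back, so this only helps if the cap keeps the thread count sane without constant thrashing.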

Good day there

brian2
Beginner

.. also ..
I tried to increase the input size of the images for the GPU from 256x256 to 512x256 and 1024x256.
When changing the size, I also switch to a model which is converted and constructed for that size.
It works on the CPU, but on the GPU the inference results are completely off.

It results in all zeros .. and I'm not getting any warnings or errors.
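
One sanity check I can wrap around the memcpy while debugging this (a defensive sketch only, not a diagnosis; copy_input_checked is a made-up helper): confirm that the request's input tensor really has the byte size of the new image before copying, so a shape or size mismatch shows up explicitly instead of silently.

#include <cstring>
#include <iostream>
#include <opencv2/core.hpp>
#include <openvino/openvino.hpp>

// Copy the image into the request's input tensor only when the sizes match
// byte-for-byte; otherwise print both sizes and the tensor shape.
bool copy_input_checked(ov::InferRequest &request, const cv::Mat &image) {
    ov::Tensor tensor = request.get_input_tensor();
    const size_t mat_bytes = image.total() * image.elemSize();
    if (tensor.get_byte_size() != mat_bytes) {
        std::cerr << "input tensor: " << tensor.get_byte_size()
                  << " bytes, image: " << mat_bytes << " bytes, tensor shape:";
        for (auto d : tensor.get_shape()) std::cerr << ' ' << d;
        std::cerr << std::endl;
        return false;              // skip the memcpy instead of feeding a partial input
    }
    std::memcpy(tensor.data(), image.data, mat_bytes);
    return true;
}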

 
