Hi Intel Experts,
I am trying to run the following OpenVINO code on an Intel machine. When I run a matrix multiplication of (8192 by 8192) times (8192 by 8192) on the NPU, the program stalls on this line:
auto compiled_model = core.compile_model(model, device_string.c_str());
The same issue does not occur on the CPU or GPU, and when I reduce the matrix size to (4096 by 4096) times (4096 by 4096), it runs fine on the NPU. Is there a way to query the memory available on the NPU via OpenVINO? Do you have any debugging tools you would recommend I try? Thanks in advance!
Regards,
-yanny
PS: Below are the computer specs and the code. I also attached the cmake file in this ticket.
Computer specs:
Lunar Lake Client Platform
Processor Intel(R) Core(TM) Ultra 9 288V, 3300 MHz, 8 Core(s), 8 Logical Processor(s)
Installed RAM 32.0 GB (31.6 GB usable)
System type 64-bit operating system, x64-based processor
OS Name Microsoft Windows 11 Pro
Version 10.0.26100 Build 26100
GPU 0
Intel(R) Arc(TM) 140V GPU (16GB)
Driver version: 32.0.101.6299
Driver date: 11/15/2024
DirectX version: 12 (FL 12.1)
Physical location: PCI bus 0, device 2, function 0
Shared GPU memory 18.0 GB
GPU Memory 18.0 GB
NPU 0
Intel(R) AI Boost
Driver version: 32.0.100.3104
Driver date: 10/25/2024
DirectX version: 12 (FL 1.0 : Compute)
Physical location: PCI bus 0, device 11, function 0
Shared GPU memory 18.0 GB
GPU Memory 18.0 GB
Code:
#include <iostream>
#include <openvino/openvino.hpp>
#include <openvino/op/matmul.hpp>
ov::InferRequest inference(ov::Core& core,
                           const std::shared_ptr<const ov::Model>& model, int shape, const std::string& device_string,
                           const std::shared_ptr<ov::op::v0::Parameter>& input_A,
                           const std::shared_ptr<ov::op::v0::Parameter>& input_B,
                           const ov::Shape& input_shape_A,
                           const ov::Shape& input_shape_B,
                           std::vector<float>& input_data_A,
                           std::vector<float>& input_data_B) {
    auto compiled_model = core.compile_model(model, device_string.c_str());
    auto infer_request = compiled_model.create_infer_request();
    infer_request.set_tensor(input_A, ov::Tensor(ov::element::f32, input_shape_A, input_data_A.data()));
    infer_request.set_tensor(input_B, ov::Tensor(ov::element::f32, input_shape_B, input_data_B.data()));
    return infer_request;
}
void print_tensor(const ov::Tensor& tensor)
{
    const float* output_data = tensor.data<float>();
    std::cout << "result: ";
    for (size_t i = 0; i < tensor.get_size(); ++i) {
        std::cout << output_data[i] << " ";
    }
    std::cout << std::endl;
}
int main() {
    try
    {
        ov::Core core;
        constexpr int shape = 4096 * 2;
        ov::Shape input_shape_A{ shape, shape };
        ov::Shape input_shape_B{ shape, shape };
        auto input_A = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape_A);
        auto input_B = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape_B);
        bool transpose_A = false;
        bool transpose_B = true;
        auto matmul_op = std::make_shared<ov::op::v0::MatMul>(input_A, input_B, transpose_A, transpose_B);
        auto result = std::make_shared<ov::op::v0::Result>(matmul_op);
        ov::ParameterVector inputs{ input_A, input_B };
        ov::ResultVector results{ result };
        auto model = std::make_shared<ov::Model>(results, inputs, "MatMulModel");
        std::vector<float> input_data_A(shape * shape, 2.0f);
        std::vector<float> input_data_B(shape * shape, 4.0f);
        std::vector<ov::InferRequest> infer_requests;
        infer_requests.emplace_back(inference(core, model, shape, "NPU", input_A, input_B, input_shape_A, input_shape_B, input_data_A, input_data_B));
        for (auto& infer_request : infer_requests) {
            infer_request.infer();
        }
        std::vector<ov::Tensor> output_tensors;
        for (auto& infer_request : infer_requests) {
            infer_request.wait();
            output_tensors.emplace_back(infer_request.get_output_tensor());
        }
        for (const auto& output_tensor : output_tensors) {
            print_tensor(output_tensor);
        }
    }
    catch (const std::exception& e) // catch by const reference so e.what() reports the actual error
    {
        std::cout << "Exception: " << e.what() << std::endl;
    }
    return 0;
}
Hello Yanny,
Thank you for reaching out.
Compared to the CPU and GPU, the NPU has the least memory available to it, and a large matrix multiplication consumes a lot of memory: each 8192 x 8192 f32 tensor is 8192 * 8192 * 4 bytes = 256 MiB, so the two inputs plus the output already need about 768 MiB before counting any intermediate buffers the compiler allocates. That is why reducing the matrix size resolves the issue.
Currently, the only way to check the memory usage of the NPU is through the Windows Task Manager.
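That said, you can list whatever properties the NPU plugin exposes through the OpenVINO runtime API. Below is a minimal sketch, assuming a recent OpenVINO C++ runtime; whether any memory-related read-only properties show up in the output depends on the NPU plugin and driver version installed.
#include <iostream>
#include <openvino/openvino.hpp>
int main() {
    ov::Core core;
    // Ask the NPU plugin which properties it supports, then print each value.
    auto supported = core.get_property("NPU", ov::supported_properties);
    for (const auto& name : supported) {
        std::string value;
        try {
            value = core.get_property("NPU", name).as<std::string>();
        } catch (const std::exception&) {
            value = "<unreadable>";
        }
        std::cout << name << " : " << value << std::endl;
    }
    return 0;
}
If no memory-related property appears in that list, Task Manager remains the only option.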
Regards,
Zul
Thank you for your question. If you need any additional information from Intel, please submit a new question, as this thread is no longer being monitored.
