Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

OpenVINO NPU stall in C++ for matrix multiplication of (8192 by 8192) times (8192 by 8192)

yanny
Novice

Hi Intel Experts,

I am trying to run the OpenVINO C++ code below on an Intel machine. When I run a matrix multiplication of (8192 by 8192) times (8192 by 8192) on the NPU, the program stalls on this line:

 auto compiled_model = core.compile_model(model, device_string.c_str());

It does not have the same issue on the CPU or the GPU. When I reduce the matrix size to (4096 by 4096) times (4096 by 4096), it runs fine on the NPU. Is there a way to query the memory available on the NPU via OpenVINO? Is there any debugging tool you would recommend? Thanks in advance!
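
The closest generic query I know of is enumerating whatever properties each device plugin reports, in the same style as the hello_query_device sample. A minimal sketch is below; I do not know whether the NPU plugin actually exposes a memory-size property among these.

#include <exception>
#include <iostream>

#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Dump every read-only property each device plugin chooses to report;
    // the exact property names vary from plugin to plugin.
    for (const auto& device : core.get_available_devices()) {
        std::cout << device << " ("
                  << core.get_property(device, ov::device::full_name) << ")\n";
        for (const auto& prop : core.get_property(device, ov::supported_properties)) {
            if (prop == ov::supported_properties.name())
                continue;
            std::cout << "  " << prop << ": ";
            try {
                std::cout << core.get_property(device, prop).as<std::string>() << "\n";
            } catch (const std::exception&) {
                std::cout << "(not printable)\n";
            }
        }
    }
    return 0;
}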

Regards,
-yanny

PS: Below are the computer specs and the code. I also attached the CMake file to this ticket.

Computer specs:

Lunar Lake Client Platform
Processor Intel(R) Core(TM) Ultra 9 288V, 3300 Mhz, 8 Core(s), 8 Logical Processor(s)
Installed RAM 32.0 GB (31.6 GB usable)
System type 64-bit operating system, x64-based processor
OS Name Microsoft Windows 11 Pro
Version 10.0.26100 Build 26100
GPU 0
Intel(R) Arc(TM) 140V GPU (16GB)
Driver version: 32.0.101.6299
Driver date: 11/15/2024
DirectX version: 12 (FL 12.1)
Physical location: PCI bus 0, device 2, function 0
Shared GPU memory 18.0 GB
GPU Memory 18.0 GB
NPU 0
Intel(R) AI Boost
Driver version: 32.0.100.3104
Driver date: 10/25/2024
DirectX version: 12 (FL 1.0 : Compute)
Physical location: PCI bus 0, device 11, function 0
Shared GPU memory 18.0 GB
GPU Memory 18.0 GB

Code:

 

#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <openvino/openvino.hpp> 
#include <openvino/op/matmul.hpp>

// Compile the model for the requested device, create an infer request,
// and bind the two input tensors. compile_model() is where the NPU run stalls.
ov::InferRequest inference(ov::Core & core, 
    const std::shared_ptr<const ov::Model>& model, int shape, const std::string & device_string, 
    const std::shared_ptr<ov::op::v0::Parameter>& input_A,
    const std::shared_ptr<ov::op::v0::Parameter>& input_B,
    const ov::Shape& input_shape_A,
    const ov::Shape& input_shape_B,
    std::vector<float>& input_data_A,
    std::vector<float>& input_data_B) {

    auto compiled_model = core.compile_model(model, device_string.c_str());
    auto infer_request = compiled_model.create_infer_request();

    infer_request.set_tensor(input_A, ov::Tensor(ov::element::f32, input_shape_A, input_data_A.data()));
    infer_request.set_tensor(input_B, ov::Tensor(ov::element::f32, input_shape_B, input_data_B.data()));

    return infer_request;
}

void print_tensor(const ov::Tensor & tensor)
{
    const float* output_data = tensor.data<float>();

    std::cout << "result: ";
    for (size_t i = 0; i < tensor.get_size(); ++i) {
        std::cout << output_data[i] << " ";
    }
    std::cout << std::endl;
}

int main() {
    try
    {
        ov::Core core;

        constexpr int shape = 4096 * 2;  // 8192: stalls on NPU; 4096 works
        ov::Shape input_shape_A{ shape, shape };
        ov::Shape input_shape_B{ shape, shape };

        auto input_A = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape_A);
        auto input_B = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape_B);

        bool transpose_A = false;
        bool transpose_B = true;

        auto matmul_op = std::make_shared<ov::op::v0::MatMul>(input_A, input_B, transpose_A, transpose_B);

        auto result = std::make_shared<ov::op::v0::Result>(matmul_op);

        ov::ParameterVector inputs{ input_A, input_B };
        ov::ResultVector results{ result };
        auto model = std::make_shared<ov::Model>(results, inputs, "MatMulModel");

        std::vector<float> input_data_A(shape * shape, 2.0f);
        std::vector<float> input_data_B(shape * shape, 4.0f);
        
        std::vector<ov::InferRequest> infer_requests;

        infer_requests.emplace_back(inference(core, model, shape, "NPU", input_A, input_B, input_shape_A, input_shape_B, input_data_A, input_data_B));

        for (auto& infer_request : infer_requests) {
            infer_request.infer();
        }

        std::vector<ov::Tensor> output_tensors;

        for (auto& infer_request : infer_requests) {
            infer_request.wait();
            output_tensors.emplace_back(infer_request.get_output_tensor());
        }

        for (const auto& output_tensor : output_tensors) {
            print_tensor(output_tensor);
        }
    }
    catch (const std::exception& e)  // catch by const reference, not by value
    {
        std::cout << "Exception: " << e.what() << std::endl;
    }

    return 0;
}

 

 

2 Replies
Zulkifli_Intel
Moderator

Hello Yanny,

Thank you for reaching out.


Compared to the CPU and GPU, the NPU has the smallest onboard memory, and a matrix multiplication of this size consumes a large amount of it. That is why reducing the matrix size resolves the issue.
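
For a rough sense of scale (a back-of-the-envelope estimate only; it ignores the compiler's workspace and any intermediate buffers), the two FP32 inputs plus the output of an 8192 x 8192 matmul already add up to about 768 MiB, versus about 192 MiB for the 4096 case:

#include <cstdint>
#include <iostream>

int main() {
    for (std::uint64_t n : {4096ull, 8192ull}) {
        // Two input matrices plus one output, each n x n FP32 (4 bytes per element).
        const std::uint64_t bytes = 3ull * n * n * sizeof(float);
        std::cout << n << " x " << n << ": "
                  << bytes / (1024.0 * 1024.0) << " MiB minimum\n";
    }
    return 0;
}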

 

Currently, the only way to check the memory usage of the NPU is through Windows Task Manager.
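
If you need more visibility into where the compilation stalls, one thing you can try is requesting verbose logs from the plugin through the generic ov::log::level property before compiling. Whether and how the NPU plugin honors this may depend on your driver and OpenVINO version, so treat this only as a sketch:

#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Ask the NPU plugin for verbose log output before compiling the model.
    core.set_property("NPU", ov::log::level(ov::log::Level::DEBUG));
    // ... build or read your model here, then compile as usual:
    // auto compiled_model = core.compile_model(model, "NPU");
    return 0;
}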

 

 

Regards,

Zul


Zulkifli_Intel
Moderator

Thank you for your question. If you need any additional information from Intel, please submit a new question, as this thread is no longer being monitored.

