Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

OpenVINO NPU stall in C++ for matrix multiplication of (8192 by 8192) times (8192 by 8192)

yanny
Novice

Hi Intel Experts,

I am trying to run the OpenVINO C++ code below on an Intel machine. When I run a matrix multiplication of (8192 by 8192) times (8192 by 8192) on the NPU, the program stalls on this line:

 auto compiled_model = core.compile_model(model, device_string.c_str());

It does not have the same issue on the CPU or the GPU. When I reduce the matrix size to (4096 by 4096) times (4096 by 4096), it runs fine on the NPU. Is there a way to query the memory available on the NPU via OpenVINO? Is there any debugging tool you would recommend? Thanks in advance!
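
The closest generic query I know of is enumerating whatever properties each device plugin reports, in the same style as the hello_query_device sample. A minimal sketch is below; I do not know whether the NPU plugin actually exposes a memory-size property among these.

#include <exception>
#include <iostream>

#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Dump every read-only property each device plugin chooses to report;
    // the exact property names vary from plugin to plugin.
    for (const auto& device : core.get_available_devices()) {
        std::cout << device << " ("
                  << core.get_property(device, ov::device::full_name) << ")\n";
        for (const auto& prop : core.get_property(device, ov::supported_properties)) {
            if (prop == ov::supported_properties.name())
                continue;
            std::cout << "  " << prop << ": ";
            try {
                std::cout << core.get_property(device, prop).as<std::string>() << "\n";
            } catch (const std::exception&) {
                std::cout << "(not printable)\n";
            }
        }
    }
    return 0;
}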

Regards,
-yanny

PS: Below are the computer specs and the code. I also attached the CMake file to this ticket.

Computer specs:

Lunar Lake Client Platform
Processor Intel(R) Core(TM) Ultra 9 288V, 3300 Mhz, 8 Core(s), 8 Logical Processor(s)
Installed RAM 32.0 GB (31.6 GB usable)
System type 64-bit operating system, x64-based processor
OS Name Microsoft Windows 11 Pro
Version 10.0.26100 Build 26100
GPU 0
Intel(R) Arc(TM) 140V GPU (16GB)
Driver version: 32.0.101.6299
Driver date: 11/15/2024
DirectX version: 12 (FL 12.1)
Physical location: PCI bus 0, device 2, function 0
Shared GPU memory 18.0 GB
GPU Memory 18.0 GB
NPU 0
Intel(R) AI Boost
Driver version: 32.0.100.3104
Driver date: 10/25/2024
DirectX version: 12 (FL 1.0 : Compute)
Physical location: PCI bus 0, device 11, function 0
Shared GPU memory 18.0 GB
GPU Memory 18.0 GB

Code:

 

#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <openvino/openvino.hpp> 
#include <openvino/op/matmul.hpp>

// Compile the model for the requested device, create an infer request,
// and bind the two input tensors. compile_model() is where the NPU run stalls.
ov::InferRequest inference(ov::Core & core, 
    const std::shared_ptr<const ov::Model>& model, int shape, const std::string & device_string, 
    const std::shared_ptr<ov::op::v0::Parameter>& input_A,
    const std::shared_ptr<ov::op::v0::Parameter>& input_B,
    const ov::Shape& input_shape_A,
    const ov::Shape& input_shape_B,
    std::vector<float>& input_data_A,
    std::vector<float>& input_data_B) {

    auto compiled_model = core.compile_model(model, device_string.c_str());
    auto infer_request = compiled_model.create_infer_request();

    infer_request.set_tensor(input_A, ov::Tensor(ov::element::f32, input_shape_A, input_data_A.data()));
    infer_request.set_tensor(input_B, ov::Tensor(ov::element::f32, input_shape_B, input_data_B.data()));

    return infer_request;
}

void print_tensor(const ov::Tensor & tensor)
{
    const float* output_data = tensor.data<float>();

    std::cout << "result: ";
    for (size_t i = 0; i < tensor.get_size(); ++i) {
        std::cout << output_data[i] << " ";
    }
    std::cout << std::endl;
}

int main() {
    try
    {
        ov::Core core;

        constexpr int shape = 4096 * 2;  // 8192: stalls on NPU; 4096 works
        ov::Shape input_shape_A{ shape, shape };
        ov::Shape input_shape_B{ shape, shape };

        auto input_A = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape_A);
        auto input_B = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape_B);

        bool transpose_A = false;
        bool transpose_B = true;

        auto matmul_op = std::make_shared<ov::op::v0::MatMul>(input_A, input_B, transpose_A, transpose_B);

        auto result = std::make_shared<ov::op::v0::Result>(matmul_op);

        ov::ParameterVector inputs{ input_A, input_B };
        ov::ResultVector results{ result };
        auto model = std::make_shared<ov::Model>(results, inputs, "MatMulModel");

        std::vector<float> input_data_A(shape * shape, 2.0f);
        std::vector<float> input_data_B(shape * shape, 4.0f);
        
        std::vector<ov::InferRequest> infer_requests;

        infer_requests.emplace_back(inference(core, model, shape, "NPU", input_A, input_B, input_shape_A, input_shape_B, input_data_A, input_data_B));

        for (auto& infer_request : infer_requests) {
            infer_request.infer();
        }

        std::vector<ov::Tensor> output_tensors;

        for (auto& infer_request : infer_requests) {
            infer_request.wait();
            output_tensors.emplace_back(infer_request.get_output_tensor());
        }

        for (const auto& output_tensor : output_tensors) {
            print_tensor(output_tensor);
        }
    }
    catch (const std::exception& e)  // catch by const reference, not by value
    {
        std::cout << "Exception: " << e.what() << std::endl;
    }

    return 0;
}

 

 

2 Replies
Zulkifli_Intel
Moderator

Hello Yanny,

Thank you for reaching out.


Compared to the CPU and GPU, the NPU has the smallest onboard memory, and a matrix multiplication of this size consumes a large amount of it. That is why reducing the matrix size resolves the issue.
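
For a rough sense of scale (a back-of-the-envelope estimate only; it ignores the compiler's workspace and any intermediate buffers), the two FP32 inputs plus the output of an 8192 x 8192 matmul already add up to about 768 MiB, versus about 192 MiB for the 4096 case:

#include <cstdint>
#include <iostream>

int main() {
    for (std::uint64_t n : {4096ull, 8192ull}) {
        // Two input matrices plus one output, each n x n FP32 (4 bytes per element).
        const std::uint64_t bytes = 3ull * n * n * sizeof(float);
        std::cout << n << " x " << n << ": "
                  << bytes / (1024.0 * 1024.0) << " MiB minimum\n";
    }
    return 0;
}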

 

Currently, the only way to check the memory usage of the NPU is through Windows Task Manager.
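
If you need more visibility into where the compilation stalls, one thing you can try is requesting verbose logs from the plugin through the generic ov::log::level property before compiling. Whether and how the NPU plugin honors this may depend on your driver and OpenVINO version, so treat this only as a sketch:

#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Ask the NPU plugin for verbose log output before compiling the model.
    core.set_property("NPU", ov::log::level(ov::log::Level::DEBUG));
    // ... build or read your model here, then compile as usual:
    // auto compiled_model = core.compile_model(model, "NPU");
    return 0;
}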

 

 

Regards,

Zul


Zulkifli_Intel
Moderator

Thank you for your question. If you need any additional information from Intel, please submit a new question, as this thread is no longer being monitored.

