Hi Intel Experts,
I am running a test to measure the performance of the GPU and NPU with OpenVINO. I noticed that the compile time on the NPU is much higher than on the GPU, and the compiled model seems to be cached somewhere locally on my machine. Do you have any information on why the NPU compile time is so much longer than the GPU's? And where is the cached compiled model saved on the computer?
Thank you very much!
Regards,
-yanny
Here is the code:
#include <openvino/openvino.hpp>
#include <openvino/op/matmul.hpp>
#include <iostream>
#include <chrono>
ov::InferRequest inference(ov::Core& core,
                           const std::shared_ptr<const ov::Model>& model, int shape, const std::string& device_string,
                           const std::shared_ptr<ov::op::v0::Parameter>& input_A,
                           const std::shared_ptr<ov::op::v0::Parameter>& input_B,
                           const ov::Shape& input_shape_A,
                           const ov::Shape& input_shape_B,
                           std::vector<float>& input_data_A,
                           std::vector<float>& input_data_B) {
    auto compiled_model = core.compile_model(model, device_string.c_str());
    auto infer_request = compiled_model.create_infer_request();
    infer_request.set_tensor(input_A, ov::Tensor(ov::element::f32, input_shape_A, input_data_A.data()));
    infer_request.set_tensor(input_B, ov::Tensor(ov::element::f32, input_shape_B, input_data_B.data()));
    return infer_request;
}
int main(int argc, char* argv[])
{
    // args format, e.g.:
    //   CPU 1
    //   GPU 10
    //   NPU 10
    std::string device_string = "CPU";
    int iteration = 10;
    // only look at the last two arguments: <device> <iterations>
    if (argc >= 2) {
        std::string value = argv[argc - 1];
        std::string key = argv[argc - 2];
        try {
            iteration = std::stoi(value); // convert string to int
            device_string = key;
        }
        catch (const std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
        }
    }
    try
    {
        ov::Core core;
        constexpr int shape = 1024 * 4;
        ov::Shape input_shape{ shape, shape };
        auto input_A = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape);
        auto input_B = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape);
        constexpr bool transpose_matrix_first = false;
        constexpr bool transpose_matrix_second = false;
        auto final_matmul = std::make_shared<ov::op::v0::MatMul>(input_A, input_B, transpose_matrix_first, transpose_matrix_second);
        // chain (iteration - 1) more MatMul nodes so the graph grows with the iteration count
        for (int i = 1; i < iteration; i++) {
            final_matmul = std::make_shared<ov::op::v0::MatMul>(final_matmul, input_B, transpose_matrix_first, transpose_matrix_second);
        }
        auto result = std::make_shared<ov::op::v0::Result>(final_matmul);
        ov::ParameterVector inputs{ input_A, input_B };
        ov::ResultVector results{ result };
        auto model = std::make_shared<ov::Model>(results, inputs, "MatMulModel");
        std::vector<float> input_data_A(shape * shape);
        std::vector<float> input_data_B(shape * shape);
        // initialize the input data on the CPU
        input_data_A[0] = 1.f;
        input_data_B[0] = 2.f;
        constexpr float value_inc_A = 0.001f;
        constexpr float value_inc_B = 0.01f;
        for (size_t i = 1; i < input_data_A.size(); ++i) {
            input_data_A[i] = input_data_A[i - 1] + value_inc_A;
            input_data_B[i] = input_data_B[i - 1] + value_inc_B;
        }
        std::cout << "device: " << device_string << ", iteration = " << iteration << ", shape = [" << shape << "x" << shape << "]" << std::endl;
        auto timer_compile = std::chrono::high_resolution_clock::now();
        std::vector<ov::InferRequest> infer_requests;
        infer_requests.emplace_back(inference(core, model, shape, device_string, input_A, input_B, input_shape, input_shape, input_data_A, input_data_B));
        auto time_elapsed_compile = std::chrono::high_resolution_clock::now() - timer_compile;
        auto time_elapsed_compile_ms = std::chrono::duration_cast<std::chrono::milliseconds>(time_elapsed_compile);
        std::cout << "Model compile time: " << time_elapsed_compile_ms.count() / 1000.0 << " seconds" << std::endl;
        auto timer_infer = std::chrono::high_resolution_clock::now();
        for (auto& infer_request : infer_requests) {
            infer_request.infer();
        }
        std::vector<ov::Tensor> output_tensors;
        for (auto& infer_request : infer_requests) {
            infer_request.wait();
            output_tensors.emplace_back(infer_request.get_output_tensor());
        }
        auto time_elapsed_infer = std::chrono::high_resolution_clock::now() - timer_infer;
        auto time_elapsed_infer_ms = std::chrono::duration_cast<std::chrono::milliseconds>(time_elapsed_infer);
        std::cout << "Model inference time: " << time_elapsed_infer_ms.count() / 1000.0 << " seconds" << std::endl;
    }
    catch (const std::exception& e) {
        std::cerr << "Exception: " << e.what() << std::endl;
    }
    return 0;
}
Result:
.\openvino-matmul.exe NPU 2
device: NPU, iteration = 2, shape = [4096x4096]
Model compile time: 121.245 seconds
Model inference time: 0.225 seconds
.\openvino-matmul.exe NPU 2
device: NPU, iteration = 2, shape = [4096x4096]
Model compile time: 0.247 seconds
Model inference time: 0.201 seconds
.\openvino-matmul.exe NPU 3
device: NPU, iteration = 3, shape = [4096x4096]
Model compile time: 216.919 seconds
Model inference time: 0.282 seconds
.\openvino-matmul.exe NPU 3
device: NPU, iteration = 3, shape = [4096x4096]
Model compile time: 0.329 seconds
Model inference time: 0.259 seconds
.\openvino-matmul.exe NPU 3
device: NPU, iteration = 3, shape = [4096x4096]
Model compile time: 0.328 seconds
Model inference time: 0.264 seconds
.\openvino-matmul.exe GPU 2
device: GPU, iteration = 2, shape = [4096x4096]
Model compile time: 0.27 seconds
Model inference time: 0.049 seconds
.\openvino-matmul.exe GPU 2
device: GPU, iteration = 2, shape = [4096x4096]
Model compile time: 0.203 seconds
Model inference time: 0.043 seconds
.\openvino-matmul.exe GPU 3
device: GPU, iteration = 3, shape = [4096x4096]
Model compile time: 0.208 seconds
Model inference time: 0.051 seconds
.\openvino-matmul.exe GPU 3
device: GPU, iteration = 3, shape = [4096x4096]
Model compile time: 0.205 seconds
Model inference time: 0.05 seconds
Hi Yanny,
Thanks for reaching out. NPU compilation in OpenVINO takes longer than GPU compilation because the NPU plugin uses Ahead-of-Time (AOT) compilation, applies more graph optimizations, and generates hardware-specific kernels. Unlike GPUs, which rely on precompiled OpenCL kernels, NPUs require additional processing to optimize execution.
To avoid recompilation, OpenVINO caches compiled models in:
- Windows: C:\Users\<your_username>\.cache\blob_cache
- Linux/macOS: ~/.cache/blob_cache
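If you prefer to control the cache location yourself, you can also set the ov::cache_dir property on the Core before compiling. Below is a minimal C++ sketch, assuming a local folder named "model_cache" and a placeholder model file "model.xml" (both names are only examples, not part of your program):

#include <openvino/openvino.hpp>
#include <iostream>

int main() {
    ov::Core core;
    // Enable OpenVINO's model cache; the directory is created if it does not exist.
    core.set_property(ov::cache_dir("model_cache"));
    // The first compile_model() call for a given model/device pair populates the cache;
    // later runs load the cached blob instead of recompiling, which removes most of the
    // long NPU compile time you measured.
    auto model = core.read_model("model.xml");              // placeholder model path
    auto compiled_model = core.compile_model(model, "NPU");
    std::cout << "Compiled (or loaded from cache) for NPU" << std::endl;
    return 0;
}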
You may run the Python snippet below to check the cache location configured in OpenVINO:

from openvino.runtime import Core

ie = Core()
cache_path = ie.get_property("GPU", "CACHE_DIR")  # Change "GPU" to "NPU" if needed
print(f"OpenVINO Model Cache Directory: {cache_path}")
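In C++ you can also export the compiled model once and import the blob in later runs, which skips compilation entirely. This is a minimal sketch, not tied to your exact program; "matmul_npu.blob" and "model.xml" are placeholder file names:

#include <openvino/openvino.hpp>
#include <fstream>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");              // placeholder model path
    auto compiled_model = core.compile_model(model, "NPU");
    // Pay the long compile cost once and save the resulting blob to disk.
    {
        std::ofstream blob_out("matmul_npu.blob", std::ios::binary);
        compiled_model.export_model(blob_out);
    }
    // On later runs, import the blob and create the infer request without recompiling.
    std::ifstream blob_in("matmul_npu.blob", std::ios::binary);
    auto imported_model = core.import_model(blob_in, "NPU");
    auto infer_request = imported_model.create_infer_request();
    return 0;
}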
Regards,
Aznie
Hi Yanny,
This thread will no longer be monitored since we have provided a solution. If you need any additional information from Intel, please submit a new question.
Regards,
Aznie
