Hi,
I am working on the interop between GPU and NPU on LNL using Windows. Based on the NPU remote tensor plugin documentation, it appears that the plugin only supports creating an L0 context from the OpenVINO core or a compiled model rather than accepting an existing context like the GPU device does.
My question is: if I want to share SYCL device memory with the NPU for inference, does that mean I need to use the NPU remote tensor plugin to create an L0 context, allocate shared memory from that context for computation, and finally pass it back to OpenVINO for inference? Or is there an alternative approach or best practice to achieve this?
Thank you!
Hi dinghao1,
Thank you for reaching out to us.
Primarily, if you want to share SYCL memory with the NPU for inference in OpenVINO, the method involves using the NPU Remote Tensor Plugin to create an L0 context, allocating shared memory from it, and passing that memory back to OpenVINO for inference. Let me confirm this approach with the engineering team, check whether there is an alternative, and get back to you.
Regards,
Zul
Hi Zul,
Thank you for looking into this.
I still have two more questions:
- For the shared memory allocated from the NPU L0 context, if I want to access GPU memory in another context, does it mean I still need to copy the memory to the shared buffer?
- I am attempting to create a SYCL context from the NPU L0 context (refer to this). My code is as follows:
```cpp
auto npu_context = compiled_model.get_context().as<ov::intel_npu::level_zero::ZeroContext>();
ze_context_handle_t ze_ctx = static_cast<ze_context_handle_t>(npu_context.get());

sycl::platform sycl_plat;
std::vector<sycl::device> sycl_devices;
for (const auto& p : sycl::platform::get_platforms()) {
    if (p.get_backend() == sycl::backend::ext_oneapi_level_zero) {
        sycl_plat = p;
        sycl_devices = p.get_devices();
        break;
    }
}

sycl::backend_input_t<sycl::backend::ext_oneapi_level_zero, sycl::context> ctx_input{
    ze_ctx,
    sycl_devices,
    sycl::ext::oneapi::level_zero::ownership::keep
};
sycl::context sycl_ctx = sycl::make_context<sycl::backend::ext_oneapi_level_zero>(ctx_input);
```
However, this approach throws an exception. Do you know the correct way to achieve this?
I appreciate your help and look forward to your reply.
Regards,
Dinghao
Hi dinghao1,
I just received feedback from the developer. SYCL is a wrapper over OpenCL. The Level Zero API and OpenCL API allow memory to be shared using a dma-buf (on Linux) or an NT handle (on Windows); users can import and export these handles. Here you can find details on how to share such memory with the NPU through the remote tensor feature.
Indeed, if you create a host Level Zero tensor from the NPU, it can be used without a memory copy only within the same Level Zero context that was used to create it. Please note that you need to use the same ov::Core object to keep the same Level Zero context; creating different ov::Core objects will simply create different Level Zero contexts.
Here are some recommendations:
1. Use OpenVINO's NPU Context: The NPU plugin requires using the Level Zero (L0) context provided by its runtime. Obtain this context from the compiled model:
```cpp
auto npu_context = compiled_model.get_context().as<ov::intel_npu::level_zero::ZeroContext>();
ze_context_handle_t ze_ctx = npu_context.get();
```
Ensure Correct SYCL Device Selection: When creating the SYCL context, ensure the devices correspond to the NPU. Filter devices by type or name:
```cpp
sycl::device npu_sycl_dev;
for (const auto& dev : sycl::device::get_devices(sycl::info::device_type::all)) {
    // Match the NPU by device type and name; the exact name string may vary by driver.
    if (dev.is_accelerator() &&
        dev.get_info<sycl::info::device::name>().find("NPU") != std::string::npos) {
        npu_sycl_dev = dev;
        break;
    }
}
std::vector<sycl::device> sycl_devices = {npu_sycl_dev};
```
Create SYCL Context with L0 Handle:
```cpp
sycl::backend_input_t<sycl::backend::ext_oneapi_level_zero, sycl::context> ctx_input{
    ze_ctx,
    sycl_devices,
    sycl::ext::oneapi::level_zero::ownership::keep  // Use 'keep' as OpenVINO manages the context
};
sycl::context sycl_ctx = sycl::make_context<sycl::backend::ext_oneapi_level_zero>(ctx_input);
```
2. Memory Allocation and Sharing
Allocate Memory via NPU Context: Use the L0 context from OpenVINO to allocate shared memory accessible by both SYCL and NPU:
```cpp
// Example using SYCL USM allocation on the NPU context
void* shared_mem = sycl::malloc_shared(size, npu_sycl_dev, sycl_ctx);
```
Wrap as Remote Tensor: Use OpenVINO's Remote Tensor API to pass this memory for inference:
```cpp
// Note: remote tensors are created from the NPU remote context (not from the
// compiled model); create_tensor wraps an existing buffer given its element
// type, shape, and pointer.
ov::RemoteTensor remote_tensor = npu_context.create_tensor(ov::element::f32, shape, shared_mem);
```
3. Cross-Device Memory Access (NPU ↔ GPU)
If the GPU and NPU share the same L0 context/driver and support cross-device memory, no copy is needed. This is hardware/driver-dependent. You can also use SYCL to copy between GPU and NPU buffers.
Regards,
Zul
This thread will no longer be monitored since we have provided a solution. If you need any additional information from Intel, please submit a new question.