Scaling Intel Neural Processing Unit (NPU) in AI Client Ecosystem, with DirectML on Windows MCDM (Microsoft Compute Driver Model) Architecture

By Rutvi Trivedi, Murali Ambati, and Jaskaran Singh Nagi

 

AI (Artificial Intelligence) scenarios in the PC (Personal Computer) client ecosystem have grown significantly in the last few years and are expected to continue their accelerated growth with the advent of generative AI and Copilot. Customers are continuously adding AI to their end-user applications across segments including collaboration, creative, gaming, and productivity.

To deliver AI across these usages, Intel has been working to solve the challenges that come with exponentially growing system requirements such as compute, memory bandwidth, and memory capacity. One of the more difficult tasks is building an AI software stack that addresses these challenges by making optimal use of hardware capabilities while exposing an abstracted API (Application Programming Interface) to developers. This is exactly what Intel has done.

With the release of Intel Core Ultra processors, Intel has integrated a Neural Processing Unit (NPU), known as Intel AI Boost, on every platform. We have embraced open standards like the Open Neural Network eXchange (ONNX), and through a strong collaboration with Microsoft we are delivering software and hardware that simplify AI programmability at scale.

The rest of this blog is organized as follows: 1) we introduce Intel’s NPU architecture, 2) we delve into the software architecture components that enable the NPU with DirectML, and 3) we close with a summary.

 

Intel AI Boost: Introduction to Intel’s Neural Processing Unit

Intel’s NPU is a power-efficient AI accelerator integrated into every Intel Core Ultra processor. What makes Intel AI Boost so capable is its unique architecture set up as a pipeline with a mix of compute acceleration and data transfer capabilities.

For compute acceleration, Intel AI Boost features a scalable architecture of multiple tiles – Neural Compute Engines – packed with hardware acceleration blocks for compute-intensive AI operations like Matrix Multiplication, Convolution, etc. For general compute needs, the Neural Compute Engines feature Streaming Hybrid Architecture Vector Engines (SHAVE) for high performance parallel computing.

To take full advantage of the available compute capacity, Intel AI Boost includes features that support efficient data transfers to keep the compute engines saturated for maximum performance. These include DMA (Direct Memory Access) engines that shuttle data between system DRAM (Dynamic Random Access Memory) and a software-managed cache. A built-in device MMU (Memory Management Unit) plus IOMMU (Input-Output Memory Management Unit) support multiple simultaneous hardware contexts and provide security isolation between execution contexts per the MCDM (Microsoft Compute Driver Model) architecture.

All these components combine to deliver high-performance, efficient AI acceleration. But the magic happens in the software, through compiler technology working within the MCDM architecture, which orchestrates and executes AI workloads in parallel by directing compute and data flow in a tiling fashion with built-in and programmable control flow. This also maximizes compute utilization by executing primarily out of scratchpad SRAM and minimizing data transfers between SRAM and DRAM, enabling optimal performance per watt for AI workloads.


To enable the PC ecosystem to take advantage of Intel AI Boost, Intel partnered with Microsoft to establish a software architecture that can utilize Intel AI Boost hardware resources, re-thinking AI acceleration infrastructure to deliver high-performance and high-efficiency AI to the PC ecosystem. The result is DirectML API support for Intel AI Boost, enabling developers to seamlessly take advantage of all the hardware features described above.

Software Architecture for Intel AI Boost

Windows developers can take advantage of Intel AI Boost to accelerate their workloads on Intel Core Ultra processors by leveraging ONNX Runtime APIs or DirectML directly.
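
As a minimal sketch of the ONNX Runtime path, the snippet below creates a session on the DirectML execution provider. The model path and device index are hypothetical placeholders, and which DirectML device index maps to the NPU depends on how adapters are enumerated on a given system.

```cpp
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "npu-sample");

    Ort::SessionOptions options;
    // The DirectML execution provider requires memory pattern optimization
    // to be disabled and sequential execution.
    options.DisableMemPattern();
    options.SetExecutionMode(ORT_SEQUENTIAL);

    // Route the workload to a DirectML device. Device index 0 is a
    // placeholder; the index that corresponds to the NPU depends on how
    // DirectML enumerates adapters on the machine.
    Ort::ThrowOnError(
        OrtSessionOptionsAppendExecutionProvider_DML(options, /*device_id=*/0));

    // "model.onnx" is a hypothetical model path used for illustration.
    Ort::Session session(env, L"model.onnx", options);
    return 0;
}
```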


Beneath the DirectML layer, Intel exposes the Intel AI Boost accelerator as a compute device through the Microsoft Compute Driver Model (MCDM). This driver architecture was defined for compute-only devices like Intel’s NPU and is enabled for the first time with the Intel NPU. MCDM is the foundational component that exposes the NPU to DirectML as an acceleration device, one of the key steps in making the DirectML-on-NPU design feasible.

The advantage of the MCDM architecture is that it leverages the WDDM (Windows Display Driver Model) compute architecture to expose NPU capabilities while reusing the strengths of the WDDM framework in the OS (Operating System) for work scheduling, power management, and tooling for functional and performance debug. Intel’s NPU driver implements a D3D12-compliant user-mode library that provides hardware acceleration for compute operators, generates an optimal schedule for model execution, and submits workloads to the device via the MCDM kernel-mode driver. By leveraging the DirectML abstraction above D3D12, Intel efficiently uses the hardware components described in the previous section while hiding that complexity from developers, keeping the developer experience seamless and scalable.
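
To make the device model concrete, here is a minimal sketch, not taken from Intel’s driver, of how an application can use DXCore to enumerate D3D12 core-compute adapters; under MCDM, a compute-only device such as the NPU is surfaced through this path alongside GPUs. The compute-only check below is an illustrative assumption about how one might distinguish such a device.

```cpp
#include <initguid.h>
#include <dxcore.h>
#include <wrl/client.h>
#include <cstdint>

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<IDXCoreAdapterFactory> factory;
    DXCoreCreateAdapterFactory(IID_PPV_ARGS(&factory));

    // Ask for every adapter that supports D3D12 core compute; under MCDM a
    // compute-only device such as the NPU is enumerated here alongside GPUs.
    const GUID attributes[] = { DXCORE_ADAPTER_ATTRIBUTE_D3D12_CORE_COMPUTE };
    ComPtr<IDXCoreAdapterList> adapters;
    factory->CreateAdapterList(_countof(attributes), attributes,
                               IID_PPV_ARGS(&adapters));

    for (uint32_t i = 0; i < adapters->GetAdapterCount(); ++i) {
        ComPtr<IDXCoreAdapter> adapter;
        adapters->GetAdapter(i, IID_PPV_ARGS(&adapter));

        // A compute-only (MCDM) device supports core compute but not the
        // graphics attribute, which distinguishes it from a full GPU.
        const bool computeOnly =
            adapter->IsAttributeSupported(DXCORE_ADAPTER_ATTRIBUTE_D3D12_CORE_COMPUTE) &&
            !adapter->IsAttributeSupported(DXCORE_ADAPTER_ATTRIBUTE_D3D12_GRAPHICS);

        if (computeOnly) {
            // From here an application would typically create a D3D12 device
            // and a DirectML device on this adapter.
        }
    }
    return 0;
}
```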

Here is a snapshot of the Samsung Gallery app utilizing DirectML via the MCDM architecture to execute inference on the NPU, captured through Task Manager:

[Screenshot: Task Manager showing NPU utilization while the Samsung Gallery app runs inference through DirectML]

 

Summary

Through these efforts, the DirectML + NPU infrastructure has been demonstrated for targeted key AI/ML workloads. This gives the many customers who already have interfaces built on top of DirectML with either the CPU (Central Processing Unit) or GPU (Graphics Processing Unit) a path to take advantage of the NPU for power-efficient, performant execution with minimal development or on-device overhead.

Another important benefit of this solution is that, as ML-capable NPUs and GPUs proliferate through the client PC ecosystem along with new ML-driven experiences, it will be important to fully leverage all of the hardware acceleration available on a device. Using DirectML and MCDM opens opportunities for parallel execution and load balancing among the CPU, GPU, and NPU.

More on ORT (ONNX Runtime) with DirectML, a deeper breakdown of Intel’s NPU architecture, and tooling will be coming soon.