
Intel Labs Contributes Key Technologies to New Intel Core Ultra and Intel Xeon Scalable Processors


Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.

Highlights

  • Intel Labs contributed new technologies to Intel’s newly introduced Intel® Core™ Ultra mobile processor and the 5th Gen Intel® Xeon® Scalable processor.
  • The technical innovations from Intel Labs add to the overall improvements in performance, power efficiency, isolation and security, and mitigation against side-channel attacks in the new processors.
  • For the Intel Core Ultra processor, Intel Labs contributed datapath and register file circuit technologies in the NPU, digital linear voltage regulators for powering compute cores, side-channel resistant AES technology for EnDebug, hash-based signatures to increase resistance to quantum computing attacks, and the redesign of VT-d infrastructure to scale XPUs. In addition, Intel Labs developed AEX-Notify for the latest Intel Xeon Scalable processor.

 

Intel Labs contributed new technologies to Intel’s newly introduced Intel® Core™ Ultra mobile processor and the 5th Gen Intel® Xeon® Scalable processor, adding to the overall improvements in performance, power efficiency, isolation and security, and mitigation against side-channel attacks.

The Intel Core Ultra mobile processor will deliver reimagined power efficiency, leading performance, and new artificial intelligence (AI) PC experiences. Developed under the code name Meteor Lake, the new Intel Core Ultra processors will power more than 230 AI PCs, which represent a new generation of personal computers. With dedicated AI acceleration capability spread across the central processing unit (CPU), graphics processing unit (GPU), and neural processing unit (NPU) architectures, Intel Core Ultra is the most AI-capable and power-efficient client processor in Intel’s history.

Developed under the code name Emerald Rapids, the 5th Gen Intel Xeon Scalable processor delivers more compute and faster memory at the same thermal design power (TDP) as the previous generation. With AI acceleration in every core, Intel Xeon processors are ready to handle demanding AI workloads — including inference and fine-tuning on models up to 20 billion parameters — before adding discrete accelerators.

 

Intel Core Ultra Processor (Meteor Lake) Contributions

For the Intel Core Ultra processor, Intel Labs contributed datapath and register file circuit technologies in the NPU, digital linear voltage regulators for powering compute cores, side-channel resistant AES technology for EnDebug, hash-based signatures to increase resistance to quantum computing attacks, and the redesign of VT-d infrastructure to scale XPUs.

 

Datapath and Register File Circuit Technologies in Neural Processing Unit

The Intel Core Ultra NPU incorporates floating-point dot-product datapath circuits based on research from Intel Labs. Machine learning workloads rely on deep neural networks in which matrix multiplication and convolution are the key power- and performance-limiting operations. Multiple techniques, including fused multiply-accumulate with global alignment feeding a fixed-point adder tree, provide up to 10% area reduction and 30% power reduction for the NPU compared to previous designs. Additional optimizations, such as local product alignment and a speculative OR-tree based maximum exponent, further improve circuit delay. Drawing on techniques from the research described in Optimized Fused Floating-Point Many-Term Dot-Product Hardware for Machine Learning Accelerators, the Intel Labs team worked jointly with the NPU team to evaluate and implement these improvements in the Meteor Lake NPU.
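
The core idea behind the fused dot-product datapath can be sketched in software: instead of normalizing and rounding after every multiply-add, all partial products are aligned once to a shared maximum exponent and summed in a wide fixed-point adder tree, with a single normalization at the end. The Python sketch below illustrates that flow; the bit width, rounding, and accumulation details are illustrative assumptions rather than the Meteor Lake NPU's actual datapath.

import math

def fused_dot_product(a, b, frac_bits=48):
    """Toy model of a fused many-term floating-point dot product.

    Partial products are aligned once to the global maximum exponent and
    accumulated in a single fixed-point adder tree, instead of performing a
    separate normalize/round step after every multiply-accumulate.
    Bit widths are illustrative, not the Meteor Lake NPU's actual datapath.
    """
    # Decompose each product into mantissa * 2**exponent.
    products = []
    for x, y in zip(a, b):
        m, e = math.frexp(x * y)          # x*y = m * 2**e, with |m| in [0.5, 1)
        products.append((m, e))

    # Global alignment: find the maximum exponent across all terms.
    max_e = max(e for _, e in products)

    # Shift every mantissa into a common fixed-point format and sum them
    # in one adder tree (here: a plain integer accumulation).
    acc = 0
    for m, e in products:
        shift = max_e - e
        acc += int(round(m * (1 << frac_bits))) >> shift

    # Normalize the final sum back to floating point just once.
    return acc / (1 << frac_bits) * (2.0 ** max_e)

if __name__ == "__main__":
    a = [0.5, -1.25, 3.0, 0.125]
    b = [2.0, 0.75, -0.5, 8.0]
    print(fused_dot_product(a, b), sum(x * y for x, y in zip(a, b)))

In hardware, collapsing the per-term normalization into a single final step is one of the ways such a datapath saves adder-tree area and power.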

Data flow optimization is key to improving the energy efficiency of the NPU. Data movement energy can dominate overall chip power consumption, but it can be mitigated through temporal and spatial reuse with smaller local memories rather than larger, more expensive levels of static random-access memory (SRAM). Efficient design of these local register files is therefore extremely important to reducing data movement power. Within a processing element (PE) of the NPU, there are three types of small local register files, with varying numbers of ports, sizes, and banks, that store input activations, weights, and output features. Cumulatively, these local register files can account for a significant share of PE power. Motivated by low-power synthesizable register file research originating from Intel Labs and described in 2.4GHz, Double-Buffered, 4kb Standard-Cell-Based Register File with Low-Power Mixed-Frequency Clocking for Machine Learning Accelerators, the Intel Labs team worked jointly with the NPU team to incorporate key circuit techniques such as write data bit-line gating, optimized sequential address patterns, and bit-packed multi-bit cells in the Meteor Lake NPU. This resulted in a 26% reduction in PE register file power compared to previous designs.
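
As a rough software analogy for one of those techniques, write data bit-line gating drives only the bit-lines whose stored value actually changes on a write, so workloads whose data varies slowly toggle far fewer cells. The toy register-file model below counts bit toggles as a stand-in for dynamic write power; the dimensions, access pattern, and names are illustrative assumptions, not the NPU's register file design.

import random

class GatedRegisterFile:
    """Toy model of write data bit-line gating in a small register file.

    On a write, only the bits that differ from the stored word are driven;
    unchanged bit-lines are gated off. Toggle count is used as a crude proxy
    for dynamic write power. Sizes are illustrative, not the NPU's.
    """
    def __init__(self, entries=32, width=16):
        self.width = width
        self.mem = [0] * entries
        self.toggles_gated = 0
        self.toggles_ungated = 0

    def write(self, addr, data):
        diff = self.mem[addr] ^ data
        self.toggles_gated += bin(diff).count("1")   # only changed bits driven
        self.toggles_ungated += self.width           # baseline drives every bit
        self.mem[addr] = data

    def read(self, addr):
        return self.mem[addr]

if __name__ == "__main__":
    random.seed(0)
    rf = GatedRegisterFile()
    # Sequential addresses with slowly varying data, loosely mimicking
    # activation and weight streaming within a PE.
    value = 0
    for i in range(1000):
        value ^= 1 << random.randrange(16)           # flip one bit per write
        rf.write(i % 32, value)
    print("gated toggles:", rf.toggles_gated, "ungated:", rf.toggles_ungated)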

 

Digital Linear Voltage Regulators Powering Compute Cores

Today’s high-performance processors must be extremely energy efficient, and this requires dynamically switching between high-performance mode (for example, turbo) and low-power and standby modes as quickly as possible as workload demands change. This advanced capability is handled in large part by on-die voltage regulators (VRs), which provide power to various compute cores and accelerators in the system-on-a-chip (SoC). Past Intel products have incorporated a fully-integrated voltage regulator (FIVR), which uses inductors that are integrated directly into the package. However, as client systems get thinner and more compact, packages must slim down as well, making it no longer possible to integrate the inductors. Other types of linear VRs that don’t require package-integrated inductors have been used in the industry to power digital domains such as compute cores. However, they typically can’t achieve the fast speed required to support the wide power/performance range in today’s processors. As a result, the overall efficiency of the system suffers.

To address this challenge, Intel Labs developed and demonstrated an innovative digital, high-speed on-die linear voltage regulator that does not require any on-package components and delivers faster response than FIVR and other digital linear voltage regulators (DLVRs). This technology features a unique computational approach that results in more than 20x faster settling time (important when quickly changing between different performance modes) and better power efficiency for battery life savings. Intel Labs developed the initial silicon proof of concept of this technology, and the results were published in some of the top conferences and journals in the circuit design field, including the 2019 Symposium on VLSI Circuits and the IEEE Journal of Solid-State Circuits in 2020. This DLVR technology is now integrated into the Intel Core Ultra processor, where it powers all compute cores (Performance and Efficient), enabling per-core dynamic voltage and frequency scaling (DVFS). The DLVR enables Intel Core Ultra to deliver better performance per watt within thin package limits, making the technology a key contributor to the most power-efficient client processor Intel has ever made.
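
As a way to picture how such a regulator behaves, a digital linear regulator can be viewed as a sampled feedback loop that compares the output voltage against a reference every clock cycle and enables or disables an array of small power-transistor legs accordingly; the faster that loop reacts, the quicker the rail settles after a mode change. The toy loop below is a generic digital-LDO sketch with made-up constants and a simple proportional update rule, not Intel Labs' actual design.

def digital_ldo(v_ref=0.75, v_in=1.0, cycles=200):
    """Toy sampled-feedback model of a digital linear voltage regulator.

    Each clock cycle, an error term decides how many parallel power-transistor
    "legs" to enable; more legs source more current into the output capacitor.
    All constants and the control law are illustrative assumptions.
    """
    legs_on, total_legs = 0, 64
    leg_current = 0.05          # amps per enabled leg (assumed)
    c_out = 1e-6                # output capacitance in farads (assumed)
    dt = 1e-8                   # 100 MHz control clock (assumed)
    v_out, load = 0.0, 1.0      # start-up into a 1 A load
    trace = []
    for cycle in range(cycles):
        if cycle == 100:
            load = 2.5          # load step: core jumps to a high-power mode
        error = v_ref - v_out
        # Proportional digital control: adjust the number of enabled legs.
        legs_on = max(0, min(total_legs, legs_on + int(error * 200)))
        i_net = legs_on * leg_current - load
        v_out = min(v_in, max(0.0, v_out + i_net * dt / c_out))
        trace.append(round(v_out, 4))
    return trace

if __name__ == "__main__":
    print(digital_ldo()[95:110])   # watch the rail recover after the load step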

 

Side-Channel Resistant AES Technology for EnDebug Remote Debug Feature

EnDebug is a remote encrypted debug feature that enables a secure connection between an Intel test facility and a customer CPU die, allowing Intel engineers to perform debug and fault analysis on remote systems outside their immediate physical location. These may include machines housed in remote factory settings or in customer data centers. When the device under debug is at a remote location, all communications between the debug test system and the CPU's test access port (TAP) must be encrypted using the Advanced Encryption Standard (AES) to protect the valuable intellectual property (IP) contained in the debug streams. A secure debug session begins with authentication and key exchange between the test facility and the remote CPU, after which a secure channel is established between the TAP and the trusted environment. Because the device is not in Intel's physical possession, side-channel leakage in the form of current traces or electromagnetic (EM) emissions from the die can expose secret keys, opening Intel-only channels and features to a malicious attacker and prompting the need for a side-channel attack resilient AES implementation.
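
Conceptually, once the session key has been agreed, every TAP command and response crossing the wire is an AES-protected message. The short Python sketch below models that kind of channel using AES-GCM from the cryptography package as a stand-in; the message framing, key handling, and function names are illustrative assumptions, not the EnDebug protocol.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Illustrative stand-in for the session key agreed during authentication
# and key exchange between the test facility and the remote CPU.
session_key = AESGCM.generate_key(bit_length=256)
channel = AESGCM(session_key)

def send_tap_command(command: bytes):
    """Encrypt a debug TAP command before it leaves the test facility."""
    nonce = os.urandom(12)                      # unique per message
    return nonce, channel.encrypt(nonce, command, b"endebug-tap")

def receive_tap_command(nonce: bytes, ciphertext: bytes) -> bytes:
    """Decrypt and authenticate the command at the device under debug."""
    return channel.decrypt(nonce, ciphertext, b"endebug-tap")

if __name__ == "__main__":
    nonce, wire_bytes = send_tap_command(b"READ_SCAN_CHAIN 0x1f")
    print(receive_tap_command(nonce, wire_bytes))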

EnDebug provides a lightweight solution that does not impose significant area or power overheads on the die. Conventional side-channel resistant AES hardware accelerators incur 2x area/power overheads compared to an unprotected AES implementation. In contrast, Intel Core Ultra includes a low-area-overhead side-channel resistant AES using a novel heterogeneous composite-field substitution box (Sbox) technology developed and prototyped at Intel Labs. The heterogeneous composite-field AES technology implements a pair of heterogeneous Sboxes with a randomized dataflow, allowing an incoming data byte to be processed at random by either one of the Sboxes every clock cycle. The random dataflow disrupts correlations between the measured power/EM traces and the attacker's power model, providing a mitigation against correlation power attacks (CPA). This technology offers side-channel leakage protection against attacks using up to 12 million encryption traces, while limiting area overhead to 28% (3x lower than conventional side-channel resistant AES).
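
The behavioral effect of the randomized dataflow can be sketched in a few lines: both Sboxes compute the same SubBytes function, but each incoming byte is routed to one of them at random, so the physical circuit that processes any given byte (and hence its power/EM signature) is unpredictable to an attacker. In the sketch below the two "instances" share one reference implementation for simplicity; in the actual hardware they are distinct heterogeneous composite-field circuits with different power profiles.

import secrets

def gf_mul(a, b):
    """Multiply two GF(2^8) elements modulo the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def aes_sbox(x):
    """Reference AES SubBytes: inverse in GF(2^8) followed by the affine map."""
    inv = 0
    if x:
        for c in range(256):
            if gf_mul(x, c) == 1:
                inv = c
                break
    res = 0
    for i in range(8):
        bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) ^
               (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        res |= bit << i
    return res

# Two functionally identical Sbox "circuits". In hardware these would be
# heterogeneous composite-field implementations; here they share one
# reference function for simplicity.
SBOX_A = aes_sbox
SBOX_B = aes_sbox

def sub_bytes_randomized(state):
    """Route each byte to a randomly chosen Sbox instance every cycle."""
    return bytes((SBOX_A if secrets.randbits(1) else SBOX_B)(b) for b in state)

if __name__ == "__main__":
    block = bytes(range(16))
    print(sub_bytes_randomized(block).hex())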

The specific choice of composite Galois-field arithmetic strongly influences the implementation of the Sbox module (the most area- and power-hungry block of AES) and the circuit realizations of components such as multipliers, squaring units, and affine blocks. Explorations of various composite-field polynomial implementations indicate a 1.92x spread in power consumption between the maximum- and minimum-power polynomials.

 

Progress Toward Quantum Resistance in Intel Core Ultra

Quantum computing accelerates the computation of certain types of algorithms, giving it the potential to solve some of the world’s most intractable problems in materials science, chemical engineering, and more. This grand potential, however, comes with a caveat that cannot be ignored: quantum computers will be able to break much of the cryptography currently used in our worldwide digital infrastructure. Classical public key cryptosystems such as RSA (Rivest-Shamir-Adleman) will be broken by quantum computers, compromising applications such as code signing and authentication.

Intel Labs is taking steps to protect our platforms from quantum attacks. We investigated hash-based signatures to increase the robustness of code signing and authentication. The security of hash-based signatures relies on well-known properties of hash algorithms such as preimage and collision resistance. XMSS (eXtended Merkle Signature Scheme) and LMS (Leighton-Micali Signature) are two hash-based signature algorithms standardized by the Internet Engineering Task Force (IETF) and approved by the National Institute of Standards and Technology (NIST). Intel Core Ultra is the first platform on which Intel is starting the transition to make key Intel assets resilient to quantum attacks.
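
To make the idea concrete, the sketch below implements a Lamport one-time signature, the simplest ancestor of the Merkle-tree-based LMS and XMSS schemes: its security rests only on the preimage resistance of the hash function, the property that quantum algorithms are not known to break. This is an illustration of the principle, not the standardized LMS/XMSS constructions, and a real deployment would never reuse a one-time key.

import hashlib, os

def H(data):
    return hashlib.sha256(data).digest()

def keygen():
    """One-time key: 256 pairs of random secrets; the public key is their hashes."""
    sk = [[os.urandom(32), os.urandom(32)] for _ in range(256)]
    pk = [[H(s0), H(s1)] for s0, s1 in sk]
    return sk, pk

def sign(sk, message):
    """Reveal one secret per message-digest bit (single-use signature)."""
    digest = H(message)
    bits = [(digest[i // 8] >> (i % 8)) & 1 for i in range(256)]
    return [sk[i][b] for i, b in enumerate(bits)]

def verify(pk, message, signature):
    """Hash each revealed secret and compare against the committed public key."""
    digest = H(message)
    bits = [(digest[i // 8] >> (i % 8)) & 1 for i in range(256)]
    return all(H(sig) == pk[i][b] for i, (b, sig) in enumerate(zip(bits, signature)))

if __name__ == "__main__":
    sk, pk = keygen()
    firmware = b"signed code image v1.0"
    sig = sign(sk, firmware)
    print(verify(pk, firmware, sig), verify(pk, b"tampered image", sig))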

 

Redesign of VT-d Infrastructure in Intel Core Ultra Provides Highly Scalable Solution for Modern XPUs

Virtualization Technology for Directed I/O (VT-d 4.1) is a critical feature that provides isolation and security of platform memory from accelerators (XPUs) and other platform devices. To address the needs of emerging workloads, platforms are adding XPUs and becoming more heterogeneous. Since all memory accesses from XPUs need to be checked or translated by an input-output memory management unit (IOMMU), prior platforms provided a dedicated IOMMU for each major XPU that required high performance. However, increasing the number of IOMMUs along with the number of XPUs is not a scalable or efficient solution for the long term.


Figure 1. The hardware block that provides VT-d functionality (e.g., address translation) is shown in the diagram as IOMMU. With Intel Core, each high-performance XPU has a dedicated IOMMU. With the redesign in Intel Core Ultra, each high-performance XPU uses a dedicated ATC (and PCIe ATS) to reach higher bandwidth with better area/power efficiency. The integrated GPU has special requirements and is therefore serviced by a dedicated G-IOMMU. All other XPUs and devices share the D-IOMMU. The HPA isolation boundary is represented by the blue block surrounding the ATC.

 

The Intel Core Ultra platform rearchitected the VT-d infrastructure so that key XPUs get significantly higher address translation capability, allowing them to achieve much higher memory bandwidth. As part of this innovation, each XPU that previously had a dedicated IOMMU replaced it with an address translation cache (ATC). The ATC uses the peripheral component interconnect express (PCIe) address translation service (ATS) protocol to communicate with the appropriate IOMMU in the IO-Agent and fetch or store the required address translations.

ATS provides a scalable and distributed solution for meeting the address translation needs of modern XPUs. However, it comes with an inherent security concern. ATC hardware is considered part of the XPU, and today’s system software does not trust the ATC as much as it trusts the IOMMU: a malicious or buggy ATC could use a physical address different from the one provided by the IOMMU (via the ATS protocol) and thereby attack system software. With Intel Core Ultra, root-complex integrated XPUs use a secure-by-design approach to implement a host physical address (HPA) isolation boundary.

The key properties of the HPA isolation boundary are:

  1. All memory accesses from the XPU must go through the HPA isolation boundary, where appropriate address translation is provided by ATC.
  2. The HPA isolation boundary does not provide any physical addresses to the XPU.

Root-complex integrated XPUs that implement an HPA isolation boundary are identified to system software via the system-on-a-chip integrated device property (SIDP) reporting structure. This approach mitigates the security concern and allows system software to enable ATS on root-complex integrated XPUs and achieve ideal performance.
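
A toy model of that flow: the XPU only ever issues accesses in its untranslated I/O virtual address space, while the HPA isolation boundary owns the ATC, performs the ATS request to the IOMMU on a miss, and issues the host-physical access itself, so no host physical address ever reaches the XPU. The class and method names below are illustrative, not VT-d interfaces.

class IOMMU:
    """Holds the authoritative IOVA -> HPA translations set up by system software."""
    def __init__(self, page_table):
        self.page_table = page_table

    def ats_translate(self, iova_page):
        return self.page_table[iova_page]          # raises KeyError if unmapped

class HPAIsolationBoundary:
    """Sits between the XPU and memory: owns the ATC, never exposes HPAs."""
    def __init__(self, iommu, memory):
        self.iommu, self.memory, self.atc = iommu, memory, {}

    def access(self, iova, data=None):
        page, offset = iova >> 12, iova & 0xFFF
        if page not in self.atc:                   # ATC miss: fetch via ATS
            self.atc[page] = self.iommu.ats_translate(page)
        hpa = (self.atc[page] << 12) | offset      # HPA stays inside the boundary
        if data is None:
            return self.memory.get(hpa, 0)
        self.memory[hpa] = data

if __name__ == "__main__":
    memory = {}
    iommu = IOMMU(page_table={0x10: 0x80000})      # IOVA page 0x10 -> HPA page 0x80000
    boundary = HPAIsolationBoundary(iommu, memory)
    boundary.access(0x100AB, data=42)              # XPU writes using an IOVA only
    print(boundary.access(0x100AB))                # 42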

The VT-d architecture added support for scalable mode operation in revision 3.0. Many new features such as nested translations (enabling intra-VM isolation) and shared virtual memory (SVM) (enabling efficient collaboration between CPU and XPU) require scalable mode and are not supported in legacy mode. To provide software with a consistent view across Intel’s server and client platforms, Intel Core Ultra IOMMUs add support for scalable mode operation. This will allow software using scalable mode to run seamlessly on Intel’s server and client platforms, avoiding the need to maintain separate software stacks for the two platforms. The software consistency between server and client platforms will also help reduce time-to-market for original equipment manufacturers (OEMs) on features such as nested translation when they become available in future client products.

The VT-d architecture allows a virtual machine manager (VMM) to remap interrupts coming from XPUs. This gives VMMs flexibility in isolating interrupts by controlling attributes of interrupts from XPUs, and migrating interrupts when the target of the interrupt has migrated to a different physical CPU. Intel Core Ultra is the first client product to support the posted interrupt feature, which allows XPUs assigned to VMs to use a virtual vector space, greatly increasing the scalability of the limited interrupt vector space and improving interrupt processing performance. In the absence of posted interrupts, when the IOMMU receives an interrupt from an XPU, the interrupt is sent to the VMM for processing. This requires the VMM to inject the corresponding virtual interrupt into the VM. With the posted interrupt capability, the IOMMU hardware can:

  1. Deliver the interrupt directly to the VM (without VMM intervention) if the VM is running.
  2. Post the interrupt in a memory structure if the VM is not running (accumulating interrupts for later processing).

This policy of adapting to the VM’s state reduces overall interrupt latency and overhead associated with virtualizing interrupts.
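
In rough pseudocode terms, the per-interrupt decision the IOMMU hardware makes looks like the sketch below; the descriptor layout and names are illustrative, not the actual VT-d posted-interrupt format.

from dataclasses import dataclass, field

@dataclass
class VM:
    name: str
    running: bool
    posted_descriptor: list = field(default_factory=list)  # memory-resident structure
    delivered: list = field(default_factory=list)

def iommu_remap_interrupt(vm: VM, vector: int):
    """Toy model of posted-interrupt handling in the IOMMU."""
    if vm.running:
        # Case 1: inject the virtual vector directly, no VMM exit required.
        vm.delivered.append(vector)
    else:
        # Case 2: accumulate the interrupt in the posted-interrupt descriptor.
        vm.posted_descriptor.append(vector)

def vm_resume(vm: VM):
    """When the VM is scheduled again, drain everything posted while it slept."""
    vm.running = True
    vm.delivered.extend(vm.posted_descriptor)
    vm.posted_descriptor.clear()

if __name__ == "__main__":
    guest = VM("guest-0", running=False)
    iommu_remap_interrupt(guest, vector=0x45)   # arrives while the VM is descheduled
    iommu_remap_interrupt(guest, vector=0x46)
    vm_resume(guest)
    iommu_remap_interrupt(guest, vector=0x47)   # arrives while the VM is running
    print(guest.delivered)                      # [0x45, 0x46, 0x47]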

 

5th Gen Intel Xeon Scalable Processor (Emerald Rapids) Contributions

Confidential computing with trusted execution environments (TEEs) helps protect data and AI models. Intel Labs has contributed AEX-Notify to Intel® Software Guard Extensions (Intel® SGX), which provides application isolation and enhances data protection in use.

 

AEX-Notify Hardware Feature in Intel SGX

Intel Labs contributed AEX-Notify to the latest Intel Xeon Scalable processor. AEX-Notify is a new hardware feature in Intel SGX that allows Intel SGX enclaves to receive guaranteed notifications after asynchronous enclave exits, such as interrupts or exceptions. Intel SGX supports the creation of shielded enclaves within unprivileged processes. While enclaves are architecturally protected against malicious system software, Intel SGX’s privileged attacker model could expose enclaves to side-channel attacks that use privileged timer interrupts to step through enclave execution exactly one instruction at a time.

AEX-Notify makes enclaves interrupt aware: enclaves can register a trusted software handler to run after an interrupt or exception. AEX-Notify can be used as a building block for implementing software countermeasures against different types of interrupt- or exception-based attacks. It addresses fine-grained execution control techniques that can be used to amplify side-channel attacks by providing a hook for software to intercept and respond to enclave exits. The Intel SGX SDK for Linux has been updated to incorporate an efficient AEX-Notify software handler that prevents the next enclave application instruction from being interrupted or triggering an exception. The effectiveness of AEX-Notify in mitigating attacks was published in AEX-Notify: Thwarting Precise Single-Stepping Attacks Through Interrupt Awareness for Intel SGX Enclaves, a research paper presented at the 2023 USENIX Security Symposium.
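
The defensive pattern can be illustrated with a small simulation: an attacker who interrupts after every instruction observes every instruction boundary, but with a registered handler the enclave regains control after each exit and executes the next instruction without an attacker-visible boundary, so precise single-stepping collapses. The simulation below is a conceptual sketch only, not the SGX SDK handler.

def run_enclave(instructions, aex_handler=None):
    """Toy model of single-stepping an enclave with and without AEX-Notify.

    The attacker interrupts after every instruction. Without a handler, each
    resume exposes exactly one instruction boundary. With a registered handler,
    the enclave regains control after the exit and executes the next
    instruction before the attacker can interrupt again.
    This is a conceptual sketch, not the actual SGX SDK mitigation.
    """
    pc, observed_boundaries = 0, 0
    while pc < len(instructions):
        if aex_handler:
            pc = aex_handler(instructions, pc)  # trusted handler runs first
        if pc < len(instructions):
            pc += 1                             # interrupted instruction resumes
        observed_boundaries += 1                # attacker observes one AEX
    return observed_boundaries

def atomic_step_handler(instructions, pc):
    """Execute the next instruction without an attacker-visible boundary."""
    return min(pc + 1, len(instructions))

if __name__ == "__main__":
    program = ["load", "xor", "store", "load", "add", "store"]
    print("without AEX-Notify:", run_enclave(program))
    print("with AEX-Notify:   ", run_enclave(program, atomic_step_handler))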

AEX-Notify is also available via platform updates on all previous generation Xeon CPUs that support Intel SGX.

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness for Intel’s leading-edge research activities, such as AI, neuromorphic computing, and quantum computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint-research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of five children. Scott has over 23 years of experience in the computing industry bringing new products and technology to market. During his 15 years at Intel, he has worked in a variety of roles spanning R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott has an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.