Intel Labs Continues Focused Research and Standards Efforts to Make FHE Viable

Ro_Cammarota · ‎05-31-2023

Rosario Cammarota (Ro) is a principal engineer in the Emerging Security Lab at Intel Labs. He is the principal investigator for the DARPA DPRIVE program and Intel academic centers focusing on privacy, cryptography, and security mechanisms.

Highlights:

Intel Labs is midway through Phase 2 in the DARPA DPRIVE program to design HERACLES, a fully homomorphic encryption accelerator for overcoming computation overhead barriers.
Intel’s DARPA DPRIVE principal investigator presented the keynote at the 6th HomomorphicEncryption.org Standards Meeting in Seoul.
Intel and its industry and academic collaborators presented two homomorphic encryption research papers, plus four papers on security and privacy and on edge AI, at GOMACTech 2023.

Intel continues to make strides in overcoming the computational overhead barrier in fully homomorphic encryption (FHE). Intel is midway through Phase 2 in designing HERACLES, an FHE hardware accelerator for the Data Protection in Virtual Environments (DPRIVE) program, which is sponsored by the Defense Advanced Research Projects Agency (DARPA). Additionally, through its ongoing efforts to advance FHE standards and conduct other cutting-edge FHE research with industry and academic collaborators, Intel is striving to make homomorphic encryption a viable method for analyzing encrypted data without decryption.

Continuing Phase 2 of DARPA DPRIVE

Design efforts are well under way for Phase 2 in the DPRIVE program. Our technical team, composed of Intel Labs and Security Architecture and Engineering, successfully completed Phase 1 in fall 2022, and our work was selected from a competitive field of Phase 1 DARPA DPRIVE participants to complete the FHE design over the next 15 months in Phase 2.

Our work continues on Intel HERACLES, a new type of near memory computer architecture with tightly connected functional units and distributed memory that bridges the performance gap with cleartext computation, enabling the benefits of FHE deployment in next generation security solutions. Once HERACLES is developed, Microsoft, Intel’s key cloud ecosystem and homomorphic encryption partner, will lead the commercial adoption of the technology by working with the U.S. government to test the accelerator in cloud applications, including Microsoft Azure. Seoul National University in South Korea recently joined our team to advance approximate homomorphic encryption theory, algorithms, and applications on emerging computer architectures, such as HERACLES.

Adoption of FHE is currently impractical due to the enormous computation time required to perform even simple operations, according to DARPA. Performers in the DPRIVE program are developing FHE hardware accelerators with the goal of reducing computational run-time overhead by many orders of magnitude compared to software-based FHE computations on conventional CPUs. FHE enables software applications to process encrypted data without decryption. Processing encrypted data has the potential to raise the bar of confidentiality in existing security solutions by both protecting data owners’ privacy and greatly reducing the risk of third-party data leakage.
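
To make the idea of computing on encrypted data concrete, here is a minimal Python sketch of a toy, additively homomorphic scheme in the style of learning-with-errors (LWE) encryption. It is not HERACLES code and it is not a full FHE scheme: it supports only addition, and the parameters (N, Q, T, DELTA) and function names are illustrative assumptions chosen for readability. They are far too small to be secure.

```python
import random

# Toy parameters (assumptions for illustration only; NOT secure).
N = 16           # length of the secret vector
Q = 2 ** 15      # ciphertext modulus
T = 16           # plaintext modulus: messages live in 0..15
DELTA = Q // T   # scaling factor that lifts a message above the noise

_rng = random.Random(1)
SECRET = [_rng.randrange(2) for _ in range(N)]


def _inner(a):
    """Inner product of a ciphertext mask with the secret key."""
    return sum(x * s for x, s in zip(a, SECRET))


def encrypt(m):
    """Hide DELTA * m behind a random mask plus a small noise term."""
    a = [_rng.randrange(Q) for _ in range(N)]
    e = _rng.randint(-4, 4)
    b = (_inner(a) + e + DELTA * (m % T)) % Q
    return a, b


def add(c1, c2):
    """Add two ciphertexts component-wise; no key is needed."""
    a = [(x + y) % Q for x, y in zip(c1[0], c2[0])]
    return a, (c1[1] + c2[1]) % Q


def decrypt(c):
    """Remove the mask, then round away the accumulated noise."""
    a, b = c
    noisy = (b - _inner(a)) % Q
    return round(noisy / DELTA) % T


c_sum = add(encrypt(3), encrypt(4))   # computed without ever decrypting
print(decrypt(c_sum))                 # 7
```

The party calling `add` never sees the secret or the plaintexts. Real FHE schemes also support multiplication and must manage the noise that grows with every operation, which is a large part of the overhead discussed here.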

Intel HERACLES Mitigates Computation Overheads

Slow FHE performance is caused by two factors: (1) a massive increase in the compute overheads for simple arithmetic operations (as much as six orders of magnitude compared with cleartext operations) and (2) a dramatic increase in data movement overheads due to operands and metadata that increase working set size by orders of magnitude. The HERACLES accelerator mitigates these overheads by adopting radically different approaches. From a computer science standpoint, HERACLES natively processes ring-polynomial arithmetic, a fundamental data type for modern FHE computation. From an engineering standpoint, HERACLES offers a massively parallel data path to address the compute overheads, a structured data fabric that increases memory bandwidth, and data movement strategies that take advantage of FHE application structure. Compute elements incorporate an optimized arithmetic word size to accelerate the arithmetic, reduce ciphertext management operations (mitigating the compute explosion problem), and accommodate precision requirements for approximate homomorphic encryption. HERACLES also couples these with a set of software-driven memory optimizations, including cache hierarchies that can perform shorter homomorphic encryption operations in memory while longer operations are executed in the compute elements. This streamlines the execution flow and provides further opportunities for reducing data size explosion.
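
Ring-polynomial arithmetic means arithmetic on polynomials whose coefficients are reduced modulo an integer q and whose degree is reduced modulo a fixed polynomial. A minimal sketch follows, assuming the ring Z_q[x]/(x^n + 1) that is common in lattice-based FHE schemes. The source does not say which ring or algorithm HERACLES uses, and the tiny parameters and function names below are hypothetical. Production implementations typically use number-theoretic transforms rather than this quadratic-time schoolbook loop.

```python
def poly_add(a, b, q):
    """Coefficient-wise addition in Z_q[x]/(x^n + 1)."""
    return [(x + y) % q for x, y in zip(a, b)]


def poly_mul(a, b, q):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1).

    Because x^n = -1 in this ring, a product term whose degree reaches n
    wraps around to degree (i + j - n) with its sign flipped.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same number of coefficients")
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                out[k - n] = (out[k - n] - ai * bj) % q
    return out


Q = 17                       # toy modulus (assumption)
x1 = [0, 1, 0, 0]            # the polynomial x, lowest degree first
x3 = [0, 0, 0, 1]            # the polynomial x^3
print(poly_mul(x1, x3, Q))   # [16, 0, 0, 0], i.e. x^4 = -1 mod 17
```

Every coefficient of the result depends on every coefficient of both inputs, which gives a sense of why real FHE parameters, with thousands of coefficients per polynomial, stress both compute and memory bandwidth.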

Figure 1. Diagram of the HERACLES compute engine. Image credit: Intel Labs.

Currently, HERACLES delivers more than three orders of magnitude aggregate speedup on full workloads. By the end of the DPRIVE program in 2024, we expect that our fully featured HERACLES — optimized for use in the cloud with added reliability, availability, and serviceability (RAS) features and root of trust (RoT) — will deliver an additional order of magnitude or more of performance improvement. Future FHE applications will allow data sharing and collaboration in finance, insurance, and healthcare while maintaining compliance with privacy regulations.

Advancing Homomorphic Encryption

In March, I was honored to present the keynote speech at the 6th HomomorphicEncryption.org Standards Meeting in Seoul, hosted at Seoul National University. This open industry, government, and academic consortium is collaborating to advance the Homomorphic Encryption Standard. Today, this standard provides scheme descriptions, a detailed explanation of their security properties, and tables for secure parameters. Future versions of the standard may describe a standard API and a programming model for homomorphic encryption.

Figure 2. Rosario Cammarota presenting the keynote speech, “Sun Never Sets on Intel Privacy Research,” at the 6th HomomorphicEncryption.org Standards Meeting in Seoul. Image credit: Sejun Kim, Intel Labs.

Additionally, working with industry and academic collaborators from 26 contributing countries, Intel, as a member of the U.S. delegation, is leading the creation of FHE International Standards (known as ISO/IEC WD 18033-8-FHE) on encrypted data processing primitives for FHE. The international standard defines the security model, assumptions, message spaces, ciphertext spaces, key spaces, formats, and cryptographic mechanisms for the schemes suitable for standardization.

Homomorphic encryption has never advanced at a faster pace, thanks to revolutionary approaches to computing platforms and to advances in theory, algorithms, and applications that are evolving side by side with international standards and academic research.

Academic and Industry Research Presented at GOMACTech

Through Intel-sponsored research centers and industry organizations such as the Semiconductor Research Corporation (SRC), we are advancing the fields of post-quantum cryptography, lightweight cryptography, and encrypted data processing. In the past two years alone, more than 30 summer and extended interns have participated in Intel’s HE internship program. We continue to collaborate with the academic community on HE research. In March at GOMACTech 2023, the largest U.S. government conference on microelectronics, Intel’s Chris Wilkerson, principal engineer in Intel Labs and computer architect in the DPRIVE program, presented two research papers on hardware for FHE: Intel HERACLES and HEM, a joint project with UCSD. Additionally, four research papers on joint projects with university and industry collaborators on cryptography, security and privacy, and edge-AI were presented.

In total, the following six papers were presented at GOMACTech:

Intel HERACLES Homomorphic Encryption Revolutionary Accelerator with Correctness for Learning-oriented End-to-End Solutions

Intel: Chris Wilkerson, Sachin Taneja, Jeremy Casas, Wen Wang, Duhyeong Kim, Rosario Cammarota, Raghavan Kumar, Sanu Mathew, Jin Yang, Michael Steiner, Huijing Gong, Poornima Lalwaney, Adish Vartak, Vasantha Srirambhatla, Sandeep Jain, AppaRao Challagundla, and Charlotte Bonte (at Intel at time of submission, now at Zama).

FHE enables software applications to process encrypted data without decryption. Processing encrypted data has the potential to elevate the bar of confidentiality in existing security solutions by both protecting data owner’s privacy and greatly reducing the risk of third-party data leakage. Unfortunately, FHE in software comes with prohibitive overheads, increasing latency by as much as six orders of magnitude on existing CPUs. Intel HERACLES is a new type of near memory computer architecture with tightly connected functional units and distributed memory, that bridges the performance gap, enabling the benefits of FHE deployment in next generation security solutions.

HEM: Memory-based Acceleration for Fully Homomorphic Encryption

UC San Diego: Minxuan Zhou, Pranav Gangwar, Yujin Nam, Arpan Dutta, and Tajana Rosing; Intel: Chris Wilkerson and Rosario Cammarota; IBM: Saransh Gupta.

FHE is a promising technique that enables arbitrary computations on encrypted data, securing many emerging cloud-based applications. However, FHE introduces significant computation overhead: it is usually several orders of magnitude slower than computation on plain data due to the explosion of both data and computation after encryption. Existing accelerators for FHE rely on large on-chip scratchpads to match the throughput of on-chip processing elements. However, the performance of these accelerators is still bounded by the off-chip memory bandwidth. The memory-bound issue is challenging because further increasing the on-chip scratchpad and memory bandwidth is not area- and energy-efficient. In this work, we propose a new FHE accelerator based on memory-based computing in emerging DRAM. The proposed memory-based accelerator exploits the highly parallel in-memory operations with specialized near-memory components to optimize the computation and data transfer patterns of FHE applications. In addition to the hardware design, we formulate the problem of mapping FHE programs onto the memory-based accelerator and propose a compiler-level optimization framework to generate an efficient data layout. We evaluate the efficiency of the proposed design on widely used FHE applications for machine learning. Our evaluation shows that the proposed accelerator can provide up to 8.29× more throughput than state-of-the-art FHE accelerators while consuming 2.83× less power and 3.84× less chip area.

Secure AI Hardware By Design: From Cryptographic Proofs to Silicon Tape-Out

North Carolina State University: Anuj Dubey and Aydin Aysu; Intel: Rosario Cammarota.

Machine learning (ML) and artificial intelligence (AI) applications are increasingly being used in critical cyberinfrastructure. Therefore, trusted execution with AI/ML hardware is becoming a fundamental requirement for next-generation applications. Specifically, the data privacy and intellectual property protection of AI/ML models is of primary importance. New attacks have emerged that can steal this information through side-channels via observing unintentional hardware leakages such as power consumption, electromagnetic radiation, and execution time. Although such attacks and related defenses were known for cryptographic applications, their extensions to AI/ML frameworks are unknown and non-trivial. In our research, we have demonstrated new side-channel attacks on AI/ML hardware and proposed novel defenses to mitigate the vulnerability. Our solution is full stack — it encompasses all abstraction levels starting from theoretical proofs of security all the way down to a silicon chip tape-out. We developed cryptographic proofs for security, mapped those provable systems into custom hardware units, integrated them into a RISC-V architecture and micro-architecture, enhanced the compiler flow for custom instruction integration, implemented and taped-out the design on a 130 nm technology node, and demonstrated both practical and theoretical security.

Robust and Efficient Genome Sequence Matching on Emerging Processing In-Memory Platform

UC Irvine: Zhuowen Zou, Hanning Chen, and Mohsen Imani; University of Maryland: Prathyush Poduval; Intel: Rosario Cammarota.

In this paper, we propose BioHD, a novel genomic sequence searching platform based on hyper-dimensional computing (HDC) for hardware-friendly computation. BioHD transforms the inherently sequential processes of genome matching into highly parallelizable computation tasks. We exploit HDC memorization to encode and represent the genome sequences using high-dimensional vectors. BioHD then combines the genome sequences to generate an HDC reference library. During sequence searching, BioHD performs an exact or approximate similarity check of an encoded query against the HDC reference library. Our framework simplifies the required sequence matching operations while introducing a statistical model to control the alignment quality. To take full advantage of BioHD’s inherent robustness and parallelism, we design a processing in-memory (PIM) architecture that offers massive parallelism and is compatible with existing crossbar memory. Our PIM architecture supports all essential BioHD operations natively in memory with minimal modification to the array. We evaluate BioHD accuracy and efficiency on a wide range of genomics data, including COVID-19 databases. Our results indicate that PIM provides 102.8× and 116.1× (9.3× and 13.2×) speedup and energy efficiency compared to the state-of-the-art pattern matching algorithm running on a GeForce RTX 3060 Ti GPU (state-of-the-art PIM accelerator).
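
The encode, bundle, and search flow described in this abstract can be sketched in a few lines of Python. This is a generic hyperdimensional computing illustration, not the BioHD implementation: the dimensionality, k-mer length, rotation-based position encoding, reference string, and all names are assumptions, and only exact k-mer matching is shown.

```python
import random

DIM = 4096   # hypervector dimensionality (illustrative assumption)
K = 8        # k-mer length (illustrative assumption)

_rng = random.Random(0)
# One random bipolar (+1/-1) hypervector per nucleotide.
BASE = {c: [_rng.choice((-1, 1)) for _ in range(DIM)] for c in "ACGT"}


def rotate(v, r):
    """Cyclically shift v by r positions to mark a position in the k-mer."""
    r %= len(v)
    return v[-r:] + v[:-r] if r else list(v)


def encode_kmer(kmer):
    """Bind the position-rotated base vectors by element-wise product."""
    hv = [1] * DIM
    for pos, base in enumerate(kmer):
        hv = [x * y for x, y in zip(hv, rotate(BASE[base], pos))]
    return hv


def build_library(reference):
    """Bundle (sum) the hypervectors of all k-mers in the reference."""
    lib = [0] * DIM
    for i in range(len(reference) - K + 1):
        for d, x in enumerate(encode_kmer(reference[i:i + K])):
            lib[d] += x
    return lib


def similarity(query, lib):
    """Normalized dot product: near 1 for a stored k-mer, near 0 otherwise."""
    return sum(x * y for x, y in zip(encode_kmer(query), lib)) / DIM


REFERENCE = "ACGTTGCAAGCTTAGGCTAACGTC"            # made-up toy sequence
library = build_library(REFERENCE)
present = similarity(REFERENCE[5:13], library)   # k-mer taken from the reference
absent = similarity("GATTACAG", library)         # k-mer not in the reference
print(present > 0.5, absent < 0.5)
```

A query is answered with one dot product against the bundled library vector instead of a position-by-position scan, which suggests why this style of search maps well onto parallel, in-memory hardware.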

Brain-Inspired Neural Adaptation for Dynamic and Scalable Hyperdimensional Learning

UC Irvine: Zhuowen Zou and Mohsen Imani; University of Connecticut: Farhad Imani; UC Los Angeles: Haleh Alimohamadi; Intel: Rosario Cammarota.

In the Internet of Things (IoT) domain, many applications run machine learning algorithms to assimilate the data collected across swarms of devices. Sending all data to a powerful computing environment, e.g., the cloud, poses significant efficiency and scalability issues. A promising alternative is to distribute the learning tasks across the IoT hierarchy, often referred to as edge computing; however, existing sophisticated algorithms such as deep learning are often too complex to run on less powerful and unreliable embedded IoT devices. Hyperdimensional computing (HDC) is a brain-inspired learning approach for efficient and robust learning on today’s embedded devices. Encoding, or transforming the input data into a high-dimensional representation, is the key first step of HDC before performing a learning task. All existing HDC approaches use a static encoder; thus, they still require very high dimensionality, resulting in significant efficiency loss for edge devices with limited resources. In this paper, we have developed NeuralHD, a new HDC approach with a dynamic encoder for adaptive learning. Inspired by studies of human neural regeneration in neuroscience, NeuralHD identifies insignificant dimensions and regenerates those dimensions to enhance the learning capability and robustness. We also present a scalable learning framework to distribute NeuralHD computation over edge devices in IoT systems. Our solution enables edge devices to learn in real time from both labeled and unlabeled data. Our evaluation on a wide range of practical classification tasks shows that NeuralHD provides 5.7× and 6.1× (12.3× and 14.1×) faster and more energy-efficient training compared to the HD-based algorithms (DNNs) running on the same platform. NeuralHD also provides 4.2× and 11.6× higher robustness to noise in the unreliable network and hardware of IoT environments as compared to DNNs.

Benchmarking Cryptographic Engine for Upcoming NIST FIPS Standard LWC

Ohio State University: Eslam Yahya Tawfik and Islam Elsadek; Analog Devices: Doug Gardner, John Ross Wallrabenstein, and Erik MacLean; Intel: Rosario Cammarota and Sohrab Aftabjahani.

IoT and resource-constrained environments necessitate a new standard for cryptography, as current standards demand substantial resources and energy. Hence, NIST is running a standardization process for a lightweight cryptography (LWC) algorithm that fits in resource-constrained devices. The standardization process is concluding with 10 final candidates. The aim of this work is to benchmark the 10 candidates in a fair comparison using the same optimizations and architectures. The whole spectrum of designs and implementations (HW, HW/SW co-design, and SW using a resource-constrained RISC-V processor) is designed and evaluated for all candidates. The designs are implemented and fabricated using CMOS GF22FDx technology. Results show that the HW implementation improves throughput by up to 99,000× and energy efficiency by up to 57,000× compared to the SW implementation. Moreover, Xoodyak and TinyJambu are the most energy-efficient algorithms, while Sparkle and Xoodyak provide the highest throughput.

Approved for public release, distribution unlimited.