Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.
Highlights:
- The Thirteenth International Conference on Learning Representations (ICLR) runs from April 24th through April 28th in Singapore.
- Intel is pleased to present 10 main conference papers and five workshop papers at ICLR 2025.
- Intel’s contributions include the first speculative decoding framework for visual autoregressive models, which yields up to a 1.82x speed-up; the first training-free long-context Mamba, which enables perplexity improvements of up to 8000x; and a novel way to train for long context via long-distance referrals, among others.
- Intel is leading the organization of two workshops at this year’s conference: the first Workshop on Scalable Optimizations for Efficient and Adaptive Foundation Models (SCOPE) and the AI for Accelerated Materials Discovery (AI4Mat) Workshop.
The Thirteenth International Conference on Learning Representations (ICLR) runs from April 24th through April 28th in Singapore. Intel is pleased to present 10 works at the main conference and five workshop papers at ICLR 2025, many of which are collaborative efforts with university partners. Intel’s contributions include the first speculative decoding framework for visual autoregressive models, which yields up to a 1.82x speed-up; the first training-free long-context Mamba, which enables perplexity improvements of up to 8000x; and a novel way to train for long context via long-distance referrals, among others.
Intel is leading the organization of two workshops at this year’s conference:
- The Workshop on Scalable Optimizations for Efficient and Adaptive Foundation Models (SCOPE) is the first workshop created to capture advances in scalable, adaptive fine-tuning, calibration, and conversion methods that yield inference-efficient quadratic and sub-quadratic foundation models, with methodologies spanning vision, language, and multi-modal domains.
- The AI for Accelerated Materials Discovery (AI4Mat) Workshop takes a comprehensive look at automated materials discovery spanning AI-guided design, synthesis, and automated material characterization, and aims to create an opportunity for deep, thoughtful discussion among researchers and highlight ongoing challenges in the field.
Both workshops will be held on April 28th, 2025, and are co-located with the main conference in Singapore.
Read more about Intel’s contributions at ICLR 2025 below.
Main Conference Papers
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
Intel, Weizmann Institute of Science, and Texas A&M University
This paper introduces distributed speculative inference (DSI), a novel inference algorithm that is provably faster than speculative inference (SI) and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but they rely on sufficiently fast and accurate drafters, which are often unavailable in practice. This work identifies a gap: SI can be slower than non-SI if drafters are too slow or inaccurate. Researchers close this gap by proving that DSI is faster than both SI and non-SI given any drafters. DSI is therefore not only faster than SI, but also unlocks the acceleration of LMs for which SI fails. DSI leverages speculation parallelism (SP), a novel type of task parallelism, to orchestrate target and drafter instances that overlap in time, establishing a new foundational tradeoff between computational resources and latency. Simulations show that DSI is 1.29-1.92x faster than SI in single-node setups for various off-the-shelf LMs and tasks.
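To make the baseline concrete, here is a minimal, hedged sketch of the standard speculative-inference acceptance step that DSI builds on. It is illustrative only and not the authors' implementation; DSI's contribution is to orchestrate many such drafter/target instances in parallel (speculation parallelism) so that a slow or inaccurate drafter never stalls the target model.

```python
# Illustrative sketch (not the DSI implementation): the standard
# speculative-inference acceptance rule. A cheap drafter proposes tokens;
# the target model verifies them with a rejection-sampling rule that
# preserves the target distribution.
import random

def verify_draft(draft_tokens, drafter_probs, target_probs):
    """drafter_probs[i], target_probs[i]: probability each model assigned
    to draft_tokens[i] given the same prefix (assumed precomputed)."""
    accepted = []
    for tok, q, p in zip(draft_tokens, drafter_probs, target_probs):
        if random.random() < min(1.0, p / q):   # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            break  # first rejection ends the accepted prefix
    return accepted

# Example: a fast-but-sloppy drafter proposes three tokens.
print(verify_draft(["the", "cat", "sat"], [0.5, 0.4, 0.6], [0.45, 0.1, 0.3]))
```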
Fully-Inductive Node Classification on Arbitrary Graphs
Intel, Mila – Québec AI Institute, Google Research, University of Oxford, and Université de Montréal
One fundamental challenge in graph machine learning is generalizing to new graphs. Many existing methods follow the inductive setup and can generalize to test graphs with new structures, but they assume the feature and label spaces remain the same as in training. This paper introduces a fully-inductive setup, where models must perform inference on arbitrary test graphs with new structures and new feature and label spaces. Researchers propose GraphAny as the first attempt at this challenging setup. GraphAny models inference on a new graph as an analytical solution to a LinearGNN, which can be naturally applied to graphs with any feature and label spaces. To further build a stronger model with learning capacity, researchers fuse multiple LinearGNN predictions with learned inductive attention scores. Specifically, the attention module is carefully parameterized as a function of the entropy-normalized distance features between pairs of LinearGNN predictions to ensure generalization to new graphs. Empirically, GraphAny trained on a single Wisconsin dataset with only 120 labeled nodes can generalize to 30 new graphs with an average accuracy of 67.26%, surpassing not only all inductive baselines, but also strong transductive methods trained separately on each of the 30 test graphs.
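As a rough illustration of the LinearGNN component described above, the sketch below assumes a ridge-regression closed form over propagated features (an assumption on our part, not the paper's exact formulation); it shows why inference on a new graph needs no feature- or label-specific training.

```python
# Minimal sketch of a LinearGNN-style analytical prediction: propagate node
# features with a fixed graph operator, then solve for the label mapping in
# closed form on the labeled nodes only.
import numpy as np

def linear_gnn_predict(A_norm, X, Y_train, train_idx, hops=2, lam=1e-2):
    """A_norm: normalized adjacency (n x n); X: node features (n x d);
    Y_train: one-hot labels of the labeled nodes (m x c)."""
    F = X.copy()
    for _ in range(hops):                 # fixed, non-learned propagation
        F = A_norm @ F
    F_l = F[train_idx]                    # propagated features of labeled nodes
    # closed-form ridge solution W = (F^T F + lam I)^-1 F^T Y
    W = np.linalg.solve(F_l.T @ F_l + lam * np.eye(F.shape[1]), F_l.T @ Y_train)
    return F @ W                          # class scores for every node
```

GraphAny then fuses several such LinearGNN predictions with the learned inductive attention described above.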
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding
Intel, KAIST, and AITRICS
Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature, which processes tokens one at a time, slowing down generation compared to models like GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application in visual AR models remains largely unexplored. This work identifies a challenge in this setting, which researchers term token selection ambiguity, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, the paper proposes a relaxed acceptance condition, referred to as LANTERN, that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, researchers ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results demonstrate the efficacy of this method in providing a substantial speed-up over speculative decoding. Specifically, compared to a naive application of state-of-the-art speculative decoding, LANTERN delivers speed-ups of 1.75× and 1.82× over greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model. The code is publicly available at https://github.com/jadohu/LANTERN.
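The relaxed acceptance condition can be pictured with a short, hedged sketch. The helper structures below (a precomputed latent-space neighbor table and per-token target probabilities) are illustrative assumptions; the released repository linked above contains the actual implementation.

```python
# Hedged sketch of relaxed acceptance: instead of judging a drafted visual
# token only on its own target probability, the target probability is pooled
# over tokens that are interchangeable in latent space, rescuing drafts that
# would otherwise be rejected under token selection ambiguity.
import random

def relaxed_accept(draft_token, q_prob, target_probs, neighbors):
    """neighbors[t]: codebook tokens close to t in latent space, including t
    itself (assumed precomputed, e.g. by nearest-neighbor search)."""
    pooled_p = sum(target_probs[t] for t in neighbors[draft_token])
    return random.random() < min(1.0, pooled_p / q_prob)

probs = {7: 0.02, 8: 0.03, 9: 0.04}
print(relaxed_accept(7, q_prob=0.05, target_probs=probs, neighbors={7: [7, 8, 9]}))
```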
MambaExtend: A Training-Free Approach to Improve Long Context Extension of Mamba
Intel and University of Southern California
The inherent quadratic complexity of the attention mechanism in transformer models has driven the research community to explore alternative architectures with sub-quadratic complexity, such as state-space models. Mamba has established itself as a leading model within this emerging paradigm, achieving state-of-the-art results in various language modeling benchmarks. However, despite its impressive performance, Mamba’s effectiveness is limited by its pre-training context length, resulting in a pronounced degradation when the model is tasked with handling longer contexts. Our investigation reveals that Mamba’s inability to generalize effectively to long contexts is primarily due to the out-of-distribution (OOD) discretization steps. To address this critical limitation, we introduce MambaExtend, a novel framework designed to significantly enhance the context extension capabilities of Mamba. Specifically, MambaExtend leverages a training-free approach to calibrate only the scaling factors of discretization modules for different layers. We demonstrate both gradient-based and gradient-free zeroth-order optimization to learn the optimal scaling factors for each Mamba layer, requiring orders of magnitude fewer updates than parameter fine-tuning-based alternatives. Using this approach, we achieve a training-free context extension of up to 32×, expanding the context from 2k to 64k tokens with minimal increases in perplexity. In contrast to existing fine-tuning methods, MambaExtend selectively calibrates the scaling factors, requiring up to ~5.42×10^6 times fewer parameter updates and incurring up to 3.87× lower peak memory usage, while delivering comparable or superior long-context performance across multiple tasks. Code is available at https://github.com/ArminAzizi98/LongContextMamba.
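A simplified sketch of the knob MambaExtend calibrates might look like the following; the exact parameterization is an assumption here, but the idea is that a single scalar per layer rescales the discretization step so it stays in-distribution at longer contexts.

```python
# Illustrative sketch (assumed simplification, not the released code): each
# Mamba layer discretizes its continuous dynamics with a step size delta;
# the calibration rescales delta per layer without touching model weights.
import numpy as np

def calibrated_delta(x, W_delta, layer_scale):
    """x: (seq_len, d_in) activations; W_delta: delta projection matrix;
    layer_scale: the single training-free calibration scalar for this layer."""
    delta = np.log1p(np.exp(x @ W_delta))      # softplus, as in Mamba
    return layer_scale * delta                 # the only quantity calibrated
```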
MatExpert: Decomposing Materials Discovery by Mimicking Human Experts
Intel and Université de Montréal
Material discovery is a critical research area with profound implications for various industries. This work introduces MatExpert, a novel framework that leverages Large Language Models (LLMs) and contrastive learning to accelerate the discovery and design of new solid-state materials. Inspired by the workflow of human materials design experts, this approach integrates three key stages: retrieval, transition, and generation. First, in the retrieval stage, MatExpert identifies an existing material that closely matches the desired criteria. Second, in the transition stage, MatExpert outlines the necessary modifications to transform this material formulation to meet the specific requirements outlined by the initial user query. Third, in the generation stage, MatExpert performs detailed computations and structural generation to create new materials based on the provided information. Experimental results demonstrate that MatExpert outperforms state-of-the-art methods in material generation tasks, achieving superior performance across various metrics including validity, distribution, and stability. As such, MatExpert represents a meaningful advancement in computational material discovery using language-based generative models.
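As a loose illustration of the retrieval-transition-generation workflow, the sketch below strings the three stages together around a generic LLM callable. The helper names and the toy word-overlap retrieval are hypothetical placeholders, not MatExpert's API.

```python
# Hedged, high-level sketch of a three-stage retrieval/transition/generation
# pipeline; all names and prompts are illustrative assumptions.
def retrieve_closest(query, material_db):
    # toy retrieval: pick the entry sharing the most words with the query
    score = lambda desc: len(set(query.lower().split()) & set(desc.lower().split()))
    return max(material_db, key=score)

def matexpert_style_pipeline(user_query, material_db, llm):
    # 1) Retrieval: find an existing material close to the requested criteria
    candidate = retrieve_closest(user_query, material_db)
    # 2) Transition: describe the modifications needed to reach the target
    edits = llm(f"Material: {candidate}\nTarget: {user_query}\n"
                "Describe the modifications needed.")
    # 3) Generation: produce the new composition/structure from that plan
    return llm(f"Apply these modifications and output the new material:\n{edits}")
```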
QPM: Discrete Optimization for Globally Interpretable Image Classification
Intel and Institute for Information Processing
Understanding the classifications of deep neural networks used in safety-critical situations is becoming increasingly important. Compactness, which refers to the number of features per class and in total, is the leading metric to gauge the interpretability of such networks. To make ideal use of the desired compactness, this paper introduces the Quadratic Programming Enhanced Model (QPM), which uses discrete optimization to find an optimal feature selection and binary assignment between features and classes based on predefined similarity measures and interpretability constraints. The resulting QPM offers local and global interpretability across small and large-scale datasets, since every class is recognized by very few features. QPM shows Structural Grounding, and the resulting features are class-independent, diverse, and contrastive. Formulating the task as a quadratic problem and solving it optimally leads to an ideal use of the given capacity, elevating QPM to a new state-of-the-art for interpretable models while maintaining 98% of the uninterpretable baselines' accuracy on ImageNet-1K. Finally, the quadratic problem allows the practitioner to steer the optimization towards appropriate biases.
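The flavor of the discrete assignment problem can be conveyed with a toy sketch; the brute-force, per-class selection below is only a stand-in for (and much weaker than) the quadratic program the paper solves optimally, and the similarity scores are made up for illustration.

```python
# Toy sketch: pick a binary feature-to-class assignment with exactly k
# features per class that maximizes summed feature-class similarity.
from itertools import combinations
import numpy as np

def assign_features(similarity, k):
    """similarity: (n_features, n_classes); returns a binary assignment
    matrix with exactly k ones per class."""
    n_feat, n_cls = similarity.shape
    assignment = np.zeros_like(similarity, dtype=int)
    for c in range(n_cls):
        best = max(combinations(range(n_feat), k),
                   key=lambda idx: similarity[list(idx), c].sum())
        assignment[list(best), c] = 1
    return assignment

sim = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6], [0.1, 0.9]])
print(assign_features(sim, k=2))
```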
Scaling Long Context Training Data by Long-Distance Referrals
Intel, Carnegie Mellon University, University of California San Diego, and Mohamed bin Zayed University of Artificial Intelligence
Training large language models for long context understanding faces the challenge of data shortage. Previous data engineering approaches mechanically concatenate short documents, which may create many pseudo long documents but raise concerns about data quality. In this paper, researchers study the core attributes of high-quality data for long context training and provide a data pipeline, LongPack, to scale such data. Results showed that long-distance referrals, which occur in natural long documents, are crucial for long-context training. However, simply concatenating short documents does not reliably generate these relations. The work further demonstrates that the density of long-distance referrals, which is higher in longer documents, plays a key role in training efficiency, making previous upsampling methods suboptimal. To enrich long documents, researchers propose LongPack, a data pipeline that constructs long documents by packing shorter ones based on referral relationships. Specifically, for web pages, which are the primary source for language model training, researchers found hyperlinks to be a native signal for such relationships. By packing web pages through their hyperlink connections, researchers can create longer, high-quality documents. Experiments demonstrate that LongPack is highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset from just 0.5% of documents used as roots. Furthermore, the constructed documents are of near-natural quality, comparable to innate long documents for long-context training, reaching a 32.7% higher score than previous state-of-the-art methods.
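A hedged sketch of hyperlink-based packing might look like the following; the data structures and token budget are assumptions for illustration rather than the authors' pipeline.

```python
# Illustrative sketch: starting from a root page, follow its hyperlinks
# breadth-first and concatenate linked pages until a target length is
# reached, so the packed "document" contains genuine long-distance
# referrals between its parts.
from collections import deque

def pack_by_links(root_url, pages, links, budget_tokens=65536):
    """pages: {url: text}; links: {url: [linked urls]} (assumed precomputed)."""
    packed, used, seen = [], 0, {root_url}
    queue = deque([root_url])
    while queue and used < budget_tokens:
        url = queue.popleft()
        text = pages.get(url, "")
        packed.append(text)
        used += len(text.split())              # crude token count
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return "\n\n".join(packed)
```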
Stiefel Flow Matching for Moment-Constrained Structure Elucidation
Intel and University of Toronto
Molecular structure elucidation is a fundamental step in understanding chemical phenomena, with applications in identifying molecules in natural products, lab syntheses, forensic samples, and the interstellar medium. This work considers the task of predicting a molecule's all-atom 3D structure given only its molecular formula and moments of inertia, motivated by the ability of rotational spectroscopy to measure these moments. While existing generative models can conditionally sample 3D structures with approximately correct moments, this soft conditioning fails to leverage the many digits of precision afforded by experimental rotational spectroscopy. To address this, the paper first shows that the space of n-atom point clouds with a fixed set of moments of inertia is embedded in the Stiefel manifold. The researchers then propose Stiefel Flow Matching as a generative model for elucidating 3D structure under exact moment constraints. Additionally, they learn simpler and shorter flows by finding approximate solutions for equivariant optimal transport on the Stiefel manifold. Empirically, enforcing exact moment constraints allows Stiefel Flow Matching to achieve higher success rates and faster sampling than Euclidean diffusion models, even on high-dimensional manifolds corresponding to large molecules in the GEOM dataset.
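For intuition, here is a hedged sketch of the embedding, with notation assumed here rather than taken from the paper: mass-weighted, centered coordinates with fixed principal planar moments become orthonormal after rescaling, which places them on the Stiefel manifold.

```latex
% Hedged sketch (assumed notation): X stacks mass-weighted atomic positions,
% centered at the center of mass, in the principal-axis frame.
\[
X \in \mathbb{R}^{n\times 3},\qquad X_{i\cdot}=\sqrt{m_i}\, r_i^{\top},\qquad
\mathbf{1}^{\top} X = 0,\qquad
X^{\top} X = \Lambda = \mathrm{diag}(\lambda_1,\lambda_2,\lambda_3).
\]
\[
Y = X\,\Lambda^{-1/2}\ \Longrightarrow\ Y^{\top}Y = I_3,
\qquad Y \in \mathrm{St}(n,3),
\]
% so generating structures with exactly the measured moments reduces to
% generating points on (a submanifold of) the Stiefel manifold St(n,3).
```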
Retri3D: 3D Neural Graphics Representation Retrieval
Intel and University of Toronto
Presented as a Spotlight at ICLR 2025
Learnable 3D Neural Graphics Representations (3DNGR) have emerged as promising 3D representations for reconstructing 3D scenes from 2D images. Numerous works, including Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and their variants, have significantly enhanced the quality of these representations. The ease of construction from 2D images, suitability for online viewing/sharing, and applications in downstream game/art design tasks make 3DNGRs a vital 3D representation, with the potential for large numbers of such 3D models to be created. This necessitates large data stores, local or online, to save 3D visual data in these formats. However, no existing framework enables accurate retrieval of stored 3DNGRs. This work proposes Retri3D, a framework that enables accurate and efficient retrieval of 3D scenes represented as 3DNGRs from large data stores using text queries. Researchers introduce a novel Neural Field Artifact Analysis technique, combined with a Smart Camera Movement Module, to select clean views and navigate pre-trained 3DNGRs. These techniques enable accurate retrieval by selecting the best viewing directions in the 3D scene for high-quality visual feature embeddings. Results demonstrate that Retri3D is compatible with any 3DNGR representation. On the LERF and ScanNet++ datasets, the framework shows significant improvement in retrieval accuracy compared to existing techniques, while being orders of magnitude faster and more storage-efficient.
SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models
Intel, McGill University, and University of North Carolina at Charlotte
Generating novel crystalline materials has the potential to lead to advancements in fields such as electronics, energy storage, and catalysis. The defining characteristic of crystals is their symmetry, which plays a central role in determining their physical properties. However, existing crystal generation methods either fail to generate materials that display the symmetries of real-world crystals, or simply replicate the symmetry information from examples in a database. To address this limitation, this paper proposes SymmCD, a novel diffusion-based generative model that explicitly incorporates crystallographic symmetry into the generative process. Researchers decompose crystals into two components and learn their joint distribution through diffusion: 1) the asymmetric unit, the smallest subset of the crystal which can generate the whole crystal through symmetry transformations, and 2) the symmetry transformations that need to be applied to each atom in the asymmetric unit. The work also uses a novel and interpretable representation for these transformations, enabling generalization across different crystallographic symmetry groups. Results showcase the competitive performance of SymmCD on a subset of the Materials Project, obtaining diverse and valid crystals with realistic symmetries and predicted properties.
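A simplified sketch of the first half of that decomposition, reconstructing the full crystal from an asymmetric unit and a list of space-group operations, is shown below; the representation details are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch: apply each space-group operation (R, t) to the atoms
# of the asymmetric unit in fractional coordinates, wrap back into the unit
# cell, and de-duplicate equivalent sites.
import numpy as np

def expand_asymmetric_unit(frac_coords, species, symmetry_ops, tol=1e-3):
    """frac_coords: (m, 3) fractional coordinates of the asymmetric unit;
    symmetry_ops: list of (R, t) with R a 3x3 matrix and t a 3-vector."""
    sites, labels = [], []
    for x, s in zip(frac_coords, species):
        for R, t in symmetry_ops:
            y = (R @ x + t) % 1.0                     # map back into the cell
            if not any(np.allclose(y, z, atol=tol) for z in sites):
                sites.append(y)
                labels.append(s)
    return np.array(sites), labels

# Toy example: identity plus inversion through the origin.
ops = [(np.eye(3), np.zeros(3)), (-np.eye(3), np.zeros(3))]
print(expand_asymmetric_unit(np.array([[0.1, 0.2, 0.3]]), ["Si"], ops))
```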
Workshop Papers
Context Is All You Need: Efficient Retrieval Augmented Generation for Domain Specific AI
ICLR - SCOPE Workshop
Effective Retrieval-Augmented Generation (RAG) pipelines face significant challenges when processing domain-specific technical documents containing diverse content types like text, figures, equations, and tables. This work introduces CoRAG, Context-oriented RAG for domain-specific applications, which enhances contextual understanding through a lightweight, two-pipeline architecture: Content Analysis & Enrichment for structured metadata extraction, and Query Processing for context-aware retrieval. This approach emphasizes preserving structural relationships and semantic connections across different modalities, enabling more precise technical information retrieval. Evaluated on an LLM dataset of complex Telco 3GPP technical specifications, CoRAG achieves 77.00% accuracy while using smaller models than current state-of-the-art methods, establishing a new benchmark for telco-RAG applications. The system’s efficient design and comprehensive context handling make advanced RAG capabilities more accessible for domain-specific use while maintaining high performance across varying levels of technical complexity.
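A generic, hedged sketch of such a two-pipeline design is shown below; the metadata heuristics and toy lexical scoring are placeholders, not the CoRAG implementation.

```python
# Illustrative two-pipeline RAG skeleton: enrich chunks with structural
# metadata at ingest time, then filter and rank by that metadata at query
# time before handing context to an LLM.
def enrich(chunks):
    # Content Analysis & Enrichment: attach lightweight structural metadata
    enriched = []
    for section, text in chunks:
        kind = "table" if "|" in text else "equation" if "=" in text else "text"
        enriched.append({"section": section, "type": kind, "text": text})
    return enriched

def retrieve(store, query, want_type=None, top_k=3):
    # Query Processing: metadata-aware filtering plus a toy lexical score
    pool = [c for c in store if want_type is None or c["type"] == want_type]
    score = lambda c: len(set(query.lower().split()) & set(c["text"].lower().split()))
    return sorted(pool, key=score, reverse=True)[:top_k]

store = enrich([("5.2", "RRC connection setup procedure"), ("5.3", "N = k | table |")])
print(retrieve(store, "RRC connection setup"))
```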
LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models
Intel and KAIST
ICLR - SCOPE Workshop (Oral)
Speculative decoding has been widely used to accelerate autoregressive (AR) text generation. However, its effectiveness for visual AR models remains limited due to token selection ambiguity, where multiple tokens share similarly low probabilities and thus reduce acceptance rates. Recently, relaxed speculative decoding with dynamic tree drafting was proposed to mitigate this ambiguity, demonstrating promising results in accelerating visual AR models. Nevertheless, the researchers observed that token selection ambiguity continues to negatively affect dynamic tree drafting, resulting in shallow draft trees and limited acceleration. To overcome this issue, this work introduces LANTERN++, a novel framework that integrates static tree drafting with a relaxed acceptance condition, allowing drafts to be selected independently of low-confidence predictions. This enables the acceptance of deeper sequences, improving decoding efficiency while preserving image quality. Extensive experiments on state-of-the-art visual AR models demonstrate that LANTERN++ significantly accelerates inference, achieving substantial speedups over standard AR decoding while maintaining high image quality.
Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2
Intel, University of Groningen, and University of California, Santa Cruz
ICLR - SCOPE Workshop
Large language models (LLMs) deliver impressive performance but require large amounts of energy. This work presents a MatMul-free LLM architecture adapted for Intel’s neuromorphic processor, Loihi 2. This approach leverages Loihi 2’s support for low-precision, event-driven computation and stateful processing. Our hardware-aware quantized model on GPU demonstrates that a 370M-parameter MatMul-free model can be quantized with no accuracy loss. Preliminary results demonstrate up to 3× higher throughput with 2× less energy, compared to transformer-based LLMs on an edge GPU, with significantly better scaling. Further hardware optimizations will increase throughput and decrease energy consumption. These results show the potential of neuromorphic hardware for efficient inference and pave the way for efficient reasoning models capable of generating complex, long-form text rapidly and cost-effectively.
QMambaExtend: Improving Long-Context Extension of Memory-Efficient Mamba Models
Intel and University of Southern California
ICLR - SCOPE Workshop
Despite its impressive sub-quadratic compute efficiency, Mamba's effectiveness is significantly limited by its pre-training context length, resulting in a pronounced degradation when the model is tasked with handling longer contexts. This may be attributed to the out-of-distribution (OOD) discretization steps of Mamba on longer contexts. To address this critical limitation, this paper introduces MambaExtend, a novel framework designed to enhance the context extension capabilities of Mamba. Specifically, MambaExtend leverages a training-free approach to calibrate only the scaling factors of discretization modules for different layers. To further enhance model efficiency while improving long-context understanding, researchers benchmarked the performance of quantized model variants, namely QMambaExtend. QMambaExtend enables, for the first time, a training-free context extension of up to 32x (from 2k to 64k tokens) while also reducing the weight memory footprint by up to 2.1x.
XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units
Intel and Purdue University
ICLR - SCOPE Workshop
State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks, offering linear or near-linear scalability with sequence length, unlike transformers with quadratic-complexity attention. This makes SSMs ideal for long-sequence tasks in natural language processing (NLP), vision, and edge AI applications such as real-time transcription, translation, and contextual search. These applications demand lightweight, high-performance models for deployment on resource-constrained devices like laptops and PCs. While specialized accelerators have been proposed for emerging neural networks, designing new hardware is time-intensive, costly, and impractical for every model. Instead, optimizing models for existing neural processing units (NPUs) in AI PCs offers a scalable and efficient solution. Towards this end, this paper proposes XAMBA, the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs. This approach follows a systematic three-step methodology: (1) enabling SSMs on NPUs, (2) optimizing performance to meet target Key Performance Indicator (KPI) requirements, and (3) trading accuracy for additional performance gains. After enabling SSMs on NPUs, XAMBA addresses key performance bottlenecks with two techniques: CumBA and ReduBA. These replace sequential CumSum and ReduceSum operations with matrix-based computations, significantly improving execution speed and memory efficiency. In addition, ActiBA further enhances performance by mapping computationally expensive activation functions (e.g., Swish, Softplus) to NPU hardware using piecewise linear approximations, reducing latency with minimal accuracy loss. Experimental evaluations on an Intel® Core™ Ultra Series 2 AI PC demonstrate that XAMBA achieves significant performance improvements, reducing execution latency by up to 4.8× compared to the baseline implementation. The implementation is available at https://github.com/arghadippurdue/XAMBA.
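The CumBA idea can be illustrated with a small, hedged sketch: a sequential cumulative sum re-expressed as one matrix multiplication against a lower-triangular mask, which maps far better onto matmul-oriented NPU hardware than a token-by-token scan. The code below is a simplification, not the released implementation.

```python
# Illustrative sketch: cumulative sum along the sequence axis computed as
# L @ x, where L is the lower-triangular all-ones matrix.
import numpy as np

def cumba_cumsum(x):
    """x: (seq_len, d). Returns the running sum over the sequence axis."""
    L = np.tril(np.ones((x.shape[0], x.shape[0]), dtype=x.dtype))
    return L @ x

x = np.random.rand(6, 4).astype(np.float32)
assert np.allclose(cumba_cumsum(x), np.cumsum(x, axis=0), atol=1e-5)
```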