Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.
Highlights:
- The International Conference on Machine Learning (ICML) 2025 runs from July 13th through July 19th in Vancouver, Canada.
- Intel Labs is excited to present six works at the main conference, including two spotlight papers, one of which is also an oral presentation. Intel researchers will also present two works at related conference workshops.
- Intel’s oral presentation details three new speculative decoding methods that preserve the target distribution and work with off-the-shelf models without requiring additional training or modifications. The algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding on summarization, programming, and long-context tasks.
The International Conference on Machine Learning (ICML) 2025 runs from July 13th through July 19th in Vancouver, Canada. This year marks the forty-second iteration of the conference, bringing industry and academic professionals from around the world to present the latest in cutting-edge machine learning research. Conference presentations also span a range of closely associated topics, including artificial intelligence, statistics and data science, machine vision, computational biology, speech recognition, and robotics.
Intel Labs is excited to present six works at the main conference, including two spotlight papers, one of which is also an oral presentation. Intel’s oral presentation details three new speculative decoding methods that preserve the target distribution and work with off-the-shelf models without requiring additional training or modifications. The algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding on summarization, programming, and long-context tasks, substantially broadening the applicability of the SD framework in practice. Intel’s other works at the conference include a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, and a novel design that maintains two specialized hidden states, thereby mitigating the short-range bias typical of linear-attention architectures.
Intel researchers will also present two works at related conference workshops: a framework that automates mapper development with generative optimization, and a framework designed to facilitate composite AI safety workflows. Read below to learn more about all of Intel’s contributions at ICML 2025.
Main Conference Contributions
A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? This work addresses the question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, the paper proposes that GPT models can be utilized at inference time for zero-shot causal structure learning on input sequences, and introduces a corresponding confidence score. Empirical tests were conducted in controlled environments using the setups of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, was tested on out-of-distribution synthetic data consisting of sequences of random legal moves. Results show that the GPT model tends to generate legal next moves for out-of-distribution sequences precisely when a causal structure is encoded in the attention mechanism with high confidence; in cases where it generates illegal moves, it also fails to capture a causal structure.
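To make the notion of reading a causal structure off attention more concrete, here is a minimal, hypothetical sketch that thresholds aggregated attention weights into a candidate parent graph and computes a crude confidence score. It only illustrates the general idea; the paper's actual derivation and confidence measure differ.

```python
import torch

def attention_causal_graph(attn, threshold=0.1):
    """attn: attention weights of shape (layers, heads, seq, seq) from a GPT forward pass.

    Illustrative only: returns a boolean matrix of candidate causal edges
    (column j -> row i) and a crude confidence score, not the paper's method.
    """
    # Average attention over layers and heads to get a single (seq, seq) map;
    # row i holds the attention that token i pays to earlier tokens.
    avg = attn.mean(dim=(0, 1))
    parents = avg > threshold  # candidate parents of each token
    # Crude confidence: fraction of total attention mass explained by kept parents.
    confidence = (avg * parents).sum().item() / avg.sum().item()
    return parents, confidence
```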
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Oral Presentation
Spotlight Paper
Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. This work presents three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, these algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
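For context, here is a minimal sketch of the standard lossless speculative decoding acceptance rule that methods in this space build on. It assumes the drafter and target share a vocabulary; handling heterogeneous vocabularies is precisely what the paper adds and is not shown here, and the function name and tensor layouts are illustrative assumptions.

```python
import torch

def verify_draft(target_probs, draft_probs, draft_tokens):
    """target_probs, draft_probs: (k, vocab) distributions at the k drafted positions.
    draft_tokens: (k,) tokens proposed by the drafter.

    Returns the accepted tokens; on the first rejection, a corrected token
    resampled from the residual distribution is appended. This acceptance rule
    preserves the target distribution (lossless).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(tok)  # accept the drafted token
        else:
            # Reject: resample from the residual distribution max(p - q, 0), renormalized.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            return accepted
    return accepted
```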
Morse: Dual-Sampling for Lossless Acceleration of Diffusion Models
This paper presents Morse, a simple dual-sampling framework for losslessly accelerating diffusion models. The key insight of Morse is to reformulate the iterative generation process (from noise to data) by taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models, called Dash and Dot, that interact with each other. The Dash model is simply the pre-trained diffusion model of any type, but it operates in a jump-sampling regime, creating sufficient room for efficiency gains. The Dot model, which is significantly faster than the Dash model, is trained to generate residual feedback conditioned on the observations at the current jump-sampling point on the Dash model's trajectory, lifting the noise estimate to match the next-step estimate the Dash model would produce without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse can flexibly attain the desired image generation quality while improving overall runtime efficiency. With the proposed weight-sharing strategy between the Dash and Dot models, Morse is efficient for both training and inference. The method shows a lossless speedup of 1.78× to 3.31× on average over a wide range of sampling step budgets relative to 9 baseline diffusion models on 6 image generation tasks. Furthermore, results show that the method can also be generalized to improve the Latent Consistency Model (LCM-SDXL, already accelerated with the consistency distillation technique), which is tailored for few-step text-to-image synthesis. The code and models are available here.
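As a rough illustration of the time-interleaved Dash/Dot idea (not the paper's actual implementation), the hypothetical sketch below alternates an expensive estimate from a pre-trained model at jump points with cheap residual corrections from a fast model in between; the solver update, conditioning, and scheduling are simplified assumptions.

```python
import torch

def euler_step(x, eps, t, t_next):
    # Simple Euler-style solver update, a stand-in for the actual sampler.
    return x + (t_next - t) * eps

def dual_sample(dash_model, dot_model, x, timesteps):
    """timesteps: sequence of noise levels. The slow 'Dash' model runs only at
    even indices (jump points); the fast 'Dot' model fills the steps in between
    with a residual correction conditioned on Dash's last estimate."""
    eps_dash = None
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        if i % 2 == 0:
            eps = eps_dash = dash_model(x, t)          # expensive estimate at a jump point
        else:
            eps = eps_dash + dot_model(x, t, eps_dash)  # cheap residual feedback
        x = euler_step(x, eps, t, t_next)
    return x
```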
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. This work proposes dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states—one for preserving historical context and one for tracking recency—thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, this paper introduces DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving the overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3× faster inference than Llama2-7B and 3.0× faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Ablation studies show that DSLA’s dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions.
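The sketch below gives a simplified, illustrative take on keeping two linear-attention states with different decay rates, a slow one for historical context and a fast one for recency; the paper's actual DSLA formulation, gating, feature maps, and normalization differ, and all constants here are assumptions.

```python
import torch

def dual_state_linear_attention(q, k, v, slow_decay=0.999, fast_decay=0.9, mix=0.5):
    """q, k, v: (seq, dim). Returns (seq, dim) outputs in O(seq * dim^2) time.
    Omits the feature map and normalization that practical linear attention uses."""
    d = q.shape[-1]
    s_hist = torch.zeros(d, d)    # slowly decaying state: long-range context
    s_recent = torch.zeros(d, d)  # fast-decaying state: recent tokens
    outputs = []
    for t in range(q.shape[0]):
        kv = torch.outer(k[t], v[t])              # rank-1 update from the current token
        s_hist = slow_decay * s_hist + kv
        s_recent = fast_decay * s_recent + kv
        out = q[t] @ (mix * s_hist + (1 - mix) * s_recent)  # blend both states
        outputs.append(out)
    return torch.stack(outputs)
```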
PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop
Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. Results show that state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, the researchers found that fine-tuning on a relatively small number of simulated videos is effective in inducing the dropping behavior in the model, and results are further improved through a novel reward modeling procedure introduced in the paper. The study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, the work puts forth a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development. The code is available here.
SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Spotlight Paper
Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where models should effectively integrate additional knowledge to generate a response. However, existing vision and language models (VLMs) are not inherently designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training large VLMs, its application for context-augmented generation remains underexplored. To address this gap, this paper introduces SKVQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with external knowledge sources to determine the final answer. Compared to previous datasets, SKVQA exhibits 11× more unique questions, greater domain diversity, and a broader spectrum of image sources. Human evaluations confirm the high quality of the generated question-answer pairs and their contextual relevance. Extensive experiments show that SKVQA serves both as a challenging benchmark for knowledge-based VQA and as an effective training resource for adapting generative multimodal models to context-augmented generation. Results further indicate that models trained on SKVQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.
Workshop Papers
BlueGlass: A Framework for Composite AI Safety
Workshop on Actionable Interpretability
As AI systems become increasingly capable and ubiquitous, ensuring their safety is critical. However, existing safety tools often target different aspects of model safety and cannot provide full assurance in isolation, highlighting a need for integrated and composite methodologies. This paper introduces BLUEGLASS, a framework designed to facilitate composite AI safety workflows by providing a unified infrastructure that enables the integration and composition of diverse safety tools operating across model internals and outputs. To demonstrate the utility of this framework, the paper presents three safety-oriented analyses on vision-language models for the task of object detection: (1) distributional evaluation, revealing performance tradeoffs and potential failure modes across distributions; (2) probe-based analysis of layer dynamics, highlighting shared hierarchical learning via phase transition; and (3) sparse autoencoders identifying interpretable concepts. More broadly, this work contributes foundational infrastructure and findings for building more robust and reliable AI systems.
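As an example of the kind of analysis such a framework is meant to compose, here is a hedged sketch of probe-based layer analysis: fitting a linear probe per layer on cached activations and comparing accuracy across depth. It does not use BlueGlass's actual APIs; all names and inputs are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(activations_by_layer, labels):
    """activations_by_layer: {layer_name: (n_samples, dim) array}; labels: (n_samples,).

    Fits one linear probe per layer and reports held-out accuracy, a common way
    to track where in the network a property becomes linearly decodable."""
    scores = {}
    for name, acts in activations_by_layer.items():
        x_tr, x_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        scores[name] = probe.score(x_te, y_te)  # per-layer probe accuracy
    return scores
```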
Improving Parallel Program Performance with LLM Optimizers via Agent-System Interface
Programmatic Representations for Agent Learning Workshop
Modern scientific discovery increasingly relies on high-performance computing for complex modeling and simulation. A key challenge in improving parallel program performance is efficiently mapping tasks to processors and data to memory, a process dictated by intricate, low-level system code known as mappers. Developing high-performance mappers demands days of manual tuning, posing a significant barrier for domain scientists without systems expertise. This work introduces a framework that automates mapper development with generative optimization, leveraging richer feedback beyond scalar performance metrics. The approach features the Agent-System Interface, which includes a Domain-Specific Language (DSL) that abstracts away the low-level complexity of system code and defines a structured search space, as well as AutoGuide, a mechanism that interprets raw execution output into actionable feedback. Unlike traditional autotuning methods such as OpenTuner, which rely solely on scalar feedback, this method finds superior mappers in far fewer iterations: with just 10 iterations, it outperforms OpenTuner even after 1,000 iterations, achieving 3.8 times faster performance. The approach finds mappers that achieve up to a 1.34 times speedup over expert-written mappers across nine benchmarks while reducing tuning time from days to minutes.
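To illustrate the general shape of such a generative-optimization loop (not the paper's actual system), the hypothetical sketch below has an LLM propose a mapper in a DSL, runs it, and converts the raw execution output into actionable feedback for the next proposal. The callables propose_mapper, run_benchmark, and summarize_feedback stand in for components like the Agent-System Interface and AutoGuide; they are illustrative, not real APIs.

```python
def optimize_mapper(propose_mapper, run_benchmark, summarize_feedback, iterations=10):
    """Generative optimization over mapper code driven by rich textual feedback."""
    best_mapper, best_time, feedback = None, float("inf"), ""
    for _ in range(iterations):
        mapper_dsl = propose_mapper(feedback)        # LLM generates a candidate mapper in the DSL
        runtime, raw_logs = run_benchmark(mapper_dsl)  # execute and collect raw output
        if runtime < best_time:
            best_mapper, best_time = mapper_dsl, runtime
        # Turn raw execution output into guidance richer than a scalar reward.
        feedback = summarize_feedback(mapper_dsl, runtime, raw_logs)
    return best_mapper, best_time
```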