
Intel Labs Presents Leading Multimodal and Agentic Research at CVPR 2025

Scott_Bair
Employee

Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.

Highlights:

  • This year’s IEEE/CVF Conference on Computer Vision and Pattern Recognition runs from June 11th through 15th in Nashville, Tennessee.
  • Intel is presenting a tutorial detailing best-in-world multimodal foundation models developed by Intel Labs researchers on Intel hardware.
  • Intel researchers won the EgoExo4D Fine-grained Keystep Recognition Challenge.
  • Intel is proud to be a Platinum sponsor of this year’s event and will have a meet-and-greet with researchers and demos available at booth 907.
  • Intel Labs researchers will also present eleven papers at conference workshops. These works include a framework for systematic hierarchical analysis of vision model representations; a flexible graph-learning framework for fine-grained keystep recognition; and a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts.

The 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) runs from June 11th through 15th in Nashville, Tennessee. CVPR is among the top-ranked conferences across engineering and computer science, and Intel is honored to present a tutorial at this year’s event.

Intel’s tutorial – Cognitive AI for the Future: Agentic Multimodal Models and RAG for Vision Language Applications, from Training to Deployment – details best-in-world multimodal foundation models developed by Intel Labs researchers on Intel hardware. The tutorial will also cover the efficient deployment of these models through agentic solutions on Intel’s client offerings – including AI PCs and discrete GPUs – and datacenter offerings, such as Gaudi 2 and Gaudi 3. For a preview of this tutorial, please watch our demonstration video from Intel Vision 2025 Selects, Agentic Computing for AI Agents.

This year, one of Intel Labs’ works presents a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos. This work outperforms existing methods by more than 12 points in accuracy and was the winning submission in the EgoExo4D Fine-grained Keystep Recognition Challenge.

We are also pleased to highlight Intel Labs research scientist Subarna Tripathi, who was invited to be a mentor at the Women in Computer Vision (WiCV) workshop co-located with the main conference.

Intel is a Platinum sponsor of this year’s event and will have a meet-and-greet with researchers and demos available at booth 907. Stop by to chat and learn more about our latest developments. Furthermore, if you are at the conference, register for Intel AI’s “After.CVPR() Dev Meetup” on June 14th for some more demos, networking opportunities, and snacks.

Our researchers will present ten other papers at conference workshops. These works include a framework for systematic hierarchical analysis of vision model representations; a novel framework to systematically evaluate how vision-language models encode cultural differences and biases; and a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts. To learn more about these research efforts and other contributions, read on below.

Workshop Papers

Analyze, Generate, Improve: Failure-Based Data Generation for Large Multimodal Models

Training models on synthetic data is an effective strategy for improving large multimodal models (LMMs) due to the scarcity of high-quality paired image-text data. Existing methods generate multimodal datasets but do not address specific reasoning deficiencies in LMMs. In contrast, humans learn efficiently by focusing on past failures. Inspired by this, Intel researchers propose a synthetic data generation approach that analyzes an LMM’s reasoning failures using frontier models to generate and filter high-quality examples. This method produces a 553k-example multimodal instruction tuning dataset, leading to improved LMM performance, even surpassing models trained on equivalent real data, demonstrating the high value of generating synthetic data targeted to specific reasoning failure modes in LMMs.
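
A rough sketch of this failure-targeted loop is shown below. It is an illustration only; run_lmm, is_correct, frontier_generate, and passes_filter are hypothetical stand-ins for the target LMM, an automatic grader, the frontier model, and the quality filter, and none of these names come from the paper.

```python
def build_failure_targeted_dataset(seed_examples, run_lmm, is_correct,
                                   frontier_generate, passes_filter,
                                   variants_per_failure=3):
    """Collect synthetic instruction-tuning examples aimed at an LMM's reasoning failures."""
    synthetic = []
    for ex in seed_examples:
        prediction = run_lmm(ex["image"], ex["question"])   # current model's answer
        if is_correct(prediction, ex["answer"]):            # keep only the failure cases
            continue
        # A frontier model analyzes the failure and writes new, targeted examples.
        candidates = frontier_generate(ex, wrong_answer=prediction, n=variants_per_failure)
        # Filter the generations for quality before adding them to the tuning set.
        synthetic.extend(c for c in candidates if passes_filter(c))
    return synthetic
```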

Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders
Oral Presentation delivered by Matthew Olson

The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this paper, Intel researchers conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. This work extends their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. The results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. Researchers analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. The study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.
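
To make the probing setup more concrete, the sketch below shows a simple sparse autoencoder of the kind typically fit to a model’s activations; fitting it to class-token activations from a DINOv2 layer and checking which features fire within which ImageNet superclasses is the general idea. This is an assumed, simplified setup, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE for probing vision-model activations (illustrative only)."""
    def __init__(self, d_model, d_hidden, sparsity_weight=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.sparsity_weight = sparsity_weight

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

    def loss(self, x):
        features, reconstruction = self(x)
        recon = (reconstruction - x).pow(2).mean()
        sparsity = features.abs().mean()          # L1 penalty keeps few features active
        return recon + self.sparsity_weight * sparsity

# Usage idea: fit the SAE on class-token activations collected from one DINOv2 layer,
# then test whether individual features activate consistently within ImageNet superclasses.
```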

Cultural Awareness in Vision-Language Models: A Cross-Country Exploration

Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts, yet their internal biases remain poorly understood. In this work, Intel researchers propose a novel framework to systematically evaluate how VLMs encode cultural differences and biases related to race, gender, and physical traits across countries. They introduce three retrieval-based tasks: (1) Race to Country retrieval, which examines the association between individuals from specific racial groups (East Asian, White, Middle Eastern, Latino, South Asian, and Black) and different countries; (2) Personal Traits to Country retrieval, where images are paired with trait-based prompts (e.g., Smart, Honest, Criminal, Violent) to investigate potential stereotypical associations; and (3) Physical Characteristics to Country retrieval, focusing on visual attributes like skinny, young, obese, and old to explore how physical appearances are culturally linked to nations. The findings reveal persistent biases in VLMs, highlighting how visual representations may inadvertently reinforce societal stereotypes.
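
The retrieval setup can be pictured as an ordinary CLIP zero-shot comparison between an image and country prompts. The snippet below uses the public openai/clip-vit-base-patch32 checkpoint and an illustrative country list and prompt template; it is a simplified stand-in for the paper’s protocol, not its code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

countries = ["the United States", "India", "Nigeria", "China", "Brazil"]  # illustrative list
prompts = [f"a photo of a person from {c}" for c in countries]

image = Image.open("portrait.jpg")  # an image of a person from one demographic group
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # similarity of the image to each prompt

for country, p in zip(countries, logits.softmax(dim=-1)[0].tolist()):
    print(f"{country}: {p:.3f}")   # a skewed distribution hints at a stereotypical association
```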

Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation

To address the data scarcity associated with 3D assets, 2D-lifting techniques such as Score Distillation Sampling (SDS) have become a widely adopted practice in text-to-3D generation pipelines. However, the diffusion models used in these techniques are prone to viewpoint bias and thus lead to geometric inconsistencies such as the Janus problem. To counter this, this paper introduces MT3D, a text-to-3D generative model that leverages a high-fidelity 3D object to overcome viewpoint bias and explicitly infuse geometric understanding into the generation pipeline. First, researchers employ depth maps derived from a high-quality 3D model as control signals to guarantee that the generated 2D images preserve the fundamental shape and structure, thereby reducing the inherent viewpoint bias. Next, they utilize deep geometric moments to explicitly ensure geometric consistency in the 3D representation. By incorporating geometric details from a 3D asset, MT3D enables the creation of diverse and geometrically consistent objects, thereby improving the quality and usability of the resulting 3D representations. Find the project page and code here.
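
The depth-as-control-signal step can be approximated with an off-the-shelf depth ControlNet from the diffusers library, as sketched below. This only illustrates how a depth map rendered from a reference asset constrains the generated views; it omits MT3D’s SDS-based 3D optimization and its geometric-moment objective, and the checkpoints and file names are assumptions for the example.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth map rendered from the reference high-quality 3D asset (assumed to exist on disk).
depth_map = Image.open("reference_depth.png").convert("RGB")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The depth conditioning keeps the generated 2D view aligned with the asset's shape.
image = pipe("a ceramic owl figurine", image=depth_map, num_inference_steps=30).images[0]
image.save("depth_conditioned_view.png")
```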

DPO Learning with LLMs-Judge Signal for Computer Use Agents

Computer use agents (CUAs) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUAs have made significant progress with the advent of large vision-language models (VLMs). However, these agents typically rely on cloud-based inference with substantial compute demands, raising critical privacy and scalability concerns, especially when operating on personal devices. This work takes a step toward privacy-preserving and resource-efficient agents by developing a lightweight vision-language model that runs entirely on local machines. To train this compact agent, researchers introduce an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality data for reinforcement learning without human annotation. Experiments on the OS-World benchmark demonstrate that this fine-tuned local model outperforms existing baselines, highlighting a promising path toward private, efficient, and generalizable GUI agents.
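
Conceptually, the judge converts unlabeled rollouts into preference data for DPO. Here is a minimal sketch, where sample_trajectory and judge_score are hypothetical stand-ins for the local agent’s rollout function and the LLM judge, and the scoring scale is assumed.

```python
def build_dpo_pairs(tasks, sample_trajectory, judge_score,
                    samples_per_task=4, min_gap=1.0):
    """Turn judge-scored rollouts into (chosen, rejected) preference pairs for DPO."""
    pairs = []
    for task in tasks:
        rollouts = [sample_trajectory(task) for _ in range(samples_per_task)]
        scored = sorted(((judge_score(task, r), r) for r in rollouts), key=lambda s: s[0])
        worst_score, worst = scored[0]
        best_score, best = scored[-1]
        if best_score - worst_score >= min_gap:   # keep only clearly separated pairs
            pairs.append({"prompt": task, "chosen": best, "rejected": worst})
    return pairs
```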

EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs

Recent advancements in multimodal large language models (MLLMs) have enhanced their ability to interact with visual content, enabling users to engage with videos and images through conversational interfaces. However, MLLMs continue to face challenges with questions that demand a nuanced understanding of scene details, such as object manipulations. Existing question-answering benchmarks predominantly rely on video narrations, often resulting in questions that are either closed-ended or lack spatio-temporal grounding, thereby limiting the scope of evaluation. To address this, Intel researchers introduce EASG-Bench, a novel Q&A benchmark that leverages Egocentric Action Scene Graphs (EASG) to generate structured annotations, resulting in a dataset of 1,807 Q&A pairs across five categories. An evaluation of state-of-the-art MLLMs and LLMs using this benchmark reveals that current MLLMs still struggle with questions requiring temporal reasoning. To support the reproducibility of our findings and encourage further research, the benchmark and accompanying code are accessible at the following GitHub repository: https://github.com/fpv-iplab/EASG-bench.

Keystep Recognition using Graph Neural Networks

Egocentric videos capture scenes from a wearer’s viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep recognition. This work proposes a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos, as well as alignment between egocentric and exocentric videos during training, for improved inference on egocentric videos. Intel’s approach consists of constructing a graph where each video clip of the egocentric video corresponds to a node. During training, researchers consider each clip of each exocentric video (if available) as an additional node. The work examines several strategies to define connections across these nodes and poses keystep recognition as a node classification task on the constructed graphs. Researchers perform extensive experiments on the Ego-Exo4D dataset and show that the proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute-efficient. This work also presents a study on harnessing several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph, and discusses their corresponding contributions to keystep recognition performance.
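
As a concrete (and deliberately simplified) picture of the graph formulation, the sketch below uses PyTorch Geometric: each egocentric clip becomes a node carrying a precomputed feature vector, consecutive clips are linked, and a two-layer GCN classifies every node into a keystep. The actual framework explores richer connectivity strategies, exocentric nodes, and multimodal features, so treat this as an illustration only.

```python
import torch
from torch import nn
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

def build_clip_graph(clip_features, labels):
    """One node per egocentric clip, with an undirected temporal chain between neighbors."""
    num_clips = clip_features.size(0)
    src = torch.arange(0, num_clips - 1)
    dst = torch.arange(1, num_clips)
    edge_index = torch.cat([torch.stack([src, dst]), torch.stack([dst, src])], dim=1)
    return Data(x=clip_features, edge_index=edge_index, y=labels)

class KeystepGNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_keysteps):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_keysteps)

    def forward(self, data):
        h = torch.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)   # per-clip keystep logits (node classification)
```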

PALADIN: Robust Neural Fingerprinting for Text-to-Image Diffusion Models

The risk of misusing text-to-image generative models for malicious purposes, especially due to the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. There has been a plethora of recent work intended to address neural fingerprinting, and the trade-off between the attribution accuracy and generation quality of such models has been studied extensively. However, none of the existing methods has yet achieved 100% attribution accuracy, and any model with less than “perfect” accuracy is practically non-deployable. This work proposes an accurate method to incorporate neural fingerprinting for text-to-image diffusion models by leveraging the concepts of cyclic error-correcting codes from the literature of coding theory.

Plan-Action-Reflection: A Three-Role Agentic Framework For Computer Use Agent Task

Recent advancements in multimodal foundation models have enabled agents to perform complex computer use tasks by interpreting and interacting with graphical user interfaces. However, these agents often struggle with task decomposition and error reflection. To address these limitations, Intel researchers propose a three-role agentic framework that enhances performance through structured task planning and reflection. This framework consists of (1) a planning agent that decomposes high-level user goals into actionable sub-tasks, (2) an action agent that executes the sub-tasks via grounded multimodal actions, and (3) a reflection agent that monitors execution outcomes and provides feedback to update the plan or correct the action. Experiments on benchmark computer use tasks demonstrate that the proposed framework significantly boosts task completion rates.
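
The three roles reduce to a simple control loop. Below is a minimal sketch in which planner, actor, and reflector are hypothetical callables wrapping the three agents; the feedback format is assumed rather than taken from the paper.

```python
def run_agent(goal, planner, actor, reflector, max_steps=20):
    """Plan-act-reflect loop: decompose the goal, execute sub-tasks, and revise on feedback."""
    plan = planner(goal)                # list of actionable sub-tasks
    observation = None
    for _ in range(max_steps):
        if not plan:
            break
        subtask = plan.pop(0)
        observation = actor(subtask)    # grounded GUI action plus the resulting screen state
        feedback = reflector(goal, subtask, observation)
        if feedback.get("retry"):       # correct the failed action and try again
            plan.insert(0, feedback["revised_subtask"])
        elif feedback.get("replan"):    # rebuild the remaining plan from the execution summary
            plan = planner(goal, history=feedback["summary"])
    return observation
```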

Quantifying Interpretability in CLIP Models with Concept Consistency

CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. While recent work has proposed decomposition-based interpretability methods for identifying textual descriptions of attention heads in CLIP, the implications of conceptual consistency in these text labels for interpretability and model performance have not been explored. To bridge this gap, Intel researchers studied the conceptual consistency of text descriptions for attention heads in CLIP-like models and conducted extensive experiments on six different models from OpenAI and OpenCLIP, which vary by size, type of pre-training data, and patch size. This work proposes the Concept Consistency Score (CCS), a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts. To assign concept labels to heads, researchers use in-context learning with ChatGPT, guided by a few manually curated examples, and validate these labels using an LLM-as-a-judge approach. Soft-pruning experiments reveal that high-CCS heads are critical for preserving model performance, as pruning them leads to a significantly larger performance drop than pruning random or low-CCS heads. Notably, the results demonstrate that high-CCS heads capture essential concepts and play a key role in out-of-domain detection, concept-specific reasoning, and video-language understanding. Moreover, researchers show that high-CCS heads learn spurious correlations that amplify social biases. These results position CCS as a powerful interpretability metric exposing the paradox of performance and social biases in CLIP models.
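
At a high level, a head’s CCS can be thought of as the fraction of its top text descriptions that a judge deems consistent with the head’s assigned concept label. The sketch below is a loose simplification under that reading, with judge_is_consistent as a hypothetical callable; the paper’s exact definition may differ.

```python
def concept_consistency_score(head_descriptions, concept_label, judge_is_consistent):
    """Fraction of a head's text descriptions judged consistent with its concept label."""
    votes = [judge_is_consistent(desc, concept_label) for desc in head_descriptions]
    return sum(votes) / len(votes)

# Usage idea: score every attention head, then soft-prune (zero out) the highest- versus
# lowest-scoring heads and compare how much downstream performance drops in each case.
```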

Why do LLaVA Vision-Language Models Reply to Images in English?

This work uncovers a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query. This paper investigates the causes of this loss with a two-pronged approach that combines extensive ablation of the design space with a mechanistic analysis of the models' internal representations of image and text inputs. Both approaches indicate that the issue stems from the language modelling component of the LLaVA model. Statistically, researchers find that switching the language backbone for a bilingual language model has the strongest effect on reducing this error. Mechanistically, this work provides compelling evidence that visual inputs are not mapped to a similar space as text ones, and that intervening on intermediary attention layers can reduce this bias. The findings provide important insights to researchers and engineers seeking to understand the crossover between multimodal and multilingual spaces, and contribute to the goal of developing capable and inclusive VLMs for non-English contexts.
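
The reported effect is straightforward to quantify: send the same non-English prompts to the model with and without an attached image and compare the share of replies that come back in English. A small sketch follows, where generate is a hypothetical wrapper around a LLaVA-style VLM and langdetect is used for language identification.

```python
from langdetect import detect  # pip install langdetect

def english_reply_rate(prompts, generate, image=None):
    """Fraction of model replies detected as English for a batch of non-English prompts."""
    replies = [generate(prompt, image) for prompt in prompts]
    return sum(detect(reply) == "en" for reply in replies) / len(replies)

# Usage idea: compare english_reply_rate(german_prompts, generate) with
# english_reply_rate(german_prompts, generate, image=some_image); the paper reports
# that attaching an image sharply increases the chance of an English reply.
```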

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness for Intel’s leading-edge research activities, like AI, Neuromorphic Computing and Quantum Computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint-research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of 5 children. Scott has over 23 years of experience in the computing industry, bringing new products and technology to market. During his 15 years at Intel, he has worked in a variety of roles from R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott has an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.