
How Intel Creates Better AI Video Understanding with Scene Graph Technology

Tz-Ying Wu

Co-authors Tz-Ying Wu, Sharath Nittur Sridhar, and Subarna Tripathi are research scientists at Intel focusing on multimodal AI and video understanding.

Highlights

  • Intel researchers developed EASG-Bench, a new benchmark with more than 1,800 question-answer pairs that tests how well AI models understand what happens in first-person videos by using structured scene graphs instead of narrative video descriptions.
  • The research reveals a surprising gap: AI models designed specifically for video analysis actually perform worse than text-only language models when answering questions about the timing and sequence of events in videos.
  • A novel two-step questioning approach significantly improved AI performance on temporal reasoning tasks, pointing toward new methods for helping AI better understand the order of events in long-form videos.

Researchers from Intel and the University of Catania have developed EASG-Bench, a novel benchmark that reveals critical gaps in how language-only large language models (LLMs) and video-LLMs understand long-form video content, particularly when tracking sequences of events over time. This testing framework uses structured scene graphs instead of traditional narrative text descriptions to evaluate the ability of multimodal generative artificial intelligence (GenAI) models to comprehend egocentric, or first-person, videos captured by wearable cameras. The team's approach uses egocentric action scene graphs (EASG) to create more than 1,800 question-answer pairs across 221 video clips that capture actions and the relationships between the camera wearer and objects. Available on GitHub, the research could have real-world applications across industries that rely on video analysis, from manufacturing quality control and security monitoring to healthcare patient observation and autonomous vehicle development.

The research uncovered a surprising finding that challenges assumptions about AI video processing: specialized video AI models actually performed worse than text-only language models when answering questions about event sequences. This counterintuitive result suggests that when video-LLMs are tuned to accommodate a new modality, their reasoning abilities may be diminished. However, the team discovered that a two-stage chain-of-thought prompting approach that explicitly captures temporal ordering could improve performance, bridging much of the gap between video and language-only models.

The Challenge of Understanding Video Sequences

Current AI video analysis systems struggle to understand the temporal flow of events in long, complex videos, particularly egocentric footage captured from a person's point of view. Traditional benchmarks rely on simple narrations that fail to capture the intricate relationships between objects, actions, and timing, creating significant gaps for applications that require a precise understanding of what happened when, such as manufacturing monitoring, security analysis, and activity recognition. While video-LLMs can identify objects present in videos, they often miss crucial details about sequence and context, failing to understand the meaningful relationships and causal chains that define human activities and their interactions with environments.

A Systematic Approach Using Scene Graphs

To address these limitations, the research team developed a systematic approach using egocentric action scene graphs as the foundation for creating more challenging and realistic video understanding tasks. Think of scene graphs as detailed maps that show not just what objects appear in a video, but exactly how they connect to each other and to the person performing actions.

The team used text-only LLMs to systematically generate the following types of questions from scene graphs: purpose questions that explore why objects are used, direct object questions focused on primary objects manipulated during an action, indirect object questions centered on secondary elements present during interactions, and temporal ordering questions that test understanding of event sequences.
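To make the idea concrete, here is a minimal Python sketch of how a single scene-graph segment might be represented and turned into prompts for each question category. The field names (verb, direct_object, indirect_object, order) and the prompt wording are illustrative assumptions, not the exact EASG-Bench schema or templates.

```python
# Illustrative only: field names and prompt templates are assumptions,
# not the exact EASG-Bench schema.

scene_graph_segment = {
    "verb": "pour",               # action performed by the camera wearer
    "direct_object": "water",     # primary object being manipulated
    "indirect_object": "kettle",  # secondary object involved in the interaction
    "order": 3,                   # position of the action in the clip's timeline
}

def build_question_prompts(segment: dict) -> dict:
    """Derive one candidate question per category from a single graph segment."""
    action = f"{segment['verb']} the {segment['direct_object']}"
    return {
        "purpose": f"Why does the camera wearer {action}?",
        "direct_object": f"What does the camera wearer {segment['verb']}?",
        "indirect_object": f"What does the camera wearer use when they {action}?",
        "temporal": f"What does the camera wearer do immediately after they {action}?",
    }

print(build_question_prompts(scene_graph_segment))
```

In the actual pipeline, a text-only LLM plays this role, reading the scene graph and producing both the question and its answer.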

Each question undergoes a rigorous two-stage filtering process to ensure it can only be answered by directly observing the video content, eliminating ambiguity and preventing multiple valid responses. This careful process resulted in 1,807 high-quality question-answer pairs that test an AI system's ability to understand video content at a deep level.
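The filtering stage can be pictured as an LLM-as-judge loop. The sketch below is a simplified assumption of how such a filter might look; `llm_judge` is a hypothetical yes/no callable standing in for whichever model the authors actually used, and the two criteria paraphrase the description above (video dependence and a single valid answer).

```python
from typing import Callable

def filter_qa_pairs(
    qa_pairs: list[dict],
    llm_judge: Callable[[str], bool],  # hypothetical yes/no judge backed by an LLM
) -> list[dict]:
    """Keep only QA pairs that require watching the video and have one valid answer."""
    kept = []
    for qa in qa_pairs:
        # Stage 1: discard questions answerable from common sense or text alone.
        needs_video = not llm_judge(
            "Can this question be answered correctly without watching the video? "
            f"Question: {qa['question']}"
        )
        # Stage 2: discard ambiguous questions with several plausible answers.
        unambiguous = llm_judge(
            "Given the scene graph, is the provided answer the only valid one? "
            f"Question: {qa['question']} Answer: {qa['answer']}"
        )
        if needs_video and unambiguous:
            kept.append(qa)
    return kept
```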


Figure 1. Language-only models outperform video models on before/after event-ordering questions in EASG-Bench.

Researchers evaluated a range of language-only LLMs and video-LLMs on the benchmark. Notably, video models such as Qwen2.5-VL-7B demonstrate effective use of visual signals and consistently outperform language-only baselines across most question types. However, video-LLMs struggle with temporal comprehension tasks, particularly those that require reasoning about which events occur before or after others in a sequence. As Figure 1 shows, the best language-only model scored 98.92% on ordering (before) questions, while the top-performing video model reached only 82.76% on the same tasks.


Figure 2. Effect of chain-of-thought prompting on temporal order questions (before and after type) with Qwen2.5-VL-7B.

Using a chain-of-thought prompting approach that breaks each temporal question into two steps, first locating the relevant action and then asking what comes before or after it, improved video model performance by 4.32 points on average, narrowing the gap with language-only systems (see Figure 2).
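A rough sketch of that two-step prompting pattern is shown below. The helper `ask_video_llm` is a hypothetical wrapper around a single video-LLM call (for example, to Qwen2.5-VL-7B on one clip); the prompt wording is illustrative, not the exact prompts used in the paper.

```python
from typing import Callable

def two_step_temporal_query(
    ask_video_llm: Callable[[str], str],  # hypothetical wrapper: prompt -> model answer for one clip
    anchor_action: str,
    direction: str = "after",  # "before" or "after"
) -> str:
    """Chain-of-thought in two turns: localize the anchor action, then ask the ordering question."""
    # Step 1: have the model find and describe the anchor action in the clip.
    localization = ask_video_llm(
        f"Find the moment where the camera wearer performs this action: '{anchor_action}'. "
        "Briefly describe what is happening at that moment."
    )
    # Step 2: ask the before/after question, grounded in the step-1 description.
    return ask_video_llm(
        f"Earlier you located this moment: {localization}\n"
        f"What does the camera wearer do immediately {direction} that action?"
    )
```

Grounding the second question in the model's own localization appears to be what helps: the ordering question no longer has to be resolved in a single pass over the whole clip.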

In other areas, such as questions about object manipulation (direct and indirect objects), video-LLMs significantly outperform language-only models, since these questions require video context for an accurate response (see Figure 1). Nonetheless, EASG-Bench remains challenging for all existing models.

Looking Forward: The Future of Video AI

This work represents an important step toward AI that doesn't just see what is happening in videos but also understands the deeper patterns of how and why events unfold over time. The results highlight the need for future research on spatio-temporal reasoning in long-form video understanding that goes beyond textual token sequences. Understanding how things interact in space and time could improve the ability of AI models to make predictions, navigate environments, and control systems.

In terms of potential real-world applications, manufacturing facilities could use improved video AI to automatically detect when assembly line tasks may be performed out of sequence, enabling real-time quality assurance. Healthcare providers could deploy these systems to monitor patient activities and ensure proper adherence to rehabilitation protocols, while autonomous vehicles could better understand the sequential actions of pedestrians and other drivers to predict their next moves and prevent accidents.

About the Author
Tz-Ying (Gina) Wu is an AI research scientist at Intel Labs, working on video and multimodal contextual learning. She received her Ph.D. from the University of California San Diego (UCSD), where she was advised by Prof. Nuno Vasconcelos in the Statistical Visual Computing Lab. Her Ph.D. research focused on long-tail recognition with realistic and evolving models, aiming to generate reliable predictions and adapt to continually changing distributions. Prior to that, she received her B.S. and M.S. from National Tsing Hua University (NTHU), where she worked with Prof. Min Sun on multimodal video-sensor fusion in the Vision Science Lab. Tz-Ying is active in the computer vision community, with several publications in top-tier venues (e.g., CVPR, ECCV, ICCV, NeurIPS) and more than 100 reviews for top-tier conferences, workshops, and journals.