Shao-Yen Tseng is a research scientist at Intel Labs, focusing on developing next-generation multimodal models for generative AI.
Highlights
- Understanding the internal mechanisms of large vision-language models (LVLMs) is a complex task.
- LVLM-Interpret helps users understand a model's internal decision-making processes and identify potential responsible artificial intelligence (AI) issues, such as biases or incorrect associations.
- The tool adapts multiple interpretability methods to LVLMs for interactive analysis, including raw attention, relevancy maps, and causal interpretation.
In the rapidly advancing field of generative AI, large vision-language models have emerged as powerful tools capable of jointly processing and interpreting both visual and textual data. These models can integrate both types of data to generate meaningful responses, making them highly versatile for a wide range of multimodal tasks, such as image captioning, visual question answering, and human-AI interaction involving visual content. Despite their impressive capabilities, the underlying mechanisms of these models remain difficult to understand, which poses a challenge for the responsible use of AI. To enable transparency and explainability, Intel Labs collaborated with Microsoft Research Asia on LVLM-Interpret, an interactive tool designed to enhance the interpretability of LVLMs by providing detailed visualizations of their inner workings.
By shedding light on the model's internal decision-making processes, LVLM-Interpret helps users identify potential issues within LVLMs, such as biases or incorrect associations between visual and textual elements. The tool contributes to increased transparency in AI models by showing how a model arrives at its conclusions. For example, Figure 1 shows how probing the model can reveal when responses are generated without relevance to the image, that is, without “looking” at it. LVLM-Interpret can surface examples where the model relies on preconceptions about a visual scene, rather than the input image, to answer queries. This ability to interpret model responses is crucial for building trust in AI systems, particularly in applications where understanding model behavior is essential for making well-informed decisions.
Figure 1. Visualization of relevancy heatmaps. Presented with an unchanging image of a garbage truck, the model gives contradictory responses (“Yes, the door is open” vs. “Yes, the door is closed”) depending on the query’s phrasing. Relevancy maps and bar plots for the “open” and “closed” tokens show higher relevance to the text than to the image.
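To make this kind of relevancy comparison concrete, the snippet below sketches how relevancy can be propagated through a Transformer's attention layers and then split into image and text contributions, in the spirit of the gradient-weighted relevancy-map methods that LVLM-Interpret adapts. It uses random tensors as stand-ins for the attention weights and gradients that would come from a real forward/backward pass for a chosen output token, so the shapes and the helper function are illustrative rather than the tool's actual implementation.

```python
# Minimal sketch of propagating per-token relevancy through self-attention
# layers (gradient-weighted attention). Tensors are random stand-ins; in
# practice they come from a forward/backward pass of the LVLM for one
# generated token.
import torch

def relevancy_map(attentions, gradients):
    """attentions, gradients: lists of (heads, seq, seq) tensors, one per layer."""
    seq_len = attentions[0].shape[-1]
    R = torch.eye(seq_len)                      # each token starts fully self-relevant
    for A, G in zip(attentions, gradients):
        cam = (G * A).clamp(min=0).mean(dim=0)  # gradient-weighted heads, positive part only
        R = R + cam @ R                         # propagate relevancy through this layer
    return R

# Toy shapes: 8 heads, 16 image-patch tokens followed by 8 text tokens.
layers, heads, n_img, n_txt = 4, 8, 16, 8
seq = n_img + n_txt
attns = [torch.rand(heads, seq, seq).softmax(dim=-1) for _ in range(layers)]
grads = [torch.randn(heads, seq, seq) for _ in range(layers)]

R = relevancy_map(attns, grads)
token_row = R[-1]                               # relevancy of the final (generated) token
image_rel = token_row[:n_img].sum().item()      # how much the answer leans on the image...
text_rel = token_row[n_img:].sum().item()       # ...versus the text prompt (cf. Figure 1 bars)
print(f"image relevancy: {image_rel:.3f}, text relevancy: {text_rel:.3f}")
```

Comparing the summed image relevancy against the summed text relevancy for a given answer token is what produces bar plots like those in Figure 1, where a text-dominated answer signals that the model may not be grounding its response in the image.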
Understanding LVLM-Interpret
In most use cases of generative AI today, the reasoning or supporting evidence behind generated responses is obscured. Transparency improves confidence in a model and helps ensure trustworthiness, which is especially important in high-stakes domains such as healthcare and law. LVLM-Interpret provides some of this transparency by allowing users to explore the internal operations of LVLMs through visualization and analysis of their attention mechanisms. The interface is designed to reveal which image patches are instrumental in generating an answer and to assess how effectively the language model grounds its output in the image. The tool allows users to systematically investigate the model and uncover its limitations, paving the way for enhancements in system capabilities.
Key Features of LVLM-Interpret
- Interactive visualization: LVLM-Interpret provides an interactive interface where users can upload images and pose multimodal queries through a chatbot-like interaction. The tool then visualizes the attention weights between image patches and textual tokens, giving a clear representation of which parts of the image and text the model is focusing on during answer generation.
- Attention analysis: The tool enables users to examine raw attention values, allowing a deeper investigation into the interactions between visual and textual tokens. This feature helps users understand which components of the visual and textual inputs the model deems most relevant when generating an output (a minimal extraction sketch follows this list).
- Relevancy maps and causal graphs: LVLM-Interpret generates relevancy maps and causal graphs, offering another tool for interpreting the relevance of input image to the generated answer. These analysis methods help in identifying specific parts of the input relevant to a model's final response, as well as the causal relationship between the image patches and output.
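As a rough illustration of the raw attention analysis mentioned above, the sketch below pulls text-to-image attention weights out of a LLaVA-style model with Hugging Face Transformers and reshapes them into a patch-grid heatmap for one generated token. The model id, prompt, and image file are placeholders, and the code assumes a recent Transformers version in which the processor expands the <image> placeholder into one input id per image patch; LVLM-Interpret wraps this kind of extraction in an interactive interface rather than exposing it as a script.

```python
# Minimal sketch: inspecting text-to-image attention in a LLaVA-style LVLM.
# Model id, prompt, and image path are illustrative, not LVLM-Interpret's code.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto",
    attn_implementation="eager",  # eager attention so weights can be returned
)

image = Image.open("garbage_truck.jpg")
prompt = "USER: <image>\nIs the door open? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)

# Generate while keeping the attention weights for every decoding step.
out = model.generate(**inputs, max_new_tokens=20,
                     output_attentions=True, return_dict_in_generate=True)

# Locate the image-patch positions in the input sequence (assumes the processor
# expanded <image> into one token id per patch).
image_token_id = model.config.image_token_index
image_positions = (inputs["input_ids"][0] == image_token_id).nonzero(as_tuple=True)[0]

# out.attentions[t] is a tuple over layers for generated token t, each tensor
# shaped (batch, heads, query_len, key_len). Take the first generated token
# after the prefill, average the last layer's heads, and keep only the
# key positions that correspond to image patches.
step, layer = 1, -1
attn = out.attentions[step][layer][0]                    # (heads, 1, key_len)
patch_attn = attn.mean(dim=0)[0, image_positions]        # (num_patches,)
side = int(patch_attn.numel() ** 0.5)                    # e.g. 24 for a 24x24 patch grid
heatmap = patch_attn.reshape(side, side).float().cpu()   # overlay this on the input image
print(heatmap.shape, heatmap.sum().item())
```

In LVLM-Interpret this per-token heatmap is what the user sees when selecting a word in the response, and the layer and head indices fixed here are exactly the knobs the interface exposes interactively.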
Figure 2. Visualization of text-to-vision attention heatmaps. LVLM-Interpret provides an interactive interface for exploring the attention heatmaps at different layers and heads of the Transformer model. A user can select words in the model's response to inspect the attention given to the image when generating those words.
Transparency in LVLM Model Behavior
Tools such as LVLM-Interpret are a first step toward demystifying the behavior of large vision-language models. By providing interactive, detailed visualizations of model behavior, LVLM-Interpret serves as a useful tool for developers, researchers, and users to enhance the transparency and reliability of AI systems. It not only aids in interpreting and debugging models but also fosters a greater understanding of the complexities of multimodal integration, which in turn could be crucial for improving models not only in accuracy but also in safety and trustworthiness.
LVLM-Interpret was jointly developed by Gabriela Ben Melech Stan*, Raanan Yehezkel Rohekar*, Yaniv Gurwicz*, Matthew Lyle Olson*, Anahita Bhiwandiwalla*, Estelle Aflalo*, Chenfei Wu+, Nan Duan+, Shao-Yen Tseng*, and Vasudev Lal*.
*Intel Labs
+Microsoft Research Asia