
A Journey Towards Approaching “Why” Question-Answering for Video


Subarna Tripathi is a research scientist at Intel Labs, working in computer vision and machine learning.

Let’s take a quick journey through the strides made between 2012 and 2025, from simple image classification to recent video-LLMs, to understand how to approach “why” questions in video understanding.

The answering part of Q&A has changed from selecting one answer among many to “generating” one. Let’s start from the beginning: how did visual question answering work in its earlier days?

Assume there is a black box that converts an image, or each frame of a video, into a fixed-size vector called an embedding vector. From the early AlexNet (2012) to the relatively recent ViT (Vision Transformer, 2020), such converters have steadily improved. Once images are projected into that space, it becomes much easier for a computer to separate different classes using linear separators (called classifiers), something that is practically impossible in raw pixel space.

Figure 1. Image representation learning using classification (the field exploded with the introduction of AlexNet in 2012). Image credit: Trung Thanh Tran.
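To make the “black-box converter plus linear classifier” idea concrete, here is a minimal PyTorch sketch. It assumes a pretrained ViT from torchvision as the converter; the model choice, input size, and number of classes are placeholders, not any specific implementation from the work described here.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# The "black-box converter": a pretrained ViT that maps an image to a
# fixed-size embedding vector (768 dimensions for ViT-B/16).
encoder = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
encoder.heads = nn.Identity()   # drop the original classification head
encoder.eval()                  # keep the converter frozen

# A linear separator (classifier) on top of the embedding space.
num_classes = 10                # placeholder: depends on your dataset
classifier = nn.Linear(768, num_classes)

def predict(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, 224, 224), normalized the way the ViT weights expect."""
    with torch.no_grad():
        embeddings = encoder(images)          # (B, 768) embedding vectors
    return classifier(embeddings).argmax(-1)  # one class index per image
```

In practice only the linear classifier (and sometimes the encoder itself) is trained on the labeled data mentioned above.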

Quick note: before the deep learning era, the linear classifier was trained on a human-engineered feature space derived from those images, e.g., different combinations of edge-detected images that roughly captured a notion of shape. Deep learning gave us the freedom to learn those converters directly from labeled data, i.e., images and their corresponding classes. Since the embedding vectors need to be linearly separable, the classification objective directly shapes the embedding space. Understandably, the more data you provide, the more amenable the embedding space becomes for the classification task at hand. Please note that this applies beyond image classification: tasks such as object detection also use similar classification at their core.
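For contrast, here is a toy sketch of that pre-deep-learning recipe: hand-crafted gradient (HOG) features plus a linear classifier. The data is random and purely illustrative.

```python
import numpy as np
from skimage.feature import hog          # hand-crafted gradient features
from sklearn.svm import LinearSVC        # a linear classifier

def hog_features(images):
    """images: iterable of equally sized grayscale arrays (H, W)."""
    return np.stack([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

# Toy stand-ins for a real labeled dataset (purely illustrative).
rng = np.random.default_rng(0)
images = rng.random((32, 64, 64))        # 32 fake 64x64 grayscale "images"
labels = rng.integers(0, 2, size=32)     # 2 fake classes

clf = LinearSVC().fit(hog_features(images), labels)
print(clf.predict(hog_features(images[:4])))
```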


These embedding vectors enabled simple image classification. How did image question answering work in its initial days? You can’t expect the same answer to different questions even when the input image is the same, right? That means, at the very least, the system needs to convert the question to a vector and the image to a vector, and then somehow combine the two. Each question maps to a different vector, so their combination will hopefully be different, too.

 

Please note that the answers in these early VQA systems (introduced around 2016) were all about selecting one option among many; in other words, they were multiple-choice question-answering systems. The VQA model was trained as a classifier that chooses one answer out of k options, and it had to be trained for that particular dataset. The dataset changed? Or the options changed? You had to train, or at least fine-tune, the model again. So you essentially rely on question-answer pairs for each image as labels for training such models. Most of the time, the image-to-vector converter and the question-to-vector converter weren’t trained from scratch; they were kept frozen, and only the combination module and the final classifier were trained.

Figure 2. Visual question-answering system (the idea introduced around 2016). Image credit: Visual Question Answering.
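Here is a minimal sketch of such an early-style multiple-choice VQA head, assuming the image and question vectors come from frozen encoders; the dimensions, module names, and number of answer options are illustrative.

```python
import torch
import torch.nn as nn

class EarlyVQAHead(nn.Module):
    """Combine a frozen image vector with a frozen question vector, then
    classify over k candidate answers (multiple-choice VQA)."""
    def __init__(self, img_dim=2048, q_dim=768, hidden=1024, k_answers=1000):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(img_dim + q_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, k_answers)

    def forward(self, img_vec, q_vec):
        fused = self.fuse(torch.cat([img_vec, q_vec], dim=-1))
        return self.classifier(fused)          # logits over the k answer options

# Only the fusion layer and classifier are trained; the encoders stay frozen.
head = EarlyVQAHead()
logits = head(torch.randn(4, 2048), torch.randn(4, 768))   # a batch of 4
chosen = logits.argmax(-1)                                  # pick one of k answers
```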


It was a great start, but as explained, it did not generalize. Any time the test data changed, you had to fine-tune the VQA model again. The only good thing about such models is that they were very easy to evaluate: you can always measure the accuracy of a classifier, since it’s simply a matter of checking whether the chosen answer matches the ground truth.
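That evaluation reduces to exact-match accuracy over the chosen options; a tiny illustration with made-up values:

```python
def multiple_choice_accuracy(predicted, ground_truth):
    """Fraction of questions where the selected option matches the label."""
    correct = sum(int(p == g) for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

print(multiple_choice_accuracy([2, 0, 1, 3], [2, 1, 1, 3]))  # 0.75 (made-up values)
```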

Now there’s a paradigm shift in today’s video question-answering systems. They no longer select one answer among many; they actually generate the answer in free-form text, thanks to the power of large language models, popularly known as LLMs. You no longer need to retrain your system simply because the number of options has changed!

Why are LLMs so powerful? Quick recap time: LLMs are pre-trained on really, really large data without labels. They are optimized for next-word prediction, so they can easily leverage huge amounts of data to learn from. But next-word prediction isn’t the final application; most often, we want the LLM to answer whatever question we give it as input. In-context learning (ICL, 2019) and instruction tuning (IT) are two ways to get the desired output.

It’s amazing how “in-context learning” uses examples within the prompt itself at inference time (introduced with GPT-2, 2019), conditioning the model on those examples without updating any weights. On the other hand, instruction tuning (popularized by Google’s FLAN, 2022) fine-tunes the model’s parameters during training using instructions and examples, and doesn’t rely on examples at inference time. However, for both in-context learning and instruction tuning, we are talking about an actual application, such as question answering supplemented by labels, not the next-word prediction the language models were originally optimized for. The cool thing about this strategy is that pre-training already makes the internal / intermediate representations so powerful that, even after tuning with “not so many” labels, they generalize well. So, that’s LLMs in a nutshell. Now, how does visual Q&A leverage LLMs?
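The difference is easiest to see in prompt form. A hypothetical toy example (the task and examples are made up):

```python
# In-context learning: the examples live inside the prompt at inference time;
# no model weights are updated.
icl_prompt = """Q: What color is the sky on a clear day? A: blue
Q: What color is grass? A: green
Q: What color is a ripe banana? A:"""

# Instruction tuning: (instruction, response) pairs like this are used to
# fine-tune the model's parameters during training; at inference time only
# the instruction and question are given, with no worked examples.
instruction_tuning_sample = {
    "instruction": "Answer the question in one word.",
    "input": "What color is a ripe banana?",
    "output": "yellow",
}
```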

Figure 3. Visual instruction tuning, LLaVA (2023). Image credit: LLaVA.

 

The core idea is to convert the image into a vector (similar to the previous section), but this time with the constraint that the vector looks very much like a token used by the LLM. Basically, you map the image into the language space; not much training is needed, because the LLM has already been trained to follow instructions. You still need light-weight fine-tuning on an image question-answering dataset, simply because a pure LLM doesn’t take images as input. Now the system takes a free-form text question and an image as input, and by virtue of the LLM integration, you get free-form textual output: your desired answers.
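A minimal sketch of that idea (not LLaVA’s actual code): project frozen vision-encoder features into the LLM’s token-embedding space and prepend them to the embedded question tokens. The dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map frozen vision features (e.g., ViT patch embeddings) into the LLM's
    token-embedding space so the image looks like a sequence of tokens."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)   # the light-weight, trainable part

    def forward(self, vision_feats):                 # (B, n_patches, vision_dim)
        return self.proj(vision_feats)               # (B, n_patches, llm_dim)

# "Visual tokens" get concatenated with the embedded question and fed to the
# (mostly frozen) LLM; only the projector, plus optional light fine-tuning of
# the LLM, is trained on image question-answering data.
projector = VisionToLLMProjector()
visual_tokens = projector(torch.randn(1, 576, 1024))            # e.g., 24x24 patches
question_tokens = torch.randn(1, 32, 4096)                      # embedded question
llm_input = torch.cat([visual_tokens, question_tokens], dim=1)  # (1, 608, 4096)
```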

Up to this point, we have seen how LLMs are being super useful for chatting with an image. Such models are generally referred to as “multimodal LLMs”. LLaVA is one such useful and widely popular method, and plenty of similar multimodal LLMs exist. They generalize well. Since they generate answers instead of choosing one among many, you can apply such models without fine-tuning on a new domain or on any new image (called zero-shot) and often get decent-quality free-form textual output. Of course, with fine-tuning on the new domain or dataset, the quality of the answers becomes better.
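As a usage illustration, a zero-shot query against an open LLaVA checkpoint might look like the sketch below. This assumes the community llava-hf/llava-1.5-7b-hf model on Hugging Face and the transformers library; check the model card for the exact prompt template and processor arguments.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"            # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")                # placeholder image path
prompt = "USER: <image>\nWhy is the person holding an umbrella? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```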

Such amazing utility also comes at a cost: hallucination. Since the answers are generated, the output will often hallucinate, i.e., talk about objects and scenes that may not be present in the image at all. The other issue is that, since these are generative models, evaluation is not as straightforward as it was for classification models.

Next comes the question of interacting with videos. All the previous principles hold; however, a set of new challenges arrives. How do we convert a video into the language space? Should we first sample some video frames and convert each of them into the language space as above, or convert the whole video into something like a single token in the language space? Some video-LLM models pursue the first approach, while others use the second; the first is more popular and widely used. Please note that a similar light-weight tuning is also needed on a video question-answering dataset. By virtue of the generalization property of these generative models, you can apply the model to a different set of questions or videos and still expect decent free-form textual output, understandably with the inherent issue of hallucination.
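A minimal sketch of the first (frame-sampling) approach, reusing the projector idea from the previous section; the sampling rate, encoder, and projector here are placeholders passed in by the caller.

```python
import torch

def video_to_visual_tokens(frames, vision_encoder, projector, num_frames=8):
    """frames: (T, 3, H, W) tensor for a decoded video clip.
    Uniformly sample a few frames, encode each one, project the features into
    the LLM token space, and concatenate them along the sequence dimension."""
    t = frames.shape[0]
    idx = torch.linspace(0, t - 1, num_frames).long()   # uniform frame sampling
    sampled = frames[idx]                                # (num_frames, 3, H, W)
    with torch.no_grad():
        feats = vision_encoder(sampled)                  # (num_frames, n_patches, D)
    tokens = projector(feats)                            # (num_frames, n_patches, llm_dim)
    return tokens.flatten(0, 1).unsqueeze(0)             # (1, num_frames * n_patches, llm_dim)
```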

While the video-LLM literature is getting rich (see this curated list: https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding), the number of image-LLM papers is probably at least two orders of magnitude higher than the number of video-LLM papers.

Figure 4. Video-LLM general architecture (Video-LLaVA, 2024). Image credit: Intel Labs.

 

The current challenge the community faces is that not all kinds of questions are handled equally well by video-LLM models. Some questions can still be answered by looking at only one video frame, such as “how many persons are wearing (or not wearing) safety hats in this video?” Since the answer and the underlying reasoning can utilize “grounding” (meaning putting things into the image space), we call them spatial reasoning tasks. They are very different from questions that require an inherent ordering of actions, such as “what did the person do immediately after getting off the bike?” Answers to such questions cannot be grounded in space / individual images; video-LLMs may hallucinate more, and evaluating correctness becomes even more difficult.

Please note that even in the video-LLM case, you need some labeled data on which to train your model: each sample consists of a video, a question, and the expected output. The question and answer can be in free-form text, not restricted to multiple-choice Q&A as in the early days, when visual question answering emerged as a research field only a decade ago. In the last 12 months, there has been tremendous growth in algorithms and in deployment frameworks supporting video-LLMs.
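For concreteness, a single labeled training sample might look like this hypothetical record (field names and values are made up; real datasets use varying schemas):

```python
video_qa_sample = {
    "video": "clips/example_0042.mp4",   # path to the video (placeholder)
    "question": "What did the person do immediately after getting off the bike?",
    "answer": "He removed his helmet and walked toward the entrance.",  # free-form text
}
```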

Models such as Gemini 1.5 Pro from Google and GPT-4o from OpenAI are closed-source. On the other hand, open-source models like Qwen2.5-VL from Alibaba, and plenty of others, offer very high-quality video-LLM capability. However, we still observe that there is room to improve spatio-temporal reasoning. We hypothesize that the power of LLMs can be harnessed in more efficient ways for interacting with videos, especially for such temporal understanding tasks.

About the Author
Subarna Tripathi is a research scientist at Intel Labs, working in computer vision and machine learning. She leads a team of talented researchers focusing on visual algorithms research, with an emphasis on scene graphs, multimodal learning, and video understanding. She received her PhD in Electrical and Computer Engineering from the University of California San Diego. She has been an area chair for WiML and has served as a program committee member for venues such as CVPR, ECCV, ICCV, NeurIPS, WACV, AAAI, ICLR, and IEEE journals.