AI systems that answer natural language questions about images or videos.
Visual Question Answering (VQA) is a multimodal AI task in which a system receives an image (or video) paired with a natural language question and must produce a correct, contextually grounded answer. Unlike pure image classification or captioning, VQA demands that a model reason jointly over visual and linguistic inputs — understanding not just what is depicted, but what specific aspect of the scene the question targets. A question like "How many red objects are to the left of the chair?" requires spatial reasoning, color recognition, and object counting all at once, making VQA a demanding benchmark for general visual intelligence.
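To make the task format concrete, the sketch below poses a question about an image to an off-the-shelf VQA model. It assumes the Hugging Face transformers library and the publicly released ViLT checkpoint fine-tuned on VQA v2 ("dandelin/vilt-b32-finetuned-vqa"); neither is mentioned in the text above, and the image path is a placeholder.

```python
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load a ViLT model fine-tuned on VQA v2 (assumed checkpoint, not from the text above).
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("kitchen.jpg")  # placeholder path; any RGB image works
question = "How many red objects are to the left of the chair?"

# The processor jointly tokenizes the question and preprocesses the image.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits

# This model treats VQA as classification over a fixed vocabulary of common answers.
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```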
Most VQA architectures follow a common pipeline: a visual encoder (typically a convolutional neural network or, more recently, a vision transformer) extracts feature representations from the image, while a language encoder (such as an LSTM or BERT-style transformer) encodes the question. These representations are then fused — through attention mechanisms, bilinear pooling, or cross-modal transformers — and passed to a classifier or generative decoder that produces the answer. Modern large vision-language models such as Flamingo and GPT-4V, together with CLIP-style contrastive encoders used as visual backbones, have dramatically improved VQA performance by pretraining on massive image-text corpora, enabling richer cross-modal alignment.
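As an illustration of this pipeline, here is a minimal PyTorch sketch of the classic design: pre-extracted image region features, an LSTM question encoder, question-guided attention as the fusion step, and a classifier over a fixed answer vocabulary. All dimensions, names, and design choices are illustrative assumptions, not taken from any particular published model.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Minimal VQA pipeline: image features + question encoding -> fusion -> answer classifier."""

    def __init__(self, vocab_size, num_answers, img_feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Language encoder: embed question tokens, run an LSTM, keep the final hidden state.
        self.embedding = nn.Embedding(vocab_size, 300, padding_idx=0)
        self.lstm = nn.LSTM(300, hidden_dim, batch_first=True)
        # Project pre-extracted image features (e.g. from a CNN or ViT backbone)
        # into the same space as the question representation.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Question-guided attention over image regions (simple additive scoring).
        self.attn = nn.Linear(hidden_dim * 2, 1)
        # Fusion of the two modalities followed by classification over a fixed answer vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feats, question_ids):
        # img_feats: (batch, num_regions, img_feat_dim) region features from a visual backbone
        # question_ids: (batch, seq_len) token ids of the question
        _, (h, _) = self.lstm(self.embedding(question_ids))
        q = h[-1]                                    # (batch, hidden_dim)
        v = self.img_proj(img_feats)                 # (batch, num_regions, hidden_dim)
        # Attend over image regions conditioned on the question.
        q_exp = q.unsqueeze(1).expand(-1, v.size(1), -1)
        scores = self.attn(torch.cat([v, q_exp], dim=-1))   # (batch, num_regions, 1)
        weights = torch.softmax(scores, dim=1)
        v_attended = (weights * v).sum(dim=1)        # (batch, hidden_dim)
        # Fuse the attended visual features with the question and predict the answer.
        return self.classifier(torch.cat([v_attended, q], dim=-1))

# Example shapes: 36 region features of size 2048, a 10-token question, 3000 candidate answers.
model = SimpleVQA(vocab_size=10000, num_answers=3000)
logits = model(torch.randn(2, 36, 2048), torch.randint(1, 10000, (2, 10)))
print(logits.shape)  # torch.Size([2, 3000])
```

Framing the answer step as classification over a few thousand frequent answers is the choice made by most early benchmark models; large vision-language models instead generate the answer with a text decoder.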
The field was galvanized by the release of the VQA v1 dataset in 2015, which provided hundreds of thousands of open-ended questions over real and abstract images, along with a public leaderboard. Subsequent datasets — VQA v2, GQA, and OK-VQA — addressed biases and introduced compositional and knowledge-based reasoning challenges. These benchmarks revealed that early models often exploited statistical shortcuts rather than genuine visual understanding, spurring research into more robust reasoning architectures.
VQA has significant real-world impact. It underpins accessibility tools that help visually impaired users query their environment through a camera, powers visual search and content moderation systems, and serves as a core evaluation task for general-purpose vision-language models. Because it sits at the intersection of perception, language understanding, and reasoning, progress in VQA tends to reflect — and drive — broader advances in multimodal AI.