ProactiveVideoQA Benchmark

ProactiveVideoQA is a benchmark that evaluates whether AI models can produce concise verbal answers at the appropriate moment in a streaming video. Specifically, it tests whether a model can recognize when new visual information answers a pending question and speak up at that moment, rather than waiting for an explicit prompt. Each item poses a question (for example, "what color is the car?") in audio, then streams a video that contains the answer at a specific moment. The instruction to the model is: "watch the video and stay quiet until a new moment answers the question. When one happens, say a concise answer." The model's score depends both on whether it produced the correct answer and on whether it did so at the right moment.

The evaluation is technically demanding because the model must maintain a representation of the question across the entire video stream while simultaneously tracking visual content. It must recognize the semantic relevance of visual events to the question: not just that something happened, but that this specific event supplies the answer the question asks for. It must then decide whether to speak based on this assessment, and time the response to coincide with the relevant visual moment rather than before or after it.
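The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's harness or any model's actual architecture: the `Frame` type, the `relevance` and `answer` callables, and the `threshold` value are all hypothetical stand-ins, and frames are represented as short text descriptions instead of pixels.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class Frame:
    timestamp: float  # seconds from the start of the stream
    content: str      # hypothetical stand-in for visual features


def run_proactive_policy(
    question: str,
    frames: Iterable[Frame],
    relevance: Callable[[str, Frame], float],
    answer: Callable[[str, Frame], str],
    threshold: float = 0.5,  # assumed decision threshold, not a benchmark value
) -> Optional[Tuple[float, str]]:
    """Stay silent until a frame looks relevant to the question, then answer once."""
    for frame in frames:
        # The question is held for the whole stream and re-checked on every frame.
        if relevance(question, frame) >= threshold:
            return frame.timestamp, answer(question, frame)
    return None  # the policy never spoke


# Toy stand-ins for the two model capabilities (assumptions for illustration).
def toy_relevance(question: str, frame: Frame) -> float:
    return 1.0 if "car" in frame.content else 0.0


def toy_answer(question: str, frame: Frame) -> str:
    return frame.content.split()[0]  # first word of the description


stream = [
    Frame(0.0, "empty street"),
    Frame(1.0, "pedestrian crossing"),
    Frame(2.0, "red car passes"),
]
result = run_proactive_policy("what color is the car?", stream, toy_relevance, toy_answer)
print(result)  # (2.0, 'red')
```

The sketch separates the two judgments the paragraph names: `relevance` decides when to speak, and `answer` decides what to say. A real system would make both judgments from audio and video features, and the timing of the returned answer would be scored along with its content.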

The reported metric is a turn-weighted PAUC@ω=0.5, scaled to 0-100 and averaged across turns and categories. PAUC is a timing-aware metric that credits both what the model says and when it says it; ω is a parameter of the metric, set here to 0.5. Staying silent scores 25.0, meaning that the baseline of never speaking earns a quarter of the scale. Higher scores require correct answers at correct times; incorrect answers are penalized. Current models, including frontier commercial real-time APIs, perform near this baseline, indicating that proactive visual QA remains a genuinely unsolved problem.
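To make the scaling and the silent baseline concrete, here is a small aggregation sketch. It does not implement PAUC itself: per-turn scores are taken as given inputs in [0, 1], and the aggregation order (mean over turns within a category, then an unweighted mean over categories) is an assumption, since the exact weighting is not spelled out here. The category names and the `aggregate` function are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# From the text: staying silent scores 25.0 on the 0-100 scale,
# so a silent turn is assumed to score 0.25 on a 0-1 scale.
SILENT_TURN_SCORE = 0.25


def aggregate(turn_scores):
    """Aggregate (category, score) pairs, one per turn, into a 0-100 score.

    Assumed scheme: mean over turns within each category, then an
    unweighted mean over categories, scaled by 100.
    """
    by_category = defaultdict(list)
    for category, score in turn_scores:
        by_category[category].append(score)
    per_category = [mean(scores) for scores in by_category.values()]
    return 100.0 * mean(per_category)


# A model that never speaks earns the silent score on every turn.
always_silent = [
    ("web", SILENT_TURN_SCORE),
    ("web", SILENT_TURN_SCORE),
    ("ego", SILENT_TURN_SCORE),
]
print(aggregate(always_silent))  # 25.0

# A hypothetical model that sometimes answers correctly and on time.
mixed = [("web", 0.25), ("web", 0.75), ("ego", 0.5)]
print(aggregate(mixed))  # 50.0
```

Because every silent turn scores the same, the silent baseline comes out at 25.0 under any averaging scheme. A model only moves above that figure on turns where it speaks and the answer is both correct and well timed.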

The benchmark's design reflects the view that the ability to volunteer information proactively, rather than only in response to explicit questions, is what distinguishes an intelligent collaborator from a sophisticated search engine. ProactiveVideoQA was introduced by Wang et al. (2025), and Thinking Machines Lab has used it as part of their broader effort to evaluate the capabilities that matter most in real-time interactive AI.