Envisioning is an emerging technology research institute and advisory.

ProactiveVideoQA Benchmark

Benchmark testing whether models answer questions at the right visual moment without explicit prompting.

Year: 2026 | Generality: 480 | Added: May 12, 2026

ProactiveVideoQA is a benchmark that evaluates whether AI models can give concise verbal answers at the appropriate moment in a streaming video. Specifically, it tests whether a model can recognize when new visual information answers a question and speak up at that moment, rather than waiting for an explicit prompt. Each item pairs a question (for example, "what color is the car?"), presented in audio, with a streamed video that contains the answer at a specific moment. The instruction to the model is: "watch the video and stay quiet until a new moment answers the question. When one happens, say a concise answer." The model's score depends both on whether the answer is correct and on whether it is given at the right moment.
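The protocol described above can be sketched as a small streaming loop. This is only an illustrative sketch, not the benchmark's actual harness: `run_turn`, `Utterance`, `toy_model`, and the string "frames" are all hypothetical stand-ins, and a real system would consume audio and video rather than text labels.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Utterance:
    """One spoken reply and the stream time (seconds) at which it was produced."""
    time_s: float
    text: str


def run_turn(
    question: str,
    frames: Iterable[Tuple[float, str]],
    model_step: Callable[[str, str], Optional[str]],
) -> List[Utterance]:
    """Stream (timestamp, frame) pairs to the model and log every reply.

    `model_step` is a hypothetical callback: it returns None to stay quiet,
    or a short answer string when it decides the frame answers the question.
    """
    utterances = []
    for time_s, frame in frames:
        reply = model_step(question, frame)
        if reply:  # None or "" both count as silence
            utterances.append(Utterance(time_s, reply))
    return utterances


def toy_model(question: str, frame: str) -> Optional[str]:
    """Toy policy: speak only when the (text-labelled) frame shows the car."""
    return "red" if frame == "red car enters" else None


# Toy stream: the answer becomes visible at t = 2.0 s.
frames = [
    (0.0, "empty street"),
    (1.0, "empty street"),
    (2.0, "red car enters"),
    (3.0, "empty street"),
]
log = run_turn("what color is the car?", frames, toy_model)
```

A scorer would then compare each logged utterance against the reference answer and the moment at which the answer appears in the video; that scoring step is not shown here.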

The evaluation is technically demanding because the model must hold the question in mind across the entire video stream while simultaneously tracking the visual content. It must recognize the semantic relevance of visual events to the question: not just that something happened, but that this specific event answers what was asked. It must then decide whether to speak, and time the response to coincide with the relevant visual moment rather than arriving before or after it.

The metric reported is a turn-weighted PAUC@ω=0.5 (probability of audible utterance at threshold 0.5), scaled to 0-100 and averaged across turns and categories. Staying silent scores 25.0, so a model that never speaks still earns a quarter of the scale. Higher scores require correct answers at the correct times; incorrect answers are penalized. Current models, including frontier commercial real-time APIs, perform near this baseline, indicating that proactive visual QA remains a genuinely unsolved problem.
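The aggregation step can be illustrated with a short sketch. It does not compute PAUC itself; it assumes each turn already has a score normalized to [0, 1], assumes "turn-weighted" means every turn counts equally regardless of category, and assumes a silent turn contributes 0.25 (consistent with the 25.0 silence baseline stated above). The function name and the category labels are hypothetical.

```python
from typing import List, Tuple

# Assumption: a silent turn scores 0.25, matching the 25.0 baseline in the text.
SILENT_TURN_SCORE = 0.25


def turn_weighted_score(turn_scores: List[Tuple[str, float]]) -> float:
    """Pool (category, score) pairs so each turn has equal weight.

    Scores are assumed to lie in [0, 1]; the result is scaled to 0-100.
    """
    if not turn_scores:
        raise ValueError("need at least one scored turn")
    total = sum(score for _category, score in turn_scores)
    return 100.0 * total / len(turn_scores)


# A model that never speaks lands on the silence baseline.
always_silent = [("color", SILENT_TURN_SCORE)] * 3
baseline = turn_weighted_score(always_silent)  # 25.0

# One timely correct answer, two silent turns, one penalized wrong answer.
mixed = [("color", 1.0), ("color", 0.25), ("action", 0.0), ("action", 0.25)]
mixed_score = turn_weighted_score(mixed)  # 37.5
```

Under these assumptions a wrong answer (0.0) scores below silence (0.25), which is why speaking up only helps when the answer is both correct and well timed.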

The benchmark's design reflects a growing recognition that the ability to volunteer information proactively, rather than only in response to explicit questions, is what distinguishes an intelligent collaborator from a sophisticated search engine. ProactiveVideoQA was created by Thinking Machines Lab as part of its wider effort to develop evaluation frameworks for the capabilities that matter most in real-time interactive AI.