Turn-Weighted PAUC

Turn-Weighted PAUC@ω=0.5 (Turn-Weighted Probability of Audible Utterance at threshold 0.5) is an evaluation metric used in ProactiveVideoQA and similar streaming video benchmarks to measure whether an AI model speaks at appropriate moments in response to visual events. The metric captures both whether the model spoke when it should have (sensitivity to relevant visual moments) and whether it refrained from speaking when it should not have (specificity against irrelevant moments), weighted by the turn structure of the interaction.

The PAUC component measures the probability of audible utterance at a given confidence threshold across multiple examples. Rather than scoring accuracy at a single moment, it evaluates the model's behavior across the full range of possible moments, asking whether the model's speech was appropriately timed relative to the visual events in the video. The "turn-weighted" variant accounts for the fact that some visual events matter more than others for the given question, so a model that speaks at an unimportant moment is penalized less than one that fails to speak at a critical moment.
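The weighting idea can be illustrated with a small sketch. This is not the benchmark's actual formula, which is not given here. It is a toy weighted average that assumes each turn carries a hypothetical importance weight and a binary speak/silent annotation. The names Turn, weight, should_speak, did_speak and turn_weighted_score are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    weight: float       # hypothetical importance of this moment for the question
    should_speak: bool  # annotation: was a response warranted here?
    did_speak: bool     # what the model actually did


def turn_weighted_score(turns: list[Turn]) -> float:
    """Toy score: weighted share of correct speak/silent decisions, on a 0-100 scale.

    Illustrative assumption only; not the ProactiveVideoQA definition.
    """
    total = sum(t.weight for t in turns)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    correct = sum(t.weight for t in turns if t.should_speak == t.did_speak)
    return 100.0 * correct / total


# One critical moment (weight 3) and one unimportant moment (weight 1).
# Model A answers the critical moment but also speaks at the unimportant one.
model_a = [Turn(3.0, True, True), Turn(1.0, False, True)]
# Model B stays silent throughout and so misses the critical moment.
model_b = [Turn(3.0, True, False), Turn(1.0, False, False)]

print(turn_weighted_score(model_a))  # 75.0
print(turn_weighted_score(model_b))  # 25.0
```

Under these assumed weights, the unnecessary utterance costs Model A 25 points, while the missed critical moment costs Model B 75 points. This is the asymmetry that the turn-weighted variant is meant to capture.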

The metric is reported on a 0-100 scale, where 25.0 is the baseline obtained by always staying silent (the model's default behavior when it does not understand the task). A score of 50.0 indicates roughly random performance, and scores above 75.0 indicate genuinely useful proactive behavior. Current models tested on ProactiveVideoQA score near the 25.0 baseline, suggesting that proactive visual QA remains an open problem.
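The landmarks on this scale can be turned into a small lookup, sketched below. Only the three landmark values (25.0, 50.0, 75.0) come from the description above. The function name describe_score, the constant names, and the labels for the in-between ranges are assumptions made for illustration.

```python
SILENT_BASELINE = 25.0   # score of a model that always stays silent
RANDOM_LEVEL = 50.0      # roughly random performance
USEFUL_THRESHOLD = 75.0  # above this, proactive behavior is considered useful


def describe_score(score: float) -> str:
    """Map a 0-100 score to a coarse band relative to the stated landmarks."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must lie in [0, 100]")
    if score > USEFUL_THRESHOLD:
        return "useful proactive behavior"
    if score >= RANDOM_LEVEL:
        return "at or above roughly random"
    if score > SILENT_BASELINE:
        return "between silent baseline and random"
    return "at or below the always-silent baseline"


print(describe_score(26.0))
print(describe_score(80.0))
```

A model scoring 26.0 falls just above the always-silent baseline, which is where current models are reported to sit.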

Turn-Weighted PAUC is notable because it operationalizes a concept that is usually left implicit in AI evaluation: the appropriateness of timing. Most AI benchmarks measure whether an output is correct, independent of when it was produced. For turn-based systems, timing is not a variable, because outputs always follow inputs. For interaction models, however, timing is a first-class dimension of quality, and metrics like Turn-Weighted PAUC are attempts to quantify it rigorously.