Visual Interjections

Visual interjections are model-initiated responses triggered by changes or events observed in the video stream rather than by audio or text cues. Unlike verbal interjections, which are spoken contributions made while the user is speaking, visual interjections can take any output form: the model might speak, generate text, produce a UI element, or trigger a tool. The defining characteristic is that the trigger originates in the visual channel, not in an explicit user request or verbal cue. For example, if a user asks the model to "tell me when I write a bug in my code" while sharing their screen, the request only sets up a standing instruction. The model must then watch the code as it is written and respond proactively when it detects a bug, without the user pausing or explicitly prompting the response.

This capability is out of reach for systems that process audio and video through separate harnesses and respond only to complete audio turns. A voice-activity-detection-based dialog manager that waits for silence before invoking the model cannot respond to a visual event that occurs during active speech. The model must instead process the video stream continuously, in parallel with all other inputs and outputs, which requires the same time-aligned micro-turn architecture that enables full-duplex audio interaction.
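The contrast between the two designs can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions: the `Tick` structure, the function names, and the `"bug_detected"` event label are hypothetical, and visual event detection is assumed to have already happened upstream. In this simplified model the silence-gated manager surfaces a visual event only once the speech turn ends, whereas the micro-turn loop can act in the same time slice in which the event appears.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tick:
    """One time-aligned slice of input (hypothetical structure)."""
    t: int                        # index of the time slice
    user_speaking: bool           # voice activity during this slice
    visual_event: Optional[str]   # event seen in the video frame, if any


def vad_gated_responses(ticks: List[Tick]) -> List[Tuple[int, str]]:
    """Invoke the model only when a speech turn ends (speech -> silence)."""
    responses = []
    pending = []          # visual events seen but not yet acted on
    prev_speaking = False
    for tick in ticks:
        if tick.visual_event is not None:
            pending.append(tick.visual_event)
        # The dialog manager hands control to the model only at end of turn.
        if prev_speaking and not tick.user_speaking:
            responses.extend((tick.t, event) for event in pending)
            pending.clear()
        prev_speaking = tick.user_speaking
    return responses


def micro_turn_responses(ticks: List[Tick]) -> List[Tuple[int, str]]:
    """Invoke the model on every slice, so it can react as events occur."""
    return [
        (tick.t, tick.visual_event)
        for tick in ticks
        if tick.visual_event is not None
    ]


# The user speaks during slices 0-4; a bug appears on screen at slice 2.
timeline = [
    Tick(0, True, None),
    Tick(1, True, None),
    Tick(2, True, "bug_detected"),
    Tick(3, True, None),
    Tick(4, True, None),
    Tick(5, False, None),
]

print(vad_gated_responses(timeline))   # [(5, 'bug_detected')]
print(micro_turn_responses(timeline))  # [(2, 'bug_detected')]
```

In this toy timeline the gated manager reacts three slices late, after the user has stopped talking. If no speech turn ever ends, it never reacts at all, because `pending` is only flushed at a speech-to-silence transition.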

The most compelling demonstrations of visual interjections include proactive notification when a specific visual event occurs (counting reps while someone exercises, alerting when a chart shows a specific pattern), real-time commentary on shared visual content (sports game narration, live tutorial guidance), and visual error correction (flagging a miswritten line of code as it appears). The RepCount-A and ProactiveVideoQA benchmarks are designed to measure this capability, and current frontier models score near zero on them, which indicates a genuine capability gap rather than a solved problem.

Visual interjections raise the stakes for safety and calibration. A model that watches a video stream continuously is in a fundamentally different relationship with the user's visual environment than a model that only processes explicit requests. The potential for unintended observation, the question of when the model should and should not comment on what it sees, and the calibration of proactive versus reactive behavior in the visual domain are all active areas of research.