AI model that handles interaction natively in real time across audio, video, and text.
An interaction model is an AI system that perceives and responds in real time across audio, video, and text streams, rather than following the request-response pattern of turn-based models. A turn-based model waits for the user to finish typing or speaking before it starts processing. An interaction model instead processes incoming audio and video while simultaneously generating output, which enables the fluid back-and-forth characteristic of human conversation.
The architecture centers on micro-turns: small chunks of input (typically around 200 milliseconds) processed as concurrent streams rather than as complete user turns. This time-aligned design eliminates artificial turn boundaries and lets the model interleave perception and generation across all modalities. A background reasoning model handles deeper, longer-horizon work asynchronously while the interaction model stays present: it answers follow-ups, processes new input, and integrates the reasoner's results as they arrive. The aim is to give users responsive interaction without sacrificing the depth of a slower model.
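The split between a fast micro-turn loop and a slower background reasoner can be illustrated with a small simulation. This is a minimal sketch, not a description of any real system: `MicroTurn`, `chunk_stream`, `background_reasoner`, and `interaction_loop` are hypothetical names, the "reasoner" is a stub that simply waits a few ticks, and strings stand in for audio and video chunks. Only the 200 ms chunk length comes from the text.

```python
import asyncio
from dataclasses import dataclass

MICRO_TURN_MS = 200  # chunk length quoted in the text; all names below are hypothetical


@dataclass
class MicroTurn:
    index: int
    audio: str  # stand-in for ~200 ms of audio
    video: str  # stand-in for the frames in the same window


def chunk_stream(audio, video):
    """Pair time-aligned audio and video chunks into micro-turns."""
    return [MicroTurn(i, a, v) for i, (a, v) in enumerate(zip(audio, video))]


async def background_reasoner(query, steps, tick_s):
    """Hypothetical slow model: needs `steps` ticks before it can answer."""
    for _ in range(steps):
        await asyncio.sleep(tick_s)
    return f"answer({query})"


async def interaction_loop(turns, tick_s=0.0, steps=3):
    """Process one micro-turn per tick without blocking on the reasoner."""
    log = []        # (micro-turn index, what the model emitted)
    pending = None  # at most one background job in this sketch
    for turn in turns:
        if pending is not None and pending.done():
            # Integrate the background result as soon as it is available.
            log.append((turn.index, pending.result()))
            pending = None
        elif turn.audio == "ask" and pending is None:
            # Hand the hard question off and acknowledge it immediately.
            pending = asyncio.create_task(
                background_reasoner(turn.audio, steps, tick_s)
            )
            log.append((turn.index, "ack"))
        else:
            log.append((turn.index, "listen"))
        # tick_s=0 only yields to the event loop (simulation);
        # MICRO_TURN_MS / 1000 would pace the loop in real time.
        await asyncio.sleep(tick_s)
    if pending is not None:  # stream ended first: wait for the late result
        log.append((len(turns), await pending))
    return log


def run_demo():
    audio = ["hi", "ask", "more", "more", "more", "more", "more"]
    video = ["frame"] * len(audio)
    return asyncio.run(interaction_loop(chunk_stream(audio, video)))
```

Calling `run_demo()` returns a log with an acknowledgment at the micro-turn where the question arrives, further "listen" entries while the reasoner works, and the answer a few micro-turns later. The point of the design is that the loop never awaits the reasoner directly, so perception continues while the slow work runs.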
Deploying interaction models imposes significant infrastructure demands. Continuous audio and video streams require reliable low-latency connectivity, and a degraded connection noticeably worsens the experience. Maintaining real-time presence also costs more compute than turn-based inference. Most production AI systems today remain turn-based for reliability and simplicity, so organizations must weigh real-time interactivity against operational complexity. Smaller or specialized models, such as Moshi, PersonaPlex, or Nemotron VoiceChat, demonstrate that interactivity can be achieved at smaller scale, but they tend to prioritize latency over intelligence benchmarks.
Long-context sessions remain an active challenge: continuous streams accumulate context quickly and require careful management. Safety calibration differs in real-time settings, demanding modality-appropriate refusals and robustness across extended conversations. How coordination between fast interaction models and deeper background reasoners should work at scale is still being defined. It is also unclear whether interaction-native capability will become commoditized infrastructure or remain a source of competitive differentiation as the field develops.
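To make the context-growth problem concrete, the sketch below first counts how many micro-turns a session produces at the 200 ms chunk length mentioned earlier, then shows one possible management policy: keep the newest chunks verbatim and fold older ones into a running summary. The policy, the `RollingContext` class, and the `summarize` callable are assumptions chosen for illustration; the text does not say how any particular system manages long sessions.

```python
from collections import deque

MICRO_TURN_MS = 200  # chunk length quoted in the text


def micro_turns_in(seconds, micro_turn_ms=MICRO_TURN_MS):
    """Number of whole micro-turns that fit in `seconds` of streaming."""
    return int(seconds * 1000) // micro_turn_ms


class RollingContext:
    """Keep the newest `max_recent` chunks verbatim; fold older ones
    into a running summary (a hypothetical policy, for illustration)."""

    def __init__(self, max_recent, summarize):
        self.max_recent = max_recent
        # Hypothetical callable: (old_summary, evicted_chunk) -> new_summary
        self.summarize = summarize
        self.recent = deque()
        self.summary = ""

    def add(self, chunk):
        self.recent.append(chunk)
        # Evict from the oldest end until the verbatim window fits.
        while len(self.recent) > self.max_recent:
            oldest = self.recent.popleft()
            self.summary = self.summarize(self.summary, oldest)

    def window(self):
        """What would be passed to the model: summary first, then recent chunks."""
        prefix = [self.summary] if self.summary else []
        return prefix + list(self.recent)
```

At 200 ms per micro-turn, `micro_turns_in(3600)` gives 18,000 micro-turns for a one-hour session, which is why an unbounded history is impractical. In a real system the `summarize` step would itself be a model call or a compression scheme; here a caller can pass something as simple as string concatenation to see the eviction behavior.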