Envisioning is an emerging technology research institute and advisory.

Interaction Model

AI model that handles interaction natively in real time across audio, video, and text.

Year: 2025 | Generality: 720 | Added: May 12, 2026

An interaction model is an AI system that perceives and responds continuously, in real time, across audio, video, and text streams, rather than following the request-response pattern of turn-based models. A turn-based model waits for the user to finish typing or speaking before it starts processing. An interaction model keeps processing incoming audio and video while it generates output, which enables the fluid back-and-forth of human conversation.

The architecture centers on micro-turns: small chunks of input (typically around 200 milliseconds) processed as concurrent streams rather than as complete user turns. This time-aligned design removes artificial turn boundaries and lets the model interleave perception and generation across all modalities. A background reasoning model handles deeper, longer-horizon work asynchronously while the interaction model stays present: it answers follow-ups, processes new input, and integrates results as they arrive. The aim is to give users responsive interaction without giving up the depth of a slower reasoning model.
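The micro-turn loop described above can be illustrated with a small sketch. It is not the architecture of any particular system: the 16 kHz sample rate, the function names, and the scaled-down sleep times are all assumptions made for illustration, and the "models" are placeholders that only record what they would do.

```python
import asyncio

MICRO_TURN_MS = 200  # approximate micro-turn length mentioned in the text


def split_into_micro_turns(samples, sample_rate_hz, micro_turn_ms=MICRO_TURN_MS):
    """Split a sequence of audio samples into fixed-length micro-turns."""
    chunk = sample_rate_hz * micro_turn_ms // 1000  # samples per micro-turn
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]


async def background_reasoner(question):
    """Hypothetical stand-in for a slower, deeper reasoning model."""
    await asyncio.sleep(0.05)  # pretend the deep work takes a while
    return f"deep answer to {question!r}"


async def interaction_loop(micro_turns):
    """Handle micro-turns one by one while a background task runs."""
    log = []
    task = asyncio.create_task(background_reasoner("long question"))
    for index, chunk in enumerate(micro_turns):
        # Perception and generation happen in the same time step.
        log.append(f"turn {index}: heard {len(chunk)} samples, emitted a reply chunk")
        # Integrate the background result as soon as it is available.
        if task is not None and task.done():
            log.append(f"turn {index}: integrated {task.result()}")
            task = None
        await asyncio.sleep(0.02)  # stand-in for real time, scaled down from 200 ms
    if task is not None:
        # The stream ended before the background work finished.
        log.append(f"after stream: integrated {await task}")
    return log


if __name__ == "__main__":
    one_second = list(range(16000))  # assumed 16 kHz mono audio
    for line in asyncio.run(interaction_loop(split_into_micro_turns(one_second, 16000))):
        print(line)
```

The point of the sketch is the control flow: the loop never blocks on the background reasoner, so the interaction side keeps handling input at every micro-turn and folds in the deeper result whenever it becomes available.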

Deploying interaction models imposes significant infrastructure demands. Continuous audio and video streams require reliable low-latency connectivity, and degraded connections noticeably worsen the experience. Maintaining real-time presence also costs more compute than turn-based inference. Most production AI systems today remain turn-based for reliability and simplicity, so organizations must weigh genuine interactivity against operational complexity. Smaller or specialized models, such as Moshi, PersonaPlex, or Nemotron VoiceChat, demonstrate that interactivity can be achieved at smaller scale, but they tend to prioritize low latency over scores on intelligence benchmarks.

Long-context sessions remain an active challenge: continuous streams accumulate context quickly and require careful management. Safety calibration also differs in real-time settings, which demand modality-appropriate refusals and robustness across extended conversations. How fast interaction models and deeper background reasoners should coordinate at scale is still being defined. It is also unclear whether interaction-native capability will become commoditized infrastructure or remain a source of competitive differentiation as the field develops.
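To make the context-accumulation problem concrete: at roughly 200 ms per micro-turn, a stream produces about five micro-turns per second, or about 300 per minute. One simple way to bound this, shown here only as a sketch, is a rolling window that keeps the most recent micro-turns and counts what it drops. The class name, the 60-second default, and the eviction policy are hypothetical. A real system would more likely summarize or compress older context than discard it.

```python
from collections import deque


class RollingContext:
    """Keep only the most recent micro-turns within a fixed time window."""

    def __init__(self, window_seconds=60.0, micro_turn_ms=200):
        # Number of micro-turns that fit in the window (300 for the defaults).
        self.max_turns = int(window_seconds * 1000 / micro_turn_ms)
        self.recent = deque()
        self.evicted = 0  # how many old micro-turns have been dropped

    def add(self, micro_turn):
        """Append a micro-turn and evict the oldest ones past the window."""
        self.recent.append(micro_turn)
        while len(self.recent) > self.max_turns:
            self.recent.popleft()
            self.evicted += 1


if __name__ == "__main__":
    context = RollingContext(window_seconds=2.0)  # 10 micro-turns of 200 ms
    for turn_index in range(25):
        context.add(turn_index)
    print(len(context.recent), context.evicted)
```

The `evicted` counter marks where a summarization step could hook in. Whether dropped context should be summarized, stored externally, or simply forgotten is a design decision the text leaves open.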