Multimodal Foundation Models

Models that natively process text, images, video, audio, and code in a single architecture are enabling new categories of applications, from real-time video understanding to robotic control.

Geography: Americas · North America · United States


Multimodal foundation models, capable of processing and generating across text, images, video, audio, and code simultaneously, have become the default architecture for frontier AI. Google's Gemini 3.0, OpenAI's GPT-5, and Anthropic's Claude process mixed inputs natively rather than through bolted-on modules. This enables real-time video understanding, document analysis, and cross-modal reasoning that were impossible with text-only systems.
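
In practice, native multimodality surfaces in APIs that accept interleaved content parts in a single request. Below is a minimal sketch assuming an OpenAI-style chat endpoint; the model name and image URL are illustrative placeholders, not a vendor recommendation.

```python
# A minimal sketch of a mixed-modality request: one message carrying
# both a text part and an image part. Model name and URL are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any multimodal chat model; name is an assumption
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize the chart and flag any anomalies."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same content-parts pattern extends to audio and video inputs where a given model supports them; the key point is that the modalities travel in one request rather than through separate, stitched-together services.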

The practical impact is enormous: these models can watch a manufacturing line and identify defects, analyze medical imaging while reading patient records, or understand a codebase by examining both the code and its documentation. They collapse the boundary between 'seeing' and 'understanding' in ways that unlock automation across industries that deal with visual or audio information.
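
The manufacturing-line case above can be approximated by sampling frames from a camera feed and sending them together as one multimodal prompt. This is a hedged sketch, not a production pipeline: the frame interval, frame cap, file path, and model name are assumptions for illustration.

```python
# Sketch: sample frames from a video file, base64-encode them, and ask a
# multimodal model to inspect them for defects in a single request.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI


def sample_frames(path: str, every_n: int = 60) -> list[str]:
    """Return base64-encoded JPEGs, keeping one frame every `every_n` frames."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf).decode("ascii"))
        i += 1
    cap.release()
    return frames


client = OpenAI()
content = [{"type": "text",
            "text": "These are sequential frames from an assembly line. "
                    "List any visible defects and the frame they appear in."}]
for b64 in sample_frames("line_camera.mp4")[:8]:  # cap request size
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative multimodal model name
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```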

The convergence of modalities also matters for robotics and embodied AI — models that understand both language and visual scenes can translate human instructions into physical actions. The US leads in multimodal model development, though open-source alternatives from China and Europe are narrowing the gap on specific capabilities.

TRL: 8/9 (Deployed)
Impact: 5/5
Investment: 5/5
Category: Software
